Zero-Shot Learning with LLMs: Opportunities and Challenges

Unlike traditional ML, which typically needs hundreds of labeled examples per class, a zero-shot LLM classifier reasons from category descriptions alone. That makes it a good fit for early-stage products where ticket volume is low and categories change from month to month. In this tutorial I build a zero-shot support ticket classifier that reads raw customer messages and assigns labels without any fine-tuning. I will walk through the exact code I run in production, using Oxlo.ai's flat per-request pricing so long tickets do not inflate costs.

What you'll need

Python 3.10 or newer
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai
An Oxlo.ai account (the free tier includes 60 requests per day, enough to prototype this classifier)

Step 1: Define the label space

I reached for zero-shot classification because my ticket volume is low and my categories change often. Building a training set would take months, and retraining after every taxonomy change is overkill. In zero-shot classification, the only signal the model receives is the description of each category. I keep these descriptions in a dictionary so my application logic has a single source of truth for the label set. The system prompt in Step 2 repeats the same descriptions, so when I add a new label I update both and keep the wording identical. I also import json at the top because I will parse structured outputs later. The OpenAI SDK is the only dependency because Oxlo.ai exposes an OpenAI-compatible endpoint.

import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

CATEGORIES = {
    "billing": "Payment failures, invoice questions, or refund requests.",
    "technical": "Bug reports, integration errors, or API outages.",
    "account": "Login issues, password resets, or role changes.",
    "general": "Feature requests, feedback, or anything else."
}
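
Because the descriptions live in the dictionary and again in the prompt text, one way to avoid typing them twice is to render the bullet list from the dictionary. The sketch below assumes a hypothetical helper named render_categories; it is not part of the original script, and the output format simply mirrors the "- label: description" lines used in the prompt.

```python
# Hypothetical helper: render the category bullet list from the dictionary
# so the prompt text cannot drift from the application logic.
CATEGORIES = {
    "billing": "Payment failures, invoice questions, or refund requests.",
    "technical": "Bug reports, integration errors, or API outages.",
    "account": "Login issues, password resets, or role changes.",
    "general": "Feature requests, feedback, or anything else.",
}


def render_categories(categories: dict[str, str]) -> str:
    """Return one '- label: description' line per category, in insertion order."""
    return "\n".join(f"- {label}: {desc}" for label, desc in categories.items())


if __name__ == "__main__":
    print(render_categories(CATEGORIES))
```

The rendered string could then be interpolated into the system prompt in place of the hand-written list, so a new label only needs to be added to the dictionary.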

Step 2: Write the system prompt

The system prompt carries the entire burden of instruction. I explicitly tell the model it has never seen training examples, which reinforces that it must rely on the category definitions. I list the categories before the format instruction because, in my experience, the model follows the task better when the definitions come before the output schema. I ask for raw JSON without markdown fences because I parse the output programmatically. I avoid asking for plain text like "The label is billing" because parsing unstructured sentences breaks when the model changes phrasing. JSON is brittle in a good way. If the output is not valid JSON, json.loads raises an exception and my script fails fast. Note that json.loads does not check key names, so a missing or misspelled key only surfaces when the code reads it. The confidence score is the model's own self-evaluation, a rough signal rather than a calibrated probability, that lets me decide whether to trust the prediction.

SYSTEM_PROMPT = """
You are a support ticket classifier. You have never seen training examples for these categories. Your job is to read the user's message and assign the most relevant label from the allowed list.

Allowed categories and their meanings:
- billing: Payment failures, invoice questions, or refund requests.
- technical: Bug reports, integration errors, or API outages.
- account: Login issues, password resets, or role changes.
- general: Feature requests, feedback, or anything else.

Respond ONLY with a JSON object in this exact format:
{
  "label": "<one of: billing, technical, account, general>",
  "confidence": <number between 0 and 1>,
  "reasoning": "<one short sentence>"
}

Do not include markdown code fences. Output raw JSON only.
"""
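
Parsing with json.loads only guarantees syntactically valid JSON; it says nothing about which keys are present or what values they hold. A minimal sketch of a stricter check follows. The helper name parse_and_validate and the ALLOWED_LABELS set are assumptions introduced here for illustration, and the 0 to 1 range for confidence is inferred from the threshold used later in this tutorial.

```python
import json

# Assumed to mirror the four categories listed in the system prompt.
ALLOWED_LABELS = {"billing", "technical", "account", "general"}


def parse_and_validate(raw: str) -> dict:
    """Parse the model's reply and check its shape (hypothetical helper)."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    label = data.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return data
```

Since json.JSONDecodeError is a subclass of ValueError, a caller can catch ValueError once to handle both malformed JSON and a wrong shape.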

Step 3: Build the classifier function

I use Llama 3.3 70B because it handles instruction following and JSON mode consistently. If your tickets arrive in multiple languages, Qwen 3 32B or Kimi K2.6 are also strong options on Oxlo.ai. The key detail is the base URL. Pointing the OpenAI SDK at https://api.oxlo.ai/v1 is the only change needed to run this on Oxlo.ai. I set response_format to json_object, which saves me from regex-parsing markdown blocks. I keep temperature at 0.1 because I want the same ticket to get the same label on every run, and I avoid streaming because the response is small. Notice that I do not set max_tokens. For classification, the output is short, and the default limit on Oxlo.ai is sufficient. If you are classifying extremely long ticket threads, you might want to truncate or summarize the input to stay within the model's context window, though Oxlo.ai's per-request pricing means you do not pay extra for sending the full thread.

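def classify_ticket(text: str) -> dict:
    """Send one ticket to the model and return the parsed JSON reply."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify this support ticket:\n\n{text}"},
        ],
        # JSON mode: the reply should be a single JSON object, no markdown.
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    raw = response.choices[0].message.content
    # Fails fast with json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(raw)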
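
For very long threads, one simple truncation strategy is to keep the beginning and the end of the text and drop the middle. The sketch below is an illustration under stated assumptions, not part of the original script: the helper name truncate_middle, the 4000-character default, and the marker string are all hypothetical, and the limit is counted in characters rather than tokens. Keeping both ends assumes the opening message states the problem and the latest reply carries the current status, which may not hold for every queue.

```python
def truncate_middle(text: str, max_chars: int = 4000, marker: str = "\n[...]\n") -> str:
    """Keep the start and end of a long ticket and drop the middle."""
    if len(text) <= max_chars:
        return text

    keep = max_chars - len(marker)
    if keep <= 0:
        # The budget is too small to fit the marker; fall back to a plain cut.
        return text[:max_chars]

    head = keep // 2 + keep % 2  # the head gets the extra character if keep is odd
    tail = keep - head
    # text[-0:] would return the whole string, so handle tail == 0 explicitly.
    return text[:head] + marker + (text[-tail:] if tail else "")
```

A call such as classify_ticket(truncate_middle(thread)) would then cap the input size before the request is sent.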

Step 4: Add confidence gating

In production, I do not blindly trust every prediction. I add a confidence threshold and route low-confidence items to a human queue. A threshold of 0.75 is my starting point. In practice, I tune this by reviewing the human_review queue weekly. If too many tickets land there, I lower the threshold or improve the category descriptions. The human_review label is a first-class citizen in my routing logic. It triggers a Slack message instead of auto-assignment. This keeps me honest: an uncertain prediction goes to a person instead of being forced into a category. I also read the confidence with .get() so a response that omits the field does not crash the loop. The batch list below mixes clear and borderline cases so you can see how the gate behaves.

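def classify_with_fallback(text: str, threshold: float = 0.75) -> dict:
    result = classify_ticket(text)
    # Read the score once; a missing or non-numeric value counts as 0.0.
    confidence = result.get("confidence", 0.0)
    if not isinstance(confidence, (int, float)):
        confidence = 0.0
    if confidence < threshold:
        # Below the gate: hand the ticket to a human instead of auto-assigning.
        result["label"] = "human_review"
        result["reasoning"] = (
            f"Confidence {confidence} below threshold {threshold}."
        )
    return result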

tickets = [
    "I was charged twice for my Pro plan this month. Can I get a refund?",
    "The webhook endpoint returns a 500 error whenever a payment succeeds.",
    "I forgot my password and the reset email never arrives.",
    "Your platform changed my life. Please add dark mode.",
]
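
When tuning the threshold, it can help to see how many tickets each candidate value would send to the human queue. The sketch below assumes you have logged the confidence of past predictions; the helper name review_rate and the sample values are illustrative only and are not real measurements.

```python
def review_rate(confidences: list[float], threshold: float) -> float:
    """Fraction of predictions that would fall below the threshold."""
    if not confidences:
        return 0.0
    below = sum(1 for c in confidences if c < threshold)
    return below / len(confidences)


if __name__ == "__main__":
    # Illustrative confidence values, standing in for a week of logged scores.
    logged = [0.94, 0.91, 0.89, 0.82, 0.71, 0.64]
    for threshold in (0.65, 0.75, 0.85):
        rate = review_rate(logged, threshold)
        print(f"threshold {threshold:.2f}: {rate:.0%} routed to human review")
```

This only shows the volume side of the trade-off. Whether the gated tickets were actually misclassified still has to come from reviewing the queue.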

Run it

I run this from the command line with python classify.py. I limit the printed input to 60 characters so the terminal stays readable. The first call warms up the connection, and subsequent calls feel instant. Oxlo.ai does not cold-start popular models, so I do not see the latency spikes I used to get on serverless endpoints. Your exact confidence numbers may vary slightly due to sampling, but the labels should be stable at temperature 0.1. If you see inconsistent labels, raise the threshold or tighten the category descriptions.

if __name__ == "__main__":
    for ticket in tickets:
        print("INPUT:", ticket[:60] + "...")
        out = classify_with_fallback(ticket)
        print("OUTPUT:", json.dumps(out, indent=2))
        print()

When I run this against Oxlo.ai, the output looks roughly like this:

INPUT: I was charged twice for my Pro plan this month. Can I get ...
OUTPUT: {
  "label": "billing",
  "confidence": 0.94,
  "reasoning": "The user explicitly mentions a duplicate charge and requests a refund."
}

INPUT: The webhook endpoint returns a 500 error whenever a payment...
OUTPUT: {
  "label": "technical",
  "confidence": 0.91,
  "reasoning": "The user describes an API endpoint returning a server error code."
}

INPUT: I forgot my password and the reset email never arrives...
OUTPUT: {
  "label": "account",
  "confidence": 0.89,
  "reasoning": "The user cannot log in and is not receiving password reset emails."
}

INPUT: Your platform changed my life. Please add dark mode...
OUTPUT: {
  "label": "general",
  "confidence": 0.82,
  "reasoning": "The user provides positive feedback and requests a new feature."
}

Wrap-up and next steps

Zero-shot learning removes the data collection bottleneck, but it is not magic. The model can only distinguish categories that are clearly separable by natural language descriptions. If your taxonomy is nuanced, you will eventually need few-shot examples or a retrieval step. Oxlo.ai fits this workflow well because the flat per-request model makes experimentation cheap. You can iterate on prompts, add examples, and process long context without watching metered tokens accumulate. With token-based providers, the input tokens of a long ticket thread can cost more than the short classification output itself, while Oxlo.ai keeps the cost per ticket predictable. This matters when you start attaching previous email chains or log files to the input.

Two concrete next steps. First, pipe this into your email or Slack ingestion queue and use Oxlo.ai's streaming responses to classify tickets in real time as they arrive. Second, once you accumulate a few hundred verified labels, append them as few-shot examples inside the system prompt. You still avoid fine-tuning, and you still benefit from flat per-request pricing. You can explore Oxlo.ai's pricing and tiers at https://oxlo.ai/pricing.
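
As a rough sketch of the second step, verified tickets can be rendered into a text block that is appended to the system prompt. The helper name render_few_shot and the "Ticket: ... -> Label: ..." layout are assumptions made for illustration; any consistent format should work.

```python
import json


def render_few_shot(examples: list[tuple[str, str]]) -> str:
    """Format (ticket_text, label) pairs as a block to append to the system prompt."""
    lines = ["Examples of correctly labeled tickets:"]
    for text, label in examples:
        # json.dumps escapes quotes and newlines, so each example stays on one line.
        lines.append(f"Ticket: {json.dumps(text)} -> Label: {label}")
    return "\n".join(lines)


if __name__ == "__main__":
    verified = [
        ("I was charged twice for my Pro plan this month.", "billing"),
        ("I forgot my password and the reset email never arrives.", "account"),
    ]
    print(render_few_shot(verified))
```

The result could be concatenated as SYSTEM_PROMPT + "\n" + render_few_shot(verified). Starting with a handful of examples per category and growing from there keeps the prompt easy to review.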
