LLM Explained for Beginners

We're going to build a support ticket triage agent that classifies incoming messages and drafts replies. It is a self-contained project that teaches how large language models handle instructions, generate text, and manage conversation context, all without needing a machine learning background. By the end, you will have a working Python script that you can run against Oxlo.ai's API.

What you'll need

  • Python 3.10 or newer installed locally.
  • The OpenAI SDK, which works as a client for any OpenAI-compatible API, including Oxlo.ai's. Install it with pip install openai.
  • An Oxlo.ai API key from https://portal.oxlo.ai. Oxlo.ai hosts open-source models behind a flat per-request pricing model, so you can iterate on long prompts and multi-turn conversations without the cost scaling on every extra token.
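
The snippets in this tutorial paste the key straight into the script, which is fine for a quick test but easy to leak. A minimal sketch of reading it from an environment variable instead. The variable name OXLO_API_KEY and the helper load_api_key are assumptions made for this example, not something Oxlo.ai prescribes.

```python
import os


def load_api_key(env=None, var_name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `var_name` is a name chosen for this sketch; use whatever your setup defines.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage with an explicit mapping, so the sketch runs without a real key:
print(load_api_key({"OXLO_API_KEY": "demo-key"}))
```

You could then pass api_key=load_api_key() when constructing the client in Step 1.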

Step 1: Make your first API call

An LLM is fundamentally a token prediction engine. You feed it a sequence of messages, and it generates likely next tokens one at a time until it emits a stop token or reaches a length limit. To see this in action, we point the OpenAI SDK at Oxlo.ai and send a single user message. The chat.completions endpoint is the standard interface for this.
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Explain what an LLM is in one sentence."},
    ],
)

print(response.choices[0].message.content)
If you see a coherent answer printed to your terminal, you have confirmed three things: your API key is valid, the client is routing to Oxlo.ai, and the model is generating a sequence of tokens in response to your prompt. There is no local GPU or model download required.
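
To make the "token prediction engine" idea concrete, here is a toy sketch of the generation loop. The lookup table stands in for the model and is entirely made up. A real LLM computes probabilities over a vocabulary of many thousands of tokens with a neural network, and usually samples from them rather than always taking the top choice.

```python
# Toy "model": for each token, the probabilities of the next token.
# The table is invented for illustration only.
NEXT_TOKEN_PROBS = {
    "<start>": {"an": 0.9, "the": 0.1},
    "an": {"llm": 1.0},
    "the": {"llm": 1.0},
    "llm": {"predicts": 0.8, "<stop>": 0.2},
    "predicts": {"tokens": 0.7, "<stop>": 0.3},
    "tokens": {"<stop>": 1.0},
}


def generate(max_tokens=10):
    """Greedy decoding: repeatedly pick the most probable next token."""
    token, output = "<start>", []
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS[token]
        token = max(probs, key=probs.get)
        if token == "<stop>":  # the model "decides to stop"
            break
        output.append(token)
    return output


print(" ".join(generate()))  # prints: an llm predicts tokens
```

The max_tokens argument plays the role of the length limit: generation ends either at the stop token or when the budget runs out, whichever comes first.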

Step 2: Define the system prompt

Each entry in the messages array has a role. The system role is invisible to the end user but shapes every token the model produces. It is where you set the personality, constraints, and output format. For our agent, we need strict rules so the model behaves like a predictable tool rather than a chatty assistant.
SYSTEM_PROMPT = """You are a senior support agent.
1. Classify the user's ticket into one of: Billing, Technical, Account.
2. Draft a polite, concise response of no more than two sentences.
3. Output valid JSON with exactly two keys: "category" and "reply"."""
This prompt is doing heavy lifting. It restricts the model to three possible categories, enforces brevity, and mandates valid JSON. Without these guardrails, the model might ramble or invent categories.
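
A prompt can ask for these rules, but it cannot enforce them, so it helps to check the result in code as well. A minimal sketch, assuming the model's output has already been parsed into a Python object. The names validate_triage and ALLOWED_CATEGORIES are invented for this example.

```python
ALLOWED_CATEGORIES = {"Billing", "Technical", "Account"}


def validate_triage(data):
    """Return a list of problems found; an empty list means the result is valid."""
    if not isinstance(data, dict):
        return ["result is not a JSON object"]
    problems = []
    # The prompt asks for exactly two keys.
    if set(data) != {"category", "reply"}:
        problems.append(f"unexpected keys: {sorted(data)}")
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {category!r}")
    reply = data.get("reply")
    if not isinstance(reply, str) or not reply.strip():
        problems.append("reply is missing or empty")
    return problems


print(validate_triage({"category": "Billing", "reply": "We will refund you."}))
print(validate_triage({"category": "Sales", "reply": ""}))
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that went wrong with a single response.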

Step 3: Build the ticket handler

Now we wrap the API call inside a function that accepts raw ticket text and returns structured Python data. We enable JSON mode so the model knows it must produce valid JSON. This removes the need for fragile regex parsing.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

SYSTEM_PROMPT = """You are a senior support agent.
1. Classify the user's ticket into one of: Billing, Technical, Account.
2. Draft a polite, concise response of no more than two sentences.
3. Output valid JSON with exactly two keys: "category" and "reply"."""

def handle_ticket(ticket_text: str):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
        response_format={"type": "json_object"},
    )
    raw = response.choices[0].message.content
    return json.loads(raw)

# Test it
ticket = "I was charged twice last month and need a refund."
result = handle_ticket(ticket)
print(result)
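The function constructs the messages list fresh on every call. The system prompt sits at index zero, establishing the rules. The user message follows, carrying the actual ticket. When response_format is set to json_object, the model is constrained to emit syntactically valid JSON, so the output can normally be passed straight to Python's json.loads. The exception is a reply that gets cut off, for example by a token limit, which can leave the JSON incomplete. JSON mode only covers syntax, not which keys appear, which is why the system prompt still names the two keys we expect. Oxlo.ai supports this natively across its chat models, so the prompt only has to describe the shape of the JSON rather than plead for valid syntax.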
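
It is still worth guarding the parse step, since a reply that is cut short may not be valid JSON. A minimal sketch with no network calls. The helper parse_model_json and the FALLBACK dict are assumptions for this example rather than part of any SDK.

```python
import json

# Hypothetical fallback used when the model's text cannot be parsed.
FALLBACK = {"category": "Unclassified", "reply": ""}


def parse_model_json(raw):
    """Parse the model's text as a JSON object, or return a copy of FALLBACK."""
    try:
        data = json.loads(raw or "")
    except json.JSONDecodeError:
        return dict(FALLBACK)
    # json.loads also accepts arrays, strings, and numbers; we want an object.
    return data if isinstance(data, dict) else dict(FALLBACK)


print(parse_model_json('{"category": "Billing", "reply": "Refund issued."}'))
print(parse_model_json('{"category": "Billing", "reply": "Refund'))  # truncated
```

Inside handle_ticket you could return parse_model_json(raw) instead of json.loads(raw), and route "Unclassified" tickets to a human.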

Step 4: Stream the response

When you run the handler above, you will notice a delay before the full JSON appears. That happens because the model generates one token at a time, and by default the client waits for the entire sequence to finish. Streaming changes the delivery mechanism: the API returns each chunk as it is produced, letting you print text as it arrives.
def draft_reply_stream(ticket_text: str):
    stream = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": "You are a helpful support agent. Draft a reply."},
            {"role": "user", "content": ticket_text},
        ],
        stream=True,
    )
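    for chunk in stream:
        # Some servers send chunks with an empty choices list; skip those.
        if not chunk.choices:
            continue
        # delta.content is None on chunks that carry no text.
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)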
    print()

draft_reply_stream("My login fails every morning at 9 AM.")
Streaming does not alter the model's reasoning or the final text. It only changes how the response is delivered: as a series of small chunks over an open connection rather than one payload at the end. For a beginner, it is a good way to see that an LLM is not retrieving a pre-written paragraph from a database. It is literally predicting the next token over and over. We used qwen-3-32b here to show that Oxlo.ai exposes multiple models under the same client and endpoint.
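
If you also need the full text once streaming finishes, for example to log it, you can accumulate the pieces as they arrive. The sketch below uses a fake stream built from SimpleNamespace objects that mimic the chunk.choices[0].delta.content shape used above, so it runs without an API key. The names fake_stream and collect_stream are invented for this example.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Yield objects shaped like streaming chunks (a stand-in for the real API)."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(stream):
    """Print each piece as it arrives and return the concatenated text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece:  # content can be None on chunks without text
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


full_text = collect_stream(fake_stream(["Sorry ", "about ", None, "that."]))
```

With the real client, you would pass the object returned by client.chat.completions.create(..., stream=True) to collect_stream instead of the fake stream.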

Step 5: Give the agent memory

A common misconception is that LLMs remember your previous requests. They do not. Each API call is stateless. If you want the model to know that a user just asked about a password reset before asking about email updates, you must maintain the history yourself and resend it in the messages array. This list of past turns is the conversation history, and it has to fit inside the model's context window, which is the maximum number of tokens the model can read in a single request.
history = [
    {"role": "system", "content": SYSTEM_PROMPT},
]

def handle_ticket_with_memory(ticket_text: str):
    history.append({"role": "user", "content": ticket_text})
    
    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=history,
        response_format={"type": "json_object"},
    )
    
    assistant_msg = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_msg})
    return json.loads(assistant_msg)

# First ticket
print(handle_ticket_with_memory("I forgot my password."))

# Follow-up that references the previous message
print(handle_ticket_with_memory("Actually, I also cannot update my email."))
In this pattern, history acts as a simple memory buffer. Every user message and every assistant reply is appended to the list before the next call. The model then uses that full sequence as its working memory. On token-based platforms, every extra word in that history increases your bill. Oxlo.ai uses request-based pricing, so adding conversation history to the prompt does not change what you pay. That makes it practical to build stateful agents with long context without watching a token meter.
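
Because the whole history is resent on every call, it keeps growing, and every model has a maximum input length regardless of how the request is priced. A minimal sketch of one way to cap it: keep the system message plus the most recent turns. The function trim_history and its max_messages limit are assumptions for this example, and a production version would count tokens rather than messages.

```python
def trim_history(history, max_messages=6):
    """Keep all system messages plus the last `max_messages` other messages."""
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + rest[-max_messages:]


# Build a fake conversation of five user/assistant exchanges.
history = [{"role": "system", "content": "You are a support agent."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

Using an even max_messages keeps user and assistant turns paired, so the trimmed history still starts with a user message after the system prompt.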

Run it

Here is the complete script. Save it as support_agent.py, replace YOUR_OXLO_API_KEY, and run python support_agent.py.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

SYSTEM_PROMPT = """You are a senior support agent.
1. Classify the user's ticket into one of: Billing, Technical, Account.
2. Draft a polite, concise response of no more than two sentences.
3. Output valid JSON with exactly two keys: "category" and "reply"."""

def handle_ticket(ticket_text: str):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    tickets = [
        "I was charged twice last month and need a refund.",
        "The API returns a 500 error when I post to /v1/upload.",
        "I want to close my account permanently.",
    ]
    for t in tickets:
        result = handle_ticket(t)
        print(f"Ticket: {t}")
        print(f"Category: {result['category']}")
        print(f"Reply: {result['reply']}\n")
Example output:
Ticket: I was charged twice last month and need a refund.
Category: Billing
Reply: I have flagged the duplicate charge for review. You should see the refund within 3 to 5 business days.

Ticket: The API returns a 500 error when I post to /v1/upload.
Category: Technical
Reply: A 500 error indicates a server-side issue. Please retry with the request ID so we can trace the exact failure.

Ticket: I want to close my account permanently.
Category: Account
Reply: I can help you close your account. Please confirm your user ID so I can proceed securely.

Next steps

Swap llama-3.3-70b for kimi-k2.6 if you need stronger reasoning on ambiguous tickets, or try deepseek-r1-671b when you want explicit chain-of-thought reasoning before the final answer. If you plan to run this against a real support queue, evaluate Oxlo.ai's pricing at https://oxlo.ai/pricing. The flat per-request structure keeps costs predictable even when you stuff the prompt with long documentation pages or full conversation threads.
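
Before swapping models, it helps to have a small labeled set to score them against. A minimal sketch of such a check, using the three tickets and categories from the example output above. The accuracy helper is invented for this example, and keyword_stub is a stand-in classifier that only exists so the sketch runs offline.

```python
def accuracy(classify, labeled_tickets):
    """Fraction of tickets whose predicted category matches the expected one."""
    if not labeled_tickets:
        return 0.0
    correct = sum(
        1 for text, expected in labeled_tickets if classify(text) == expected
    )
    return correct / len(labeled_tickets)


def keyword_stub(text):
    """Stand-in classifier so the sketch runs offline; not a real model."""
    lowered = text.lower()
    if "charged" in lowered or "refund" in lowered:
        return "Billing"
    if "error" in lowered:
        return "Technical"
    return "Account"


LABELED = [
    ("I was charged twice last month and need a refund.", "Billing"),
    ("The API returns a 500 error when I post to /v1/upload.", "Technical"),
    ("I want to close my account permanently.", "Account"),
]

print(accuracy(keyword_stub, LABELED))  # prints: 1.0
```

To score a real model, pass a function such as lambda t: handle_ticket(t)["category"] in place of keyword_stub, and grow LABELED with tickets from your own queue.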
