
LLM inference is the act of sending a prompt to a trained model and receiving a generated response. Unlike training, which is done up front on a large GPU cluster, inference is the recurring production workload that every user-facing AI feature relies on. With a hosted model, inference in practice means making an HTTP request from your script to the provider, then parsing the reply.
We are going to build a support ticket triage agent that does exactly that. It reads raw customer messages, classifies them by urgency and category, and drafts a polite first reply. This saves small teams from manually sorting through a crowded inbox every morning. We will run it end-to-end on Oxlo.ai using the standard OpenAI SDK, because Oxlo.ai offers a flat per-request price and full compatibility with the code patterns you already know.
What you'll need
- Python 3.10 or newer installed locally
- The OpenAI SDK: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai. The free plan gives you 60 requests per day, which is enough to test and iterate on this script many times over.
Step 1: Set up the Oxlo.ai client
Every inference request starts as an HTTP POST. Oxlo.ai hosts the model weights and GPU workers, so you do not need to download a 40GB weights file or configure CUDA on your laptop. You send text over HTTPS and get text back. Because Oxlo.ai is fully OpenAI-compatible, the Python import stays standard and the client handles retries, JSON parsing, and authentication for you. When you call create(), the SDK serializes your messages list to JSON, sends it to Oxlo.ai, and blocks until the model finishes generation. The response object contains the generated text in the same shape you would get from any OpenAI-compatible provider. I just point the base_url to Oxlo.ai and swap in my project key. Oxlo.ai also loads popular models with no cold starts, so the first request of the day returns just as fast as the hundredth.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
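To make the "every request is an HTTP POST" point concrete, here is a rough sketch of the request the SDK builds for you. It only constructs the request and never sends it. The /chat/completions path and the Bearer header are the usual OpenAI-compatible conventions; I am assuming Oxlo.ai follows them, and build_chat_request is a name I made up for illustration.

```python
import json
import urllib.request

# Assumption: Oxlo.ai serves the standard OpenAI-compatible path under base_url.
BASE_URL = "https://api.oxlo.ai/v1"


def build_chat_request(api_key: str, model: str, messages: list[dict]) -> urllib.request.Request:
    """Build (but do not send) the POST that a chat completion call boils down to."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_chat_request(
    api_key="YOUR_OXLO_API_KEY",
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In the rest of the tutorial the SDK does this serialization for me, which is why the client setup stays at two lines.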
Step 2: Write the system prompt
The system prompt is the agent's job description. It is also the cheapest place to add context. Because Oxlo.ai charges per request rather than per token, I can make the system prompt as long and detailed as necessary, within the model's context window, without increasing the cost. I tell the model exactly what to output so I do not need fragile regex later. I ask for a JSON object with three fields: urgency, category, and draft_reply. I keep the allowed categories restrictive. If the model has too many options, it invents labels that do not match our internal taxonomy. A short enum in the prompt fixes that. Locking the output format to JSON means the rest of my Python code can treat the model like a typed function instead of a chatbot.
SYSTEM_PROMPT = """You are a support triage agent.
Analyze the customer message below.
Return ONLY a JSON object with this exact shape and no markdown formatting:
{
"urgency": "low" | "medium" | "high",
"category": "bug" | "billing" | "question" | "feature_request",
"draft_reply": "string"
}
Be concise. If the message mentions a crash, data loss, or security issue, set urgency to high."""
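The prompt asks for a fixed shape, but nothing forces the model to honor it. If you want the "typed function" idea to hold up, a small check on the parsed reply helps. This is a minimal sketch, not part of the script below; validate_triage and the two constants are hypothetical names, and the allowed values are copied from the prompt above.

```python
# Allowed values mirror the enum in SYSTEM_PROMPT.
ALLOWED_URGENCY = ("low", "medium", "high")
ALLOWED_CATEGORY = ("bug", "billing", "question", "feature_request")


def validate_triage(result: dict) -> dict:
    """Raise ValueError if the parsed reply does not match the prompt's schema."""
    if result.get("urgency") not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency: {result.get('urgency')!r}")
    if result.get("category") not in ALLOWED_CATEGORY:
        raise ValueError(f"unexpected category: {result.get('category')!r}")
    if not isinstance(result.get("draft_reply"), str):
        raise ValueError("draft_reply must be a string")
    return result


checked = validate_triage(
    {"urgency": "high", "category": "bug", "draft_reply": "Thanks for the report."}
)
print(checked["urgency"], checked["category"])
```

Returning the dict unchanged lets you wrap the parse step in one call, for example validate_triage(json.loads(raw)).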
Step 3: Create the triage function
Now I wrap the API call in a small function. It takes the raw ticket text, sends it to Llama 3.3 70B on Oxlo.ai, and parses the JSON response. I use Llama 3.3 70B because it is a strong general-purpose model that follows system instructions reliably. If your tickets arrive in multiple languages, Qwen 3 32B is another good option on Oxlo.ai. Notice that the function signature looks like any other Python utility. That is the goal. Good inference code hides the network call behind a clean interface so the rest of your application does not need to know about tokens or temperature. One practical reality of LLM inference is that models occasionally ignore formatting instructions and wrap JSON in markdown fences. I strip those before parsing so the script does not crash on cosmetic syntax.
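import json

def triage_ticket(ticket_text: str) -> dict:
    """Classify one ticket and return the parsed JSON reply as a dict."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    # Strip a markdown fence if the model added one despite the instructions.
    # partition() avoids an IndexError when the reply has no newline at all.
    if raw.startswith("```"):
        raw = raw.partition("\n")[2].rsplit("```", 1)[0].strip()
    return json.loads(raw)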
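The fence handling is the part most likely to surprise you, so here it is pulled out as a standalone helper you can test without calling the API. This is a sketch under the assumption that the model wraps the whole reply in a single fence; strip_code_fences is a name I introduced for illustration.

```python
import json


def strip_code_fences(raw: str) -> str:
    """Return the reply body without a surrounding markdown code fence."""
    text = raw.strip()
    if not text.startswith("```"):
        return text
    # Drop the opening fence line, which may carry a language tag such as ```json.
    _, _, rest = text.partition("\n")
    rest = rest.rstrip()
    # Drop the closing fence if the model emitted one.
    if rest.endswith("```"):
        rest = rest[:-3]
    return rest.strip()


fenced = '```json\n{"urgency": "low"}\n```'
print(json.loads(strip_code_fences(fenced)))
print(json.loads(strip_code_fences('{"urgency": "high"}')))
```

A reply that is only a fence with no body comes back as an empty string, so json.loads raises a decode error instead of the helper crashing.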
Step 4: Process a batch of tickets
Most support queues arrive in bursts. Here is a short loop that feeds four real-looking tickets to the agent and prints a formatted summary. In production, you would replace this list with rows from your helpdesk API or a webhook payload. Because Oxlo.ai does not suffer from cold starts, running a synchronous loop like this gives you steady latency. If you later need to process hundreds of tickets, you could parallelize with asyncio or a thread pool, but for a daily triage run I prefer the simplicity of a straight loop that is easy to debug. I also limit the printed ticket text to fifty characters so the terminal output stays readable. In a real dashboard, you would link to the full message instead.
tickets = [
    "I was charged twice last month and I need a refund immediately.",
    "How do I reset my password? I cannot find the link.",
    "The app crashes every time I open the export dialog on Windows 11.",
    "Do you have a roadmap for adding SSO support? It is a blocker for our team.",
]

for t in tickets:
    result = triage_ticket(t)
    # Show only the first 50 characters so the terminal stays readable.
    print(f"Ticket: {t[:50]}...")
    print(f" Urgency: {result['urgency']}")
    print(f" Category: {result['category']}")
    print(f" Draft: {result['draft_reply']}")
    print()
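If the queue grows to hundreds of tickets, the thread pool route mentioned above could look roughly like this. It is a sketch, not something I run in the daily job: triage_many is a hypothetical helper, max_workers=4 is an arbitrary choice, and fake_triage stands in for triage_ticket so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def triage_many(
    tickets: list[str],
    triage_fn: Callable[[str], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Run triage_fn over tickets concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input list.
        return list(pool.map(triage_fn, tickets))


def fake_triage(ticket_text: str) -> dict:
    """Stand-in for triage_ticket; returns a fixed classification."""
    return {"urgency": "low", "category": "question", "draft_reply": ticket_text[:20]}


results = triage_many(["first ticket", "second ticket"], fake_triage)
print(results)
```

Note that an exception raised inside any worker surfaces when the results are collected, so the per-ticket error handling from the next step would need to live inside the function you pass in.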
Step 5: Add error handling
Before I run this on real data, I want basic resilience. If the model returns malformed JSON, I want to log the failure and continue with the rest of the queue rather than killing the entire batch. The broad except Exception catches JSON decode errors, missing keys in the parsed reply, and any network or API error the SDK raises. When something breaks, I print the ticket snippet so I can reproduce the failure locally. I also keep the API key as a placeholder constant so it is obvious where to swap in your real credential. Here is the complete, runnable script.
import json
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
SYSTEM_PROMPT = """You are a support triage agent.
Analyze the customer message below.
Return ONLY a JSON object with this exact shape and no markdown formatting:
{
"urgency": "low" | "medium" | "high",
"category": "bug" | "billing" | "question" | "feature_request",
"draft_reply": "string"
}
Be concise. If the message mentions a crash, data loss, or security issue, set urgency to high."""
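def triage_ticket(ticket_text: str) -> dict:
    """Classify one ticket and return the parsed JSON reply as a dict."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ticket_text},
        ],
    )
    raw = response.choices[0].message.content.strip()
    # Strip a markdown fence if the model added one despite the instructions.
    # partition() avoids an IndexError when the reply has no newline at all.
    if raw.startswith("```"):
        raw = raw.partition("\n")[2].rsplit("```", 1)[0].strip()
    return json.loads(raw)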
tickets = [
    "I was charged twice last month and I need a refund immediately.",
    "How do I reset my password? I cannot find the link.",
    "The app crashes every time I open the export dialog on Windows 11.",
    "Do you have a roadmap for adding SSO support? It is a blocker for our team.",
]

for t in tickets:
    try:
        result = triage_ticket(t)
        print(f"Ticket: {t[:50]}...")
        print(f" Urgency: {result['urgency']}")
        print(f" Category: {result['category']}")
        print(f" Draft: {result['draft_reply']}")
    except Exception as e:
        # Log the failing ticket and keep going with the rest of the batch.
        print(f"Failed on ticket: {t[:50]}... Error: {e}")
    print()
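Malformed JSON is often a one-off, so one possible refinement is to retry the same ticket before giving up. The sketch below is an assumption on my part rather than part of the script: safe_triage is a hypothetical wrapper, two attempts is an arbitrary default, and flaky_triage simulates a model that fails once. Keep in mind that each retry is presumably billed as another request.

```python
import json
from typing import Callable, Optional


def safe_triage(
    ticket_text: str,
    triage_fn: Callable[[str], dict],
    attempts: int = 2,
) -> Optional[dict]:
    """Call triage_fn up to `attempts` times; return None if every try fails."""
    for attempt in range(1, attempts + 1):
        try:
            return triage_fn(ticket_text)
        except (json.JSONDecodeError, KeyError) as exc:
            print(f"Attempt {attempt} failed on ticket: {ticket_text[:50]}... Error: {exc}")
    return None


calls = {"count": 0}


def flaky_triage(ticket_text: str) -> dict:
    """Simulated model call that returns unparseable text on the first try."""
    calls["count"] += 1
    if calls["count"] == 1:
        return json.loads("not json")  # raises json.JSONDecodeError
    return {"urgency": "high", "category": "bug", "draft_reply": "We are on it."}


result = safe_triage("The app crashes on export.", flaky_triage)
print(result)
```

Catching only the parse-related exceptions here is deliberate: a network or authentication error is unlikely to fix itself on an immediate retry, so I let those propagate to the outer try/except.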
Run it
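Save the complete script as triage.py. The listing hardcodes a placeholder, so either paste your key over YOUR_OXLO_API_KEY or change api_key to read os.environ["OXLO_API_KEY"] (with import os at the top). The OpenAI SDK does not pick up OXLO_API_KEY on its own. With the second approach, export your Oxlo.ai key to the shell and run it.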
export OXLO_API_KEY="sk-oxlo.ai-..."
python triage.py
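For the exported variable to have any effect, the script has to read it. One way to do that, sketched here with a hypothetical load_api_key helper, is to fail early with a clear message when the variable is missing:

```python
import os


def load_api_key(var_name: str = "OXLO_API_KEY") -> str:
    """Return the key from the environment, failing early if it is not set."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set {var_name} in your shell before running this script.")
    return key


# In triage.py this would replace the hardcoded placeholder, for example:
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

This also keeps the real key out of the source file, which matters once the script lands in version control.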
You should see output similar to this. The draft replies are ready for human review before you send them.
Ticket: I was charged twice last month and I need a refund...
 Urgency: high
 Category: billing
 Draft: We have received your billing inquiry and are reviewing the duplicate charge. A refund will be processed within 24 hours.

Ticket: How do I reset my password? I cannot find the link...
 Urgency: low
 Category: question
 Draft: Here is a direct link to reset your password. Let us know if you run into any issues.

Ticket: The app crashes every time I open the export dialo...
 Urgency: high
 Category: bug
 Draft: Thank you for reporting this crash. We are escalating this to engineering and will update you with a fix timeline shortly.

Ticket: Do you have a roadmap for adding SSO support? It i...
 Urgency: medium
 Category: feature_request
 Draft: Thanks for the feedback. SSO is on our roadmap for Q3. I will add your vote to the feature request and notify you when a beta is available.
Next steps
This agent is stateless, which is fine for a prototype, but production queues need memory. One concrete next step is to store results in a SQLite database. A simple schema with ticket text, urgency, category, draft_reply, and a timestamp lets you track trends and measure how many high-urgency tickets arrive per week. Another concrete upgrade is to wire the script into a Slack incoming webhook. You can post the high-urgency rows to an on-call channel so the team sees critical issues within seconds of the inference call returning.
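Here is a rough sketch of both ideas using only the standard library. The table name, column names, and helper functions are my own illustrative choices, and the example uses an in-memory database so it leaves no file behind. slack_payload only builds the JSON body; Slack incoming webhooks accept a POST with a "text" field, and actually sending it is left out.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS triage_results (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ticket_text TEXT NOT NULL,
    urgency TEXT NOT NULL,
    category TEXT NOT NULL,
    draft_reply TEXT NOT NULL,
    created_at TEXT NOT NULL
)
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_result(conn: sqlite3.Connection, ticket_text: str, result: dict) -> None:
    """Store one triage result with a UTC timestamp."""
    created_at = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    conn.execute(
        "INSERT INTO triage_results "
        "(ticket_text, urgency, category, draft_reply, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (ticket_text, result["urgency"], result["category"],
         result["draft_reply"], created_at),
    )
    conn.commit()


def high_urgency_per_week(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Count high-urgency tickets grouped by year and week number."""
    rows = conn.execute(
        "SELECT strftime('%Y-%W', created_at) AS week, COUNT(*) "
        "FROM triage_results WHERE urgency = 'high' "
        "GROUP BY week ORDER BY week"
    )
    return rows.fetchall()


def slack_payload(ticket_text: str, result: dict) -> dict:
    """Build the JSON body for a Slack incoming webhook message."""
    return {"text": f"[{result['urgency']}] {result['category']}: {ticket_text[:50]}"}


conn = open_db()
save_result(conn, "The app crashes on export.",
            {"urgency": "high", "category": "bug", "draft_reply": "We are on it."})
save_result(conn, "How do I reset my password?",
            {"urgency": "low", "category": "question", "draft_reply": "Here is the link."})
print(high_urgency_per_week(conn))
```

Pass a file path such as "triage.db" to open_db when you want the history to survive between runs.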
If you want to experiment, swap the model string to deepseek-v3.2 or kimi-k2.6 to see how different reasoning styles affect the urgency classification. For tickets that include screenshots, you could switch to a vision-capable model like kimi-k2.6 or gemma-3-27b on Oxlo.ai and pass image URLs in the messages payload. Because Oxlo.ai charges a flat rate per request instead of per token, running long system prompts or processing verbose tickets does not inflate your cost. That makes it straightforward to iterate without watching a meter spin. See the details at https://oxlo.ai/pricing.
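For the screenshot case, the messages payload changes shape: the user content becomes a list of parts instead of a plain string. The sketch below uses the OpenAI-style content parts format and assumes the vision models on Oxlo.ai accept it unchanged. build_vision_messages is a hypothetical helper and the image URL is a placeholder.

```python
def build_vision_messages(
    system_prompt: str,
    ticket_text: str,
    image_urls: list[str],
) -> list[dict]:
    """Build a messages list whose user turn mixes text and image URLs."""
    content = [{"type": "text", "text": ticket_text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]


messages = build_vision_messages(
    "You are a support triage agent.",
    "The export dialog looks broken, see the screenshot.",
    ["https://example.com/screenshot.png"],  # placeholder URL
)
print(messages[1]["content"])
```

The resulting list would be passed as the messages argument to client.chat.completions.create, with the model string switched to a vision-capable model.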

