Fine-Tuning LLM Models: A Step-by-Step Guide

Most teams who think they need fine-tuning actually need consistent reasoning on a narrow domain. In this guide we will build a support ticket triage agent that classifies urgency and drafts a first response, using long few-shot prompts on Oxlo.ai. Because Oxlo.ai charges a flat rate per request instead of per token, we can embed dozens of examples in the context window and still pay the same price as a single-line prompt, which makes iterative prompt engineering cheaper than managing a training pipeline.

What you'll need

Step 1: Configure the Oxlo.ai client

I start every project with a thin wrapper around the Oxlo.ai client. Pointing the OpenAI SDK at Oxlo.ai only requires changing base_url, so existing code carries over with minimal changes.

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"  # get yours at https://portal.oxlo.ai
)
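
Hard-coding the key is fine for a quick test, but reading it from the environment keeps it out of version control. The helper below is a minimal sketch, not part of the original script: the variable name OXLO_API_KEY and the function load_api_key are hypothetical names chosen for illustration.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message.

    The variable name OXLO_API_KEY is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, the client constructor could take api_key=load_api_key() instead of the literal placeholder string.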

Step 2: Build the few-shot example bank

I collected four representative support tickets and the exact JSON I expect for each. These replace the need for a fine-tuned model by anchoring the output format and tone. On Oxlo.ai, adding this context does not raise the per-request price, so I do not have to trade example depth for cost.

FEW_SHOT_EXAMPLES = [
    {
        "ticket": "Subject: Login error\nI cannot access my dashboard since this morning. I get a 403 every time.",
        "output": json.dumps({
            "urgency": "high",
            "category": "auth",
            "response": "I am sorry you are locked out. I have reset your session. Please clear cookies, log in again, and let me know if the 403 persists."
        })
    },
    {
        "ticket": "Subject: Invoice\nCan I get a PDF of last month's invoice?",
        "output": json.dumps({
            "urgency": "low",
            "category": "billing",
            "response": "Attached is your invoice for last month. Let me know if you need anything else."
        })
    },
    {
        "ticket": "Subject: API latency\nOur integration calls are taking 8 seconds today. Is there an outage?",
        "output": json.dumps({
            "urgency": "high",
            "category": "technical",
            "response": "We detected elevated latency in the US-East region. The issue is resolved as of 14:30 UTC. Please retry and confirm."
        })
    },
    {
        "ticket": "Subject: Team seats\nHow do I add two more seats to our Pro plan?",
        "output": json.dumps({
            "urgency": "low",
            "category": "billing",
            "response": "You can add seats from Settings > Billing. I have sent a direct link to your admin email."
        })
    }
]
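
As the bank grows, it is easy to introduce a malformed output or to leave a label without any example. The check below is a rough sketch of how that could be caught early. The allowed values are copied from the system prompt in Step 3, and check_example_bank is a hypothetical helper, not part of the original script.

```python
import json

# Allowed labels, copied from the system prompt in Step 3.
ALLOWED = {
    "urgency": {"low", "medium", "high"},
    "category": {"auth", "billing", "technical", "general"},
}


def check_example_bank(examples):
    """Return (problems, missing): format problems and labels no example covers."""
    problems = []
    seen = {field: set() for field in ALLOWED}
    for i, ex in enumerate(examples):
        try:
            out = json.loads(ex["output"])
        except (KeyError, TypeError, json.JSONDecodeError) as err:
            problems.append(f"example {i}: output is not valid JSON ({err})")
            continue
        if not isinstance(out, dict):
            problems.append(f"example {i}: output is not a JSON object")
            continue
        for field, allowed in ALLOWED.items():
            value = out.get(field)
            if isinstance(value, str) and value in allowed:
                seen[field].add(value)
            else:
                problems.append(f"example {i}: unexpected {field} {value!r}")
    # Labels that appear in the schema but in none of the examples.
    missing = {field: sorted(ALLOWED[field] - seen[field]) for field in ALLOWED}
    return problems, missing
```

Run against the FEW_SHOT_EXAMPLES above, this would flag that no example demonstrates "medium" urgency or the "general" category, which are natural candidates for the next examples to add.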

Step 3: Write the system prompt

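The system prompt instructs the model to return only JSON and sets the persona. This is a prompt-level instruction rather than an API-level JSON mode, so the output still needs to be parsed defensively. Keeping the prompt in a dedicated constant makes A/B testing easy when you iterate on tone or schema.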

SYSTEM_PROMPT = """You are a senior support triage agent.
For every incoming ticket, produce a single JSON object with exactly these keys:
- urgency: either "low", "medium", or "high"
- category: one of "auth", "billing", "technical", or "general"
- response: a polite, concise first response under 50 words

Follow the format and tone shown in the examples. Output only valid JSON."""
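
A prompt-level instruction does not guarantee that every reply matches the schema, so it can help to check the parsed result before it reaches a real queue. This is a minimal sketch, assuming the keys and allowed values listed in the prompt above. validate_triage and the constant names are hypothetical, not part of the original script.

```python
URGENCY = ("low", "medium", "high")
CATEGORY = ("auth", "billing", "technical", "general")
REQUIRED_KEYS = {"urgency", "category", "response"}


def validate_triage(result):
    """Return a list of schema violations; an empty list means the result is valid."""
    if not isinstance(result, dict):
        return ["result is not a JSON object"]
    errors = []
    if set(result) != REQUIRED_KEYS:
        errors.append(f"expected keys {sorted(REQUIRED_KEYS)}, got {sorted(result)}")
    if result.get("urgency") not in URGENCY:
        errors.append(f"urgency must be one of {URGENCY}")
    if result.get("category") not in CATEGORY:
        errors.append(f"category must be one of {CATEGORY}")
    response = result.get("response")
    if not isinstance(response, str) or not response.strip():
        errors.append("response must be a non-empty string")
    elif len(response.split()) >= 50:
        # The prompt asks for a response under 50 words.
        errors.append("response must be under 50 words")
    return errors
```

An empty list means the reply can be used as-is. Anything else is a signal to retry the request or route the ticket to a human.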

Step 4: Create the inference function

This helper concatenates the few-shot examples, appends the new ticket, and calls Llama 3.3 70B through Oxlo.ai. I use Llama 3.3 70B because it follows long structured prompts accurately, and there are no cold starts, so the first call of the day is just as fast as the hundredth.

def triage_ticket(ticket_text: str) -> dict:
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Ticket:\n{ex['ticket']}\nOutput:\n{ex['output']}")
    parts.append(f"Ticket:\n{ticket_text}\nOutput:")
    user_message = "\n\n".join(parts)

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )
    raw = response.choices[0].message.content
    return json.loads(raw)
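
Calling json.loads directly on the reply raises an exception if the model wraps the JSON in a Markdown fence or adds a sentence around it. A more forgiving parser might look like the sketch below. parse_json_reply is a hypothetical helper, and the fallback of taking the outermost braces is one simple heuristic among several.

```python
import json


def parse_json_reply(raw: str) -> dict:
    """Parse a model reply as a JSON object, tolerating code fences and stray text."""
    text = raw.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, which covers replies wrapped in
        # Markdown fences or surrounded by a short explanation.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError(f"no JSON object found in reply: {text[:80]!r}")
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("reply is valid JSON but not an object")
    return parsed
```

In triage_ticket, the last line could then become return parse_json_reply(raw), with the caller catching ValueError to retry or log the raw reply.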

Step 5: Batch test the agent

I run three synthetic tickets through the agent to check classification accuracy and response tone before hooking it into a real queue.

if __name__ == "__main__":
    test_tickets = [
        "Subject: Database down\nOur API calls are timing out and the status page shows red.",
        "Subject: Feature request\nCan you add dark mode to the dashboard?",
        "Subject: Overcharge\nI was billed twice for the Pro plan this month.",
    ]

    for t in test_tickets:
        result = triage_ticket(t)
        print(json.dumps(result, indent=2))
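
Printing the results is enough for a first look, but checking accuracy needs expected labels to compare against. The sketch below shows one way to do that. score_triage is a hypothetical helper, the expected labels are ones you would write by hand, and the stand-in classifier exists only so the example runs offline.

```python
def score_triage(classify, labeled_tickets):
    """Run classify on each ticket and return (accuracy, mismatches).

    classify is any callable that maps ticket text to a dict with "urgency"
    and "category" keys, for example the triage_ticket function above.
    """
    mismatches = []
    for ticket, expected in labeled_tickets:
        got = classify(ticket)
        # Compare only the fields that have an expected label.
        predicted = {key: got.get(key) for key in expected}
        if predicted != expected:
            mismatches.append((ticket, expected, predicted))
    total = len(labeled_tickets)
    accuracy = (total - len(mismatches)) / total if total else 0.0
    return accuracy, mismatches


def fake_classify(ticket):
    """Stand-in classifier for this sketch; swap in triage_ticket for a real run."""
    urgent = "down" in ticket.lower()
    return {"urgency": "high" if urgent else "low", "category": "technical"}


# Hand-written expected labels, for illustration only.
labeled = [
    ("Subject: Database down", {"urgency": "high", "category": "technical"}),
    ("Subject: Feature request", {"urgency": "low", "category": "general"}),
]
accuracy, mismatches = score_triage(fake_classify, labeled)
print(f"accuracy: {accuracy:.0%}")
for ticket, expected, predicted in mismatches:
    print(f"{ticket!r}: expected {expected}, got {predicted}")
```

The mismatch list is usually more useful than the accuracy number, since each miss points to a ticket type that may need its own few-shot example.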

Run it

Save the script as triage.py, replace the YOUR_OXLO_API_KEY placeholder in the client constructor with your own key, and run python triage.py. The script prints one JSON object per test ticket. Because there are no cold starts on Llama 3.3 70B, the first request should come back without a warm-up delay.

Example output:

{
  "urgency": "high",
  "category": "technical",
  "response": "I have escalated this to our infrastructure team. We will update the status page within 15 minutes."
}
{
  "urgency": "low",
  "category": "general",
  "response": "Thanks for the suggestion. I have added dark mode to our public roadmap."
}
{
  "urgency": "medium",
  "category": "billing",
  "response": "I see the duplicate charge. I have issued a refund; it should appear in 3 to 5 business days."
}

Wrap up and next steps

This pattern scales well. Because Oxlo.ai uses request-based pricing, you can grow the few-shot bank to twenty or thirty examples without increasing cost per call. That makes it practical to iterate on behavior daily instead of waiting for a training run to finish. You can even test the workflow on the Oxlo.ai free tier before committing. See https://oxlo.ai/pricing for current plan details.

Two concrete next steps: wire the agent into your CRM via Oxlo.ai's function calling support so it can look up order status in real time, or swap in Qwen 3 32B if you need the same logic in Mandarin or Spanish without rewriting the prompt.
