
What we are building
Today we are building a model router that looks at a task description and picks the best Oxlo.ai model for the job. Instead of manually comparing context windows and benchmarks, you get a concrete recommendation in under a second. This is useful for teams running agentic workflows or long-context pipelines where model choice directly impacts output quality and cost. Because Oxlo.ai offers 45+ models across 7 categories under a single API key, a simple router lets you exploit that variety without maintaining multiple provider integrations.
What you'll need
- Python 3.10 or newer
- The OpenAI SDK: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
Step 1: Define the model catalog
I started by hard-coding a catalog of the models I actually use in production. Keeping this in a dictionary makes it easy to update when Oxlo.ai adds new checkpoints. I focus on four workhorses that cover most use cases: Llama 3.3 70B for general tasks, Qwen 3 32B for multilingual work, DeepSeek V3.2 for coding, and Kimi K2.6 for reasoning and long context. You can expand this list later with vision or audio models as needed. The key is to keep the descriptions short and actionable so the router has clear decision boundaries.
MODEL_CATALOG = {
    "llama-3.3-70b": "General-purpose flagship. Balanced reasoning, writing, and instruction following.",
    "qwen-3-32b": "Multilingual reasoning and agent workflows. Strong for non-English content.",
    "deepseek-v3.2": "Coding and reasoning. Lightweight and available on the free tier.",
    "kimi-k2.6": "Advanced reasoning, agentic coding, and vision. 131K context window.",
}
Step 2: Write the system prompt
The system prompt is the entire product. I tell the router to act as a classifier, restrict its output to a single model ID, and give it a short decision tree. I also add a fallback rule so it defaults to Llama 3.3 70B when the task is ambiguous. The temperature is set on the API call in Step 3, where I keep it low so the router does not get creative with formatting. If you expand the catalog later, update both the dictionary from Step 1 and the catalog listing in this prompt; beyond that, the rules are the only part you need to tune.
SYSTEM_PROMPT = """You are a model router. Given a user task and a catalog of models, return ONLY the model ID that best fits the task.
Catalog:
- llama-3.3-70b: General-purpose flagship. Balanced reasoning, writing, and instruction following.
- qwen-3-32b: Multilingual reasoning and agent workflows. Strong for non-English content.
- deepseek-v3.2: Coding and reasoning. Lightweight and available on the free tier.
- kimi-k2.6: Advanced reasoning, agentic coding, and vision. 131K context window.
Rules:
1. Return exactly one model ID from the list above.
2. Do not explain your choice.
3. If the task involves code, prefer deepseek-v3.2.
4. If the task is multilingual, prefer qwen-3-32b.
5. If the task requires vision or very long context, prefer kimi-k2.6.
6. For general tasks, default to llama-3.3-70b."""
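Because the catalog appears both in the Step 1 dictionary and in the prompt text, the two can drift apart when you add a model. One way to avoid that is to generate the prompt from the dictionary. The sketch below is a possible variation, not part of the original script: build_system_prompt and RULES are hypothetical names, and the catalog and rule text are copied from above.

```python
# Sketch: derive the router prompt from the catalog so the two never drift.
# build_system_prompt and RULES are illustrative names, not part of any SDK.

MODEL_CATALOG = {
    "llama-3.3-70b": "General-purpose flagship. Balanced reasoning, writing, and instruction following.",
    "qwen-3-32b": "Multilingual reasoning and agent workflows. Strong for non-English content.",
    "deepseek-v3.2": "Coding and reasoning. Lightweight and available on the free tier.",
    "kimi-k2.6": "Advanced reasoning, agentic coding, and vision. 131K context window.",
}

RULES = [
    "Return exactly one model ID from the list above.",
    "Do not explain your choice.",
    "If the task involves code, prefer deepseek-v3.2.",
    "If the task is multilingual, prefer qwen-3-32b.",
    "If the task requires vision or very long context, prefer kimi-k2.6.",
    "For general tasks, default to llama-3.3-70b.",
]


def build_system_prompt(catalog: dict[str, str], rules: list[str]) -> str:
    """Render the catalog and the numbered rules into one prompt string."""
    catalog_lines = "\n".join(f"- {model_id}: {desc}" for model_id, desc in catalog.items())
    rule_lines = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "You are a model router. Given a user task and a catalog of models, "
        "return ONLY the model ID that best fits the task.\n"
        f"Catalog:\n{catalog_lines}\n"
        f"Rules:\n{rule_lines}"
    )


SYSTEM_PROMPT = build_system_prompt(MODEL_CATALOG, RULES)
```

With this in place, adding a model means adding one dictionary entry and, if needed, one rule.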
Step 3: Build the router function
The router function is a standard chat completion call against Oxlo.ai. I point the OpenAI SDK at Oxlo.ai's base URL and use Llama 3.3 70B as the judge. I set max_tokens to 20 because the response should be nothing more than a short string. After the call, I strip whitespace and validate the output against our catalog dictionary. If the model hallucinates an ID, we fall back to the general-purpose flagship rather than crashing the script. This validation step is important in any automated pipeline.
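from openai import OpenAI

# Point the OpenAI SDK at Oxlo.ai's endpoint.
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def select_model(task_description: str) -> str:
    """Ask the router model for a model ID and validate it against the catalog."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",  # the judge
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Task: {task_description}"},
        ],
        temperature=0.1,  # keep the output format predictable
        max_tokens=20,  # a model ID is only a few tokens long
    )
    # content can be None, so guard before stripping whitespace.
    model_id = (response.choices[0].message.content or "").strip()
    # Fall back to the general-purpose flagship on an unknown ID.
    if model_id not in MODEL_CATALOG:
        return "llama-3.3-70b"
    return model_id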
Step 4: Dispatch the task
With the model ID in hand, the dispatcher fires a second request to the same Oxlo.ai endpoint. This is where Oxlo.ai's request-based pricing shines. On token-based providers, a long-context task would spike your bill on the execution call. With Oxlo.ai, the cost is flat per request, so adding a routing layer does not introduce unpredictable token overhead. I keep the second call simple: no system prompt, just the raw user task. If you need streaming, you can set stream=True on this second call and yield chunks directly to the user.
def run_task(task_description: str) -> str:
    chosen_model = select_model(task_description)
    print(f"Routing to: {chosen_model}")
    response = client.chat.completions.create(
        model=chosen_model,
        messages=[
            {"role": "user", "content": task_description},
        ],
    )
    return response.choices[0].message.content
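For the streaming variant mentioned above, a minimal sketch could look like the following. It assumes Oxlo.ai's streamed chunks follow the OpenAI SDK's usual shape (chunk.choices[0].delta.content), which I have not verified against Oxlo.ai itself. stream_task is a hypothetical name, and the client is passed in as a parameter so the function is easy to test.

```python
from collections.abc import Iterator


def stream_task(client, model: str, task_description: str) -> Iterator[str]:
    """Yield text fragments from the chosen model as they arrive."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": task_description}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the final one).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece
```

A caller would pick the model first and then print fragments as they come in, for example: for piece in stream_task(client, select_model(task), task): print(piece, end="", flush=True).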
Run it
I run three tasks that stress different capabilities. The Python refactor should land on DeepSeek V3.2. The Japanese email should trigger Qwen 3 32B. The log summary, which implies a large input, should route to Kimi K2.6 and its 131K context window. When you execute the script, you will see the routing decision printed before the result. This gives you immediate visibility into whether your prompt logic is working.
if __name__ == "__main__":
    tasks = [
        "Refactor this Python function to use list comprehensions.",
        "Write a formal business email in Japanese requesting a meeting.",
        "Analyze this 120K token server log and summarize the top errors.",
    ]
    for task in tasks:
        print(f"\nTask: {task}")
        result = run_task(task)
        print(f"Result snippet: {result[:200]}...")
Example output:
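Task: Refactor this Python function to use list comprehensions.
Routing to: deepseek-v3.2
Result snippet: Here is the refactored function using a list comprehension...

Task: Write a formal business email in Japanese requesting a meeting.
Routing to: qwen-3-32b
Result snippet: 件名: 打ち合わせのお願い...

Task: Analyze this 120K token server log and summarize the top errors.
Routing to: kimi-k2.6
Result snippet: The log analysis reveals three primary error categories...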
Wrap-up and next steps
This pattern scales. You can swap the local dictionary for a SQLite database of model metadata, or add latency tracking so the router avoids busy checkpoints. Two concrete next steps: first, add a confidence score by asking the router to return a top-3 ranked list with brief justifications, then use the second-ranked model if the first times out. Second, cache routing decisions in Redis keyed by task embedding so you skip the routing call entirely on repeated workflows. Because Oxlo.ai charges a flat rate per request, you can experiment with these multi-step patterns without watching token meters spin up on long prompts. If you are currently on a token-based provider, the savings on long-context workloads alone can justify the architecture.
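The first next step, a ranked list with fallback, can be sketched roughly as follows. It assumes the router prompt has been changed to return one model ID per line in rank order, optionally followed by a justification. parse_ranked_ids and run_with_fallback are hypothetical helpers, and call stands in for whatever function actually sends the task to a given model. With the OpenAI SDK you would likely pass openai.APITimeoutError in retry_on; the built-in TimeoutError is used here only to keep the sketch self-contained.

```python
from collections.abc import Callable

CATALOG_IDS = ["llama-3.3-70b", "qwen-3-32b", "deepseek-v3.2", "kimi-k2.6"]


def parse_ranked_ids(raw: str, catalog_ids: list[str], limit: int = 3) -> list[str]:
    """Extract up to `limit` distinct catalog IDs, one per line, in rank order."""
    ranked: list[str] = []
    for line in raw.lower().splitlines():
        # Take the catalog ID that appears earliest on the line, if any.
        hits = [(line.find(model_id), model_id) for model_id in catalog_ids if model_id in line]
        if hits:
            model_id = min(hits)[1]
            if model_id not in ranked:
                ranked.append(model_id)
    return ranked[:limit]


def run_with_fallback(
    ranked: list[str],
    call: Callable[[str], str],
    retry_on: tuple[type[BaseException], ...] = (TimeoutError,),
) -> tuple[str, str]:
    """Try each model in order; move on when `call` raises one of `retry_on`."""
    last_error: BaseException | None = None
    for model_id in ranked:
        try:
            return model_id, call(model_id)
        except retry_on as exc:
            last_error = exc
    raise RuntimeError("all candidate models failed") from last_error
```

Any exception not listed in retry_on still propagates, so real failures such as an invalid API key are not silently retried on another model.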

