
We are going to build a citation-aware research agent that accepts a topic, calls a search tool, and returns a structured markdown report. It is useful for analysts, developers, and product teams who need repeatable briefs without manual tab management. The entire pipeline runs against Oxlo.ai through the standard OpenAI SDK.
What you'll need
- Python 3.10 or newer
- The OpenAI SDK installed: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
Because Oxlo.ai exposes a standard OpenAI-compatible endpoint, you can use the official Python SDK rather than a custom client, and existing agent tutorials and framework code work with minimal changes. I will use llama-3.3-70b because it handles multi-turn tool use reliably, but you can drop in qwen-3-32b if you want stronger agent routing, kimi-k2.6 for long-context reasoning, or deepseek-v3.2 if you are experimenting on the free tier. Oxlo.ai uses flat per-request pricing, so stuffing prior search results back into the context window does not inflate your bill the way it would with a token-metered provider.
Step 1: Configure the Oxlo.ai client
Create a new file named research_agent.py and set up the client. Reading the key from an environment variable keeps credentials out of your source tree. Oxlo.ai advertises no cold starts on its popular models, so the first request after idle time should return without a warm-up delay. I set the model string once so it is easy to swap later.
from openai import OpenAI
import json
import os
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY")
)
MODEL = "llama-3.3-70b"
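One caveat with os.environ.get: it returns None when OXLO_API_KEY is unset, and as far as I know the OpenAI SDK then falls back to OPENAI_API_KEY or raises an error that names that variable instead. A minimal sketch of a fail-fast check follows; require_env is a hypothetical helper name, not part of the SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper, not part of the OpenAI SDK.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running the agent."
        )
    return value
```

With that helper in place, the client could be constructed with api_key=require_env("OXLO_API_KEY") so a missing key is reported before any request is sent.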
Step 2: Write the system prompt
The system prompt is the contract between you and the model. I make the rules explicit: ask for search when facts are missing, cite every claim, and format the output as markdown. This reduces hallucination and keeps the report skimmable. I treat the prompt as configuration. Keeping it in a top-level constant makes it easier to iterate without hunting through function bodies. The citation rule is the most important one. It forces the model to stay grounded in the snippets we return rather than relying on parametric knowledge.
SYSTEM_PROMPT = """You are a research assistant. Your job is to produce a structured markdown report on the user's topic.
Rules:
1. If you need external facts, call the search tool with a specific query.
2. After receiving search results, synthesize them into a coherent report.
3. Every claim must cite its source using [Source: title].
4. Output the final report in markdown with headings and bullet points.
5. Do not make up information. If search results are insufficient, say so."""
Step 3: Build a mock search tool
In production, this function would hit SerpAPI, Bing, or an internal knowledge base. For the tutorial, mock_search returns hardcoded snippets so the agent is fully runnable without extra API keys. We also define the function schema in OpenAI format so the model knows when to invoke it. The name, description, and parameter definitions are all the model sees when it decides whether to call the tool. A clear description improves routing accuracy.
def mock_search(query: str) -> list:
    """Simulate a web search returning snippets."""
    db = {
        "RAG enterprise search": [
            {"title": "RAG Patterns 2024", "snippet": "Hybrid dense-sparse retrieval improves recall over pure vector search in enterprise settings."},
            {"title": "Enterprise AI Report", "snippet": "Re-ranking retrieved chunks with a cross-encoder before LLM generation reduces hallucination rates."}
        ],
        "vector database comparison": [
            {"title": "Vector DB Benchmark", "snippet": "Partitioning strategies in pgvector and Pinecone show similar latency at sub-million scale."}
        ]
    }
    # Compare in lowercase on both sides; a key matches when all of its words appear in the query.
    query_words = set(query.lower().split())
    for key, snippets in db.items():
        if set(key.lower().split()) <= query_words:
            return snippets
    return [{"title": "General", "snippet": "No direct match. RAG typically combines retrieval and generation to ground LLM outputs."}]
tools = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Search the web for factual information.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query."
                    }
                },
                "required": ["query"]
            }
        }
    }
]
Step 4: Implement the tool loop
The agent works by sending the topic to the model and waiting. If the response contains a tool call, we execute mock_search, append the results as a tool message, and ask the model to continue. This loop repeats until the model returns the final report. The message list accumulates state. Each loop adds either an assistant message with tool_calls or a tool message with results. This conversation history is what lets the model reason about what it already knows and what it still needs to find. When the assistant stops emitting tool_calls, we know the report is ready. Because Oxlo.ai does not charge by the token, you can feed lengthy search results back into the conversation without watching metered costs climb.
def run_research(topic: str, max_turns: int = 5) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Research this topic and produce a report: {topic}"}
    ]
    # Bound the number of round trips so a model that keeps calling tools cannot loop forever.
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        message = response.choices[0].message
        if not message.tool_calls:
            # No tool calls means the model has produced the final report.
            return message.content or "No report generated."
        # Echo the assistant turn back into the history, including its tool calls.
        tool_calls_payload = []
        for tc in message.tool_calls:
            tool_calls_payload.append({
                "id": tc.id,
                "type": tc.type,
                "function": {
                    "name": tc.function.name,
                    "arguments": tc.function.arguments
                }
            })
        messages.append({
            "role": "assistant",
            "content": message.content or "",
            "tool_calls": tool_calls_payload
        })
        # Answer every tool call so each tool_call_id has a matching tool message.
        for tc in message.tool_calls:
            if tc.function.name == "search":
                args = json.loads(tc.function.arguments)
                results = mock_search(args["query"])
            else:
                results = {"error": f"Unknown tool: {tc.function.name}"}
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "name": tc.function.name,
                "content": json.dumps(results)
            })
    return "No report generated."
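OpenAI-compatible endpoints generally expect one tool message per tool call, including calls with malformed arguments or tool names you never registered. Once you add more tools, a small dispatcher keeps that handling in one place. The sketch below is one way to do it, not the only one: TOOL_REGISTRY, dispatch_tool, and fake_search are hypothetical names, and fake_search stands in for mock_search so the block runs on its own.

```python
import json
from typing import Callable


def fake_search(query: str) -> list:
    """Stand-in for the tutorial's mock_search so this sketch is self-contained."""
    return [{"title": "General", "snippet": f"Results for: {query}"}]


# Hypothetical registry mapping tool names (as declared in the schema) to functions.
TOOL_REGISTRY: dict[str, Callable[..., list]] = {"search": fake_search}


def dispatch_tool(name: str, raw_arguments: str) -> str:
    """Run a tool call and always return a JSON string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"Malformed arguments: {exc.msg}"})
    if not isinstance(args, dict):
        return json.dumps({"error": "Arguments must be a JSON object."})
    try:
        return json.dumps(func(**args))
    except TypeError as exc:
        # Typically raised when a required parameter is missing or an unexpected one is passed.
        return json.dumps({"error": f"Bad arguments: {exc}"})
```

Inside the loop, the content of each tool message would then be dispatch_tool(tc.function.name, tc.function.arguments). Returning errors as JSON, rather than raising, gives the model a chance to correct its own call on the next turn.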
Step 5: Format and display
The agent returns raw markdown. For now, we print it to the terminal. In a real deployment, you might save it to a file, render it with a markdown library, or stream it to a web interface.
if __name__ == "__main__":
    topic = "Retrieval-Augmented Generation patterns for enterprise search"
    report = run_research(topic)
    print(report)
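If you would rather keep the brief than scroll back through the terminal, saving it to a file takes only a few lines. This is a minimal sketch using the standard library; slugify, save_report, and the reports directory name are my own choices, not part of the original script.

```python
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase the text and collapse non-alphanumeric runs into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug or "report"


def save_report(report: str, topic: str, out_dir: str = "reports") -> Path:
    """Write the markdown report to <out_dir>/<slug>.md and return the path."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{slugify(topic)}.md"
    path.write_text(report, encoding="utf-8")
    return path
```

Calling save_report(report, topic) after run_research would write the file under reports/ with a name derived from the topic.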
Run it
Export your Oxlo.ai API key and execute the script. The model will likely issue one or two search queries, wait for the synthetic results, and then emit the final markdown. If you switch to qwen-3-32b, you may notice different query strategies because its agent tuning emphasizes planning.
export OXLO_API_KEY="sk-oxlo.ai-..."
python research_agent.py
Example output:
# Retrieval-Augmented Generation Patterns for Enterprise Search
## Overview
Retrieval-Augmented Generation (RAG) combines dense retrieval with large language models to ground responses in private data.
## Key Patterns
- **Hybrid retrieval**: Using dense and sparse vectors together improves recall over pure vector search in enterprise settings [Source: RAG Patterns 2024].
- **Re-ranking**: Passing retrieved chunks through a cross-encoder before generation reduces hallucination rates [Source: Enterprise AI Report].
- **Latency considerations**: At sub-million scale, partitioning strategies across major vector databases show similar latency profiles [Source: Vector DB Benchmark].
## Gaps
- No specific benchmark data was found for billion-scale partitioning. Treat latency claims as architecture-dependent.
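The citation rule in the system prompt can also be checked mechanically once the report comes back. The sketch below is a rough heuristic, not a guarantee of groundedness: it treats each bullet line as a claim, looks for the [Source: title] pattern, and compares the titles against the ones your search tool actually returned. The function name check_citations and the bullet-only assumption are mine.

```python
import re

CITATION_PATTERN = re.compile(r"\[Source:\s*([^\]]+)\]")


def check_citations(report: str, known_titles: set[str]) -> dict:
    """Report bullets without a citation and citations that match no search result."""
    uncited = []
    unknown = set()
    for line in report.splitlines():
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue  # only bullet lines are treated as claims in this sketch
        titles = [t.strip() for t in CITATION_PATTERN.findall(stripped)]
        if not titles:
            uncited.append(stripped)
        unknown.update(t for t in titles if t not in known_titles)
    return {"uncited": uncited, "unknown": sorted(unknown)}
```

Run against the example output, the bullet under Gaps would be listed as uncited, which is expected because it reports missing evidence rather than a sourced claim. An entry under unknown is the more useful signal, since it points to a title the search tool never returned.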
Next steps
Swap mock_search for a real search provider such as SerpAPI, Exa, or Tavily, and add extra tools for arXiv or Wikipedia lookups. If you start attaching long PDFs or multi-page search results to the conversation, the flat per-request pricing on Oxlo.ai keeps costs predictable even as your context grows. See the exact plan details at https://oxlo.ai/pricing.
Another practical upgrade is JSON mode. You can request the final report as structured JSON instead of markdown if you need to feed it into another pipeline. Oxlo.ai supports this through the standard response_format={"type": "json_object"} parameter. When the model emits several tool_calls in a single turn, you can also execute them concurrently instead of one after another. If you need a larger context window for massive source documents, switch to kimi-k2.6 with its 131K context. The SDK call stays the same. Only the model string changes.
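Concurrent execution pays off once the search function makes real network requests. A minimal sketch using a thread pool from the standard library is below. The names run_tool_calls_concurrently and slow_search are hypothetical, and the calls are plain dicts that mirror the id and arguments fields the tutorial reads from message.tool_calls.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def slow_search(query: str) -> list:
    """Placeholder for a network-bound search call (hypothetical)."""
    return [{"title": "General", "snippet": f"Results for: {query}"}]


def run_tool_calls_concurrently(
    calls: list[dict], search_fn=slow_search, max_workers: int = 4
) -> list[dict]:
    """Execute search tool calls in threads and return tool messages in call order.

    Each item in `calls` is a dict with "id" and "arguments" (a JSON string).
    """
    def run_one(call: dict) -> dict:
        args = json.loads(call["arguments"])
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(search_fn(args["query"])),
        }

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the original calls.
        return list(pool.map(run_one, calls))
```

The returned messages can be added to the conversation with messages.extend(...) in place of the sequential inner loop. Threads suit this case because the work is I/O-bound; with the in-memory mock there is nothing to gain.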


