
Retrieval-augmented generation (RAG) is the dominant pattern for grounding large language models in private, proprietary, or frequently updated data. Instead of fine-tuning a model on your documents, you parse them into chunks, embed those chunks, store them in a vector database, and retrieve the most relevant passages at query time. The retrieved text is then injected into the prompt, and the model synthesizes an answer conditioned on that evidence.
The practical challenge is not just building the pipeline; it is controlling cost. Every retrieved chunk inflates the input prompt, and under token-based billing a single RAG query with long context can become expensive. Oxlo.ai addresses this with request-based pricing: one flat cost per API request regardless of prompt length. That makes Oxlo.ai a natural backend for RAG, where supplying more relevant context tends to improve answer quality. In this tutorial, we will build a complete RAG pipeline using Oxlo.ai for both embeddings and generation, with code you can run today.
What Is RAG and Why It Matters
RAG splits inference into two stages: retrieval and synthesis. During retrieval, a user query is converted into an embedding vector and matched against a vector store of document chunks. The top-k most similar chunks are fetched. During synthesis, those chunks are concatenated into a prompt, and a language model generates an answer that cites the provided evidence. This pattern reduces hallucinations, keeps sensitive data out of model training runs, and allows you to update knowledge simply by re-indexing documents.
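To make the retrieval stage concrete, here is a minimal sketch of top-k selection by cosine similarity in plain Python. The three-dimensional vectors are toy stand-ins for real embeddings, and the helper names `cosine_similarity` and `top_k` are illustrative, not part of any library. A vector database performs the same ranking, usually with an approximate index so it scales past a few thousand chunks.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for chunk embeddings.
chunk_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
hits = top_k([1.0, 0.0, 0.0], chunk_vectors, k=2)
```

Here `hits` holds the indices of the two chunks closest to the query, best match first.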
The hidden cost driver is synthesis. A typical RAG prompt might include a system instruction, the original question, and five to ten document chunks, each hundreds of tokens long. If you are using a token-based provider, you pay for every token in that context window. In agentic RAG, where the system iterates retrieval and generation in a loop, those costs compound with each turn. Oxlo.ai flips this model by charging per request, not per token, so you can retrieve generously without watching a meter run.
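The arithmetic behind that comparison can be sketched in a few lines. Every price below is a made-up placeholder, not a quote from Oxlo.ai or any other provider, and the model ignores output tokens and accumulated conversation history. Substitute real numbers from the pricing pages you are comparing.

```python
def token_billed_cost(turns, chunks_per_turn, tokens_per_chunk,
                      overhead_tokens, price_per_1k_tokens):
    """Input-side cost when every prompt token is billed."""
    prompt_tokens = overhead_tokens + chunks_per_turn * tokens_per_chunk
    return turns * prompt_tokens / 1000 * price_per_1k_tokens

def request_billed_cost(turns, price_per_request):
    """Cost when each API request has one flat price."""
    return turns * price_per_request

# HYPOTHETICAL prices, chosen only to illustrate the shape of the curves.
PRICE_PER_1K_TOKENS = 0.001
PRICE_PER_REQUEST = 0.002

# 8 chunks of 400 tokens plus 300 tokens of instructions and question.
single_turn = token_billed_cost(1, 8, 400, 300, PRICE_PER_1K_TOKENS)
agentic_loop = token_billed_cost(5, 8, 400, 300, PRICE_PER_1K_TOKENS)
flat_loop = request_billed_cost(5, PRICE_PER_REQUEST)
```

Under token billing the cost moves with both the number of turns and the number of chunks per turn; under request billing only the number of turns matters. Which option is cheaper for a given workload depends entirely on the actual prices.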
Architecture Overview
A production RAG pipeline has three layers. Ingestion parses source documents, chunks them, computes embedding vectors, and writes them to a vector store. Retrieval accepts a query, embeds it, and performs a similarity search. Generation constructs a prompt from the query and retrieved chunks, then calls a chat model to produce the final output. Each layer has distinct latency and reliability requirements, but generation is usually the dominant cost center because, under token-based pricing, cost scales with prompt length.
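The chunking step of ingestion is not shown elsewhere in this tutorial, so here is one possible sketch: fixed-size word windows with overlap. The function name `chunk_text` and the default sizes are assumptions for illustration. Real pipelines often split on sentences, headings, or token counts instead.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into word-based chunks whose windows overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Ten placeholder words, split into windows of four that share one word.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_text(sample, chunk_size=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of storing some text twice.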
Oxlo.ai provides fully OpenAI-compatible endpoints for both the embedding and generation layers. You can use https://api.oxlo.ai/v1 as a drop-in replacement for your existing client, which means your RAG code requires no custom SDKs or vendor-specific response handling.
Choosing Your Embedding Model
Retrieval quality depends on your embedding model. Oxlo.ai offers BGE-Large and E5-Large through a standard OpenAI-compatible embeddings endpoint. These models produce dense vectors that capture semantic meaning across multilingual and domain-specific text, making them suitable for legal, technical, and general knowledge bases. Because Oxlo.ai uses request-based pricing, embedding a batch of chunks costs the same flat rate whether you send ten short sentences or ten long paragraphs. That is particularly useful during initial indexing, where you might process thousands of documents in bulk.
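For bulk indexing, one way to organize the work is to group chunks into batches and send each batch as a single embeddings request. The sketch below keeps the network call out of the picture by accepting any `embed_batch` callable. The names `batched` and `embed_all` and the default batch size of 64 are assumptions, since the maximum number of inputs per request is not stated here.

```python
from typing import Callable, Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

def embed_all(chunks: Sequence[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              batch_size: int = 64) -> List[List[float]]:
    """Embed every chunk, issuing one embed_batch call per batch."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stand-in embedder so the sketch runs offline; it records each batch size.
batch_sizes = []

def fake_embed_batch(texts):
    batch_sizes.append(len(texts))
    return [[float(len(t))] for t in texts]

vectors = embed_all([f"chunk {i}" for i in range(5)], fake_embed_batch, batch_size=2)
```

With a real client, `embed_batch` would wrap `client.embeddings.create(model="bge-large", input=texts)` and return `[item.embedding for item in response.data]`.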
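To get started, install the OpenAI Python SDK (pip install openai) and point it at Oxlo.ai.
import openai

# Reuse the standard OpenAI client; only the key and base URL change.
client = openai.OpenAI(
    api_key="YOUR_OXLO_API_KEY",
    base_url="https://api.oxlo.ai/v1",
)

# Embed a single sentence; `input` takes a list, so batches work the same way.
response = client.embeddings.create(
    model="bge-large",
    input=["Oxlo.ai offers flat per-request pricing for long-context workloads."],
)
vector = response.data[0].embedding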
If you are migrating from another provider, the only changes are the base URL and model identifier. The SDK, response shapes, and retry semantics remain identical.
Setting Up the Vector Store
For this tutorial we will use ChromaDB, an open-source vector database that runs in-process. The goal is to keep the example reproducible without requiring managed infrastructure. After chunking your documents, store each chunk alongside its embedding vector.
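import chromadb

# An in-process, in-memory client; data does not persist across runs.
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="docs")

# Store the chunk text next to the embedding computed earlier.
collection.add(
    embeddings=[vector],
    documents=["Oxlo.ai is a developer-first AI inference platform."],
    ids=["chunk-0"],
)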
In production, you might replace ChromaDB with Pinecone, Weaviate, or pgvector. The retrieval interface is standard, so the generation layer remains unchanged. That is where Oxlo.ai fits in.
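One way to keep the generation layer independent of the vector store is to code against a small interface. The sketch below is an assumption about how you might structure this, not an API of ChromaDB or any other database: `Retriever`, `InMemoryRetriever`, and `build_context` are hypothetical names, and the in-memory store ranks by plain dot product.

```python
from typing import List, Protocol, Sequence, Tuple

class Retriever(Protocol):
    """Anything that can return the k most relevant documents for a query vector."""
    def search(self, query_embedding: Sequence[float], k: int) -> List[str]: ...

class InMemoryRetriever:
    """Toy store that ranks documents by dot product with the query vector."""
    def __init__(self) -> None:
        self._rows: List[Tuple[List[float], str]] = []

    def add(self, embedding: Sequence[float], document: str) -> None:
        self._rows.append((list(embedding), document))

    def search(self, query_embedding: Sequence[float], k: int) -> List[str]:
        def score(row: Tuple[List[float], str]) -> float:
            return sum(q * x for q, x in zip(query_embedding, row[0]))
        ranked = sorted(self._rows, key=score, reverse=True)
        return [document for _, document in ranked[:k]]

def build_context(retriever: Retriever, query_embedding: Sequence[float], k: int = 3) -> str:
    """Generation-side helper that only depends on the Retriever interface."""
    return "\n\n".join(retriever.search(query_embedding, k))

store = InMemoryRetriever()
store.add([1.0, 0.0], "about pricing")
store.add([0.0, 1.0], "about embeddings")
context_text = build_context(store, [0.9, 0.1], k=1)
```

A ChromaDB adapter would implement `search` by calling `collection.query(query_embeddings=[...], n_results=k)` and returning `results["documents"][0]`; adapters for other stores would follow the same shape.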
Generating Answers with a Retrieval-Augmented Prompt
Once you have retrieved relevant chunks, you need a model that follows instructions, respects the provided context, and avoids hallucinating facts. Oxlo.ai hosts several models suited for RAG synthesis. Llama 3.3 70B is a strong general-purpose choice with broad tool compatibility. If your documents require deep reasoning, such as legal contracts or technical specifications, DeepSeek R1 671B MoE provides chain-of-thought capabilities that improve accuracy on complex retrieval tasks. For agentic RAG workflows, where the model decides to search again or call an external API, Qwen 3 32B supports multilingual reasoning and agent workflows.
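Whichever model you pick, it can only cite sources that are identifiable in the prompt. A minimal sketch of one approach is to number each retrieved chunk and carry its store id alongside the text. The helper names and the exact prompt wording are illustrative assumptions, not a format the models require.

```python
def format_context(documents, ids):
    """Label each chunk with a bracketed number and its store id."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    blocks = [
        f"[{n}] (id: {chunk_id})\n{doc}"
        for n, (doc, chunk_id) in enumerate(zip(documents, ids), start=1)
    ]
    return "\n\n".join(blocks)

def build_messages(question, documents, ids):
    """Assemble chat messages that ask for answers citing the numbered chunks."""
    return [
        {
            "role": "system",
            "content": "Answer using only the provided context. "
                       "Cite sources by their bracketed number.",
        },
        {
            "role": "user",
            "content": f"Context:\n{format_context(documents, ids)}\n\n"
                       f"Question: {question}",
        },
    ]

messages = build_messages(
    "How is pricing billed?",
    ["Doc A", "Doc B"],
    ["chunk-0", "chunk-1"],
)
```

The resulting list can be passed as the `messages` argument of a chat completions call, and the bracketed numbers in the answer can be mapped back to chunk ids afterwards.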
The following example retrieves chunks and sends them to the Oxlo.ai chat completions endpoint. We use JSON mode to enforce a structured answer, which is helpful when you want to return citations alongside the response.
query = "How does Oxlo.ai pricing work for RAG?"

# Embed the query with the same model used for the document chunks.
query_embedding = client.embeddings.create(
    model="bge-large", input=query
).data[0].embedding

# Fetch the three nearest chunks from the collection.
results = collection.query(
    query_embeddings=[query_embedding], n_results=3
)

# results["documents"] holds one list per query; we sent a single query.
context = "\n\n".join(results["documents"][0])

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {
            "role": "system",
            "content": "Answer using only the provided context. Cite sources."
        },
        {
            "role": "user",
            "content": f"Context:\