
Document summarization is the first LLM feature I add to any internal tool. In this guide, I will show you how to build a production-ready summarizer in Python that reads long text files and returns structured output using Oxlo.ai. Because Oxlo.ai charges a flat rate per request instead of per token, you can feed it multi-page reports or meeting transcripts without worrying that a longer input will inflate your bill. The whole script is under a hundred lines and needs only the standard OpenAI SDK.
What you'll need
- Python 3.10 or newer installed on your machine.
- The OpenAI Python SDK. Install it with pip install openai.
- An Oxlo.ai API key from https://portal.oxlo.ai.
- A text file to summarize. I will generate a sample project report in the final step so you can run the script immediately.
Step 1: Load and chunk the document
I keep the loader dependency-free. It reads a UTF-8 text file and splits it into chunks of at most four thousand words each. I picked four thousand because it leaves ample headroom inside the context window of Llama 3.3 70B while keeping each chunk thematically coherent. Smaller chunks can fragment ideas across boundaries, while larger ones risk hitting limits if the document is verbose. If your file is shorter than that, the loader returns a single chunk and the script skips the reduce step entirely. This approach also means the code works the same whether you are summarizing a one-page brief or a fifty-page technical manual. Note that splitting on whitespace flattens line breaks, so paragraph boundaries do not survive inside a chunk. You could swap the naive word split for a semantic chunker later, but for most use cases a simple split is good enough and avoids extra dependencies.
def load_and_chunk(filepath, chunk_size=4000):
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks
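To see how the splitter behaves before pointing it at a real file, here is a minimal sketch that applies the same word-slicing logic to an in-memory string. The helper name chunk_words and the tiny chunk size are assumptions made for illustration; they are not part of the script itself.

```python
def chunk_words(text, chunk_size=4000):
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Ten words with a chunk size of four yield chunks of 4, 4, and 2 words.
sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4)
print([len(c.split()) for c in chunks])  # [4, 4, 2]
```

One thing this makes visible: an empty string produces an empty list, so an empty input file would reach the summarizer with no chunks at all. You may want to check for that case before making any API calls.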
Step 2: Design the system prompt
I treat the system prompt as an API contract. If I do not constrain the output format, parsing the result later becomes a nightmare. I ask for exactly three sections, no markdown outside them, and strict brevity. This makes post-processing predictable and lets me pipe the result into email templates, Slack bots, or a web UI without extra cleanup. You can adapt the sections to your domain, but keep the formatting rules strict so the model does not wander.
SYSTEM_PROMPT = """You are a precise document summarizer.
Read the provided text and produce a structured summary with exactly these sections:
1. Overview: One paragraph, max three sentences, describing what the document is about.
2. Key Points: A bulleted list of the five most important facts or arguments.
3. Action Items: A bulleted list of any tasks, deadlines, or decisions mentioned. If none, write "None found."
Be concise. Do not add preamble or markdown outside the requested sections."""
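To illustrate why a strict format pays off, here is a rough sketch of a parser that turns the three-section summary into a dict. It assumes the model emits each heading on its own line, optionally numbered or followed by a colon, as in "Overview" or "1. Overview:". The function name parse_summary is hypothetical, and real model output may need looser matching.

```python
import re

SECTION_NAMES = ("Overview", "Key Points", "Action Items")

# A heading is a section name alone on the line, or followed by a colon and text.
HEADING_RE = re.compile(
    r"^(?:\d+\.\s*)?(Overview|Key Points|Action Items)\s*(?::\s*(.*))?$"
)


def parse_summary(text):
    """Split a structured summary into a dict of section name -> list of lines."""
    sections = {name: [] for name in SECTION_NAMES}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        match = HEADING_RE.match(stripped)
        if match:
            current = match.group(1)
            # Keep any text that follows the heading on the same line.
            if match.group(2):
                sections[current].append(match.group(2))
            continue
        if current is not None:
            # Drop a leading bullet marker so items are stored as plain text.
            sections[current].append(stripped.lstrip("-• ").strip())
    return sections


sample = """Overview
This quarterly review covers Project Phoenix.
Key Points
- The board approved a 12% budget increase.
- Launch was delayed from October to November.
Action Items
- Sarah to finalize vendor contracts by Friday.
"""
parsed = parse_summary(sample)
print(parsed["Action Items"])  # ['Sarah to finalize vendor contracts by Friday.']
```

Lines that appear before the first recognized heading are ignored, which is one way to tolerate a stray preamble if the model produces one.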
Step 3: Initialize the Oxlo.ai client and summarize a chunk
Oxlo.ai exposes a fully OpenAI-compatible endpoint, so I import the standard SDK and point it at https://api.oxlo.ai/v1. No custom adapters or hidden parameters are required. I use Llama 3.3 70B here because it follows formatting instructions reliably and runs without cold starts on Oxlo.ai. If you are working with multilingual source material, Qwen 3 32B is a strong alternative on the same endpoint. The function below takes a string, sends it to the model along with the system prompt, and returns the generated summary text. You could add a temperature parameter if you wanted more creative summaries, but for factual documents I leave it at the default.
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def summarize_chunk(text):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
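If you do want to control sampling, the chat completions call accepts a temperature argument. The sketch below only assembles the keyword arguments, so you can inspect them without making a network call. The helper build_request is hypothetical, the short SYSTEM_PROMPT here is a placeholder for the one from Step 2, and the value 0.2 is an arbitrary illustration rather than a recommendation.

```python
SYSTEM_PROMPT = "You are a precise document summarizer."  # placeholder for the Step 2 prompt


def build_request(text, model="llama-3.3-70b", temperature=None):
    """Assemble keyword arguments for client.chat.completions.create."""
    kwargs = {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    }
    # Only send temperature when the caller sets it, so the default applies otherwise.
    if temperature is not None:
        kwargs["temperature"] = temperature
    return kwargs


request = build_request("Some report text", temperature=0.2)
print(sorted(request))  # ['messages', 'model', 'temperature']
# With a configured client: client.chat.completions.create(**request)
```

Keeping the request construction separate from the call also makes it easy to unit test the prompt wiring without an API key.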
Step 4: Chain summaries for long documents
When a document spans multiple chunks, I run a lightweight map-reduce pattern. First, I summarize each chunk individually. This is the map phase, and it is trivial to parallelize with a thread pool if you have many chunks. Then I concatenate those partial summaries with clear delimiters and run one final synthesis call. This is the reduce phase. On token-based providers, this second pass can be expensive because you pay for every token in the combined intermediate text. With Oxlo.ai, the reduce step is just one more flat request, which makes long-document workflows practical. You can see how this fits your budget on the Oxlo.ai pricing page at https://oxlo.ai/pricing. If you find yourself reducing ten or more partial summaries, consider switching the reduce model to DeepSeek V4 Flash on Oxlo.ai. Its one-million-token context window can swallow a massive bundle of intermediates in a single shot, so you might even skip chunking entirely for all but the largest books.
def summarize_document(filepath):
    chunks = load_and_chunk(filepath)
    # Map step
    partial_summaries = [summarize_chunk(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    # Reduce step
    combined = "\n\n---\n\n".join(partial_summaries)
    reduce_prompt = (
        "You are given partial summaries of a longer document. "
        "Synthesize them into a single coherent summary using the same structured format: "
        "Overview, Key Points, Action Items.\n\nPartial summaries:\n" + combined
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": reduce_prompt},
        ],
    )
    return response.choices[0].message.content
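The thread-pool idea for the map phase could look roughly like this. It is a sketch, not part of the script: map_summaries and its max_workers default are assumptions, and fake_summarize stands in for summarize_chunk so the example runs without an API key. Check your account's rate limits before raising the worker count.

```python
from concurrent.futures import ThreadPoolExecutor


def map_summaries(chunks, summarize, max_workers=4):
    """Summarize chunks concurrently, returning results in chunk order."""
    if not chunks:
        return []
    workers = min(max_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, even if calls finish out of order.
        return list(pool.map(summarize, chunks))


# Stand-in for summarize_chunk so the sketch runs offline.
def fake_summarize(chunk):
    return f"summary of {len(chunk.split())} words"


print(map_summaries(["a b c", "d e"], fake_summarize))
# ['summary of 3 words', 'summary of 2 words']
```

Threads suit this workload because each call spends most of its time waiting on the network. To use it in the pipeline, you would pass summarize_chunk as the second argument in place of the list comprehension in the map step.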
Step 5: Run it
Here is a complete end-to-end test. I write a dummy quarterly review to disk, then feed it through the pipeline. In practice you would point this at exported emails, transcribed meeting notes, or scraped web pages. The script handles the chunking automatically, calls Oxlo.ai, and prints the final structured summary.
if __name__ == "__main__":
    sample_text = """
Project Phoenix Quarterly Review
The board approved a 12% budget increase to cover cloud migration costs.
Launch was delayed from October to November due to load-testing requirements.
The mobile team hit 98% feature parity with the web client.
Customer beta feedback scores averaged 4.2 out of 5.
A security audit flagged two medium-priority issues in the auth service.
Sarah will finalize vendor contracts by Friday.
Engineering must submit load-test results by October 15.
The security team will re-scan the auth service before launch.
"""
    with open("report.txt", "w", encoding="utf-8") as f:
        f.write(sample_text)
    result = summarize_document("report.txt")
    print(result)
When I run the script, the output looks like this:
Overview
This quarterly review covers Project Phoenix status, budget changes, and revised launch timelines for the engineering and product teams.
Key Points
- The board approved a 12% budget increase to cover cloud migration costs.
- Launch was delayed from October to November due to load-testing requirements.
- The mobile team hit 98% feature parity with the web client.
- Customer beta feedback scores averaged 4.2 out of 5.
- A security audit flagged two medium-priority issues in the auth service.
Action Items
- Sarah to finalize vendor contracts by Friday.
- Engineering to submit load-test results by October 15.
- Security team to re-scan the auth service before launch.
Next steps
Swap the reduce model to Kimi K2.6 or DeepSeek V4 Flash if you want to collapse larger documents without chunking. Both are available on Oxlo.ai and support 131K and 1M token contexts respectively, so you can often drop an entire white paper into a single request. You can also turn on JSON mode by passing response_format={"type": "json_object"} in the chat completion call and rewriting the system prompt to request JSON. The response content is then a JSON string that you can parse with json.loads into a dict and pipe straight into a database or Slack notification.
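If you go the JSON route, it helps to validate the response before handing it downstream. The sketch below assumes you have rewritten the system prompt to request an object with the keys overview, key_points, and action_items. That schema and the parse_json_summary helper are illustrative assumptions, and the raw string stands in for response.choices[0].message.content.

```python
import json

REQUIRED_KEYS = ("overview", "key_points", "action_items")  # assumed schema


def parse_json_summary(raw):
    """Parse a JSON-mode response string and check the expected keys exist."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"summary is missing keys: {missing}")
    return data


# Stand-in for the content of a JSON-mode completion.
raw = '{"overview": "Quarterly review.", "key_points": ["Budget up 12%"], "action_items": []}'
summary = parse_json_summary(raw)
print(summary["key_points"])  # ['Budget up 12%']
```

Malformed JSON raises json.JSONDecodeError, which is a subclass of ValueError, so a single except ValueError around the call covers both failure modes.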

