Vision Tasks and Image Processing with Oxlo.ai

I recently built a receipt parser for a finance team that was tired of manual data entry. Instead of chaining OCR APIs with fragile regex, I wrote a single vision agent on Oxlo.ai that looks at an image and returns structured JSON. In this tutorial, I will walk you through the exact code so you can adapt it for invoices, inventory photos, or any visual document workload.

What you'll need

Before we start, gather the following:

  • An Oxlo.ai API key from https://portal.oxlo.ai
  • Python 3.10 or newer
  • The OpenAI SDK and Pillow: pip install openai Pillow

I will assume you have a folder of JPEG receipt images ready to process.

Step 1: Prepare the image payload

Vision models consume images as base64 data URIs. I resize any massive photos first so the payload stays lean without losing readability. This keeps network overhead low and ensures we stay well within the context window. I use Pillow because it handles orientation metadata correctly and lets us compress to JPEG on the fly.

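import base64
import io
from PIL import Image, ImageOps

def encode_image(image_path: str, max_dim: int = 1024) -> str:
    """Return the image as a base64 JPEG data URI, downsized to fit max_dim."""
    with Image.open(image_path) as src:
        # Apply the EXIF orientation tag so rotated phone photos come out upright
        img = ImageOps.exif_transpose(src)
        # JPEG cannot store alpha or palette modes, so normalize to RGB
        img = img.convert("RGB")

    # Downsample only if the image is larger than we need
    if max(img.size) > max_dim:
        img.thumbnail((max_dim, max_dim))

    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"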
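
To see why resizing pays off, it helps to put a number on the payload. Base64 turns every 3 bytes into 4 characters, so the data URI is roughly a third larger than the JPEG itself. The sketch below is only arithmetic: data_uri_length is a hypothetical helper, not part of the parser, and the two byte counts are made-up example sizes rather than measurements.

```python
import base64

DATA_URI_PREFIX = "data:image/jpeg;base64,"


def data_uri_length(num_bytes: int) -> int:
    """Characters in a JPEG data URI that carries num_bytes of image data."""
    # Base64 emits 4 characters for every 3 input bytes, padding the last group
    encoded_chars = 4 * ((num_bytes + 2) // 3)
    return len(DATA_URI_PREFIX) + encoded_chars


# Cross-check the formula against the real encoder using dummy bytes
sample = bytes(1000)
real_uri = DATA_URI_PREFIX + base64.b64encode(sample).decode("utf-8")
assert data_uri_length(len(sample)) == len(real_uri)

# Hypothetical sizes: a resized receipt versus an untouched phone photo
for size in (150_000, 4_000_000):
    print(f"{size} bytes of JPEG -> {data_uri_length(size)} characters of payload")
```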

Step 2: Define the extraction schema

The system prompt is the contract. By spelling out a strict JSON schema and banning markdown fences, we remove most of the guesswork from the parsing stage. I treat this prompt as application config, and you should version control it. The more explicit you are about types and enums, the less likely the model is to hallucinate fields.

SYSTEM_PROMPT = """You are a precise document parser. Look at the image and extract the following fields in valid JSON:
- merchant_name: string
- date: ISO 8601 string or null
- total_amount: number
- currency: three-letter code or null
- line_items: array of objects with {description: string, amount: number}
- category: one of ["meals", "travel", "office", "software", "other"]

Return only the JSON object. Do not wrap it in markdown fences."""

Step 3: Call the vision model

We send the image and prompt to Oxlo.ai using the standard OpenAI SDK. I use Kimi K2.6 because its 131K context window easily holds large images plus the system prompt. Because Oxlo.ai charges a flat rate per request instead of per token, processing a high-resolution scan costs the same as a tiny thumbnail. That makes batch workflows predictable, and you can see the exact rates on the Oxlo.ai pricing page.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def parse_receipt(image_path: str) -> str:
    data_uri = encode_image(image_path)
    
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "text", "text": "Extract the receipt data."},
                {"type": "image_url", "image_url": {"url": data_uri}}
            ]},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
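
Network calls fail now and then, and in a batch you usually want one flaky request to be retried rather than lost. Here is a minimal sketch of a retry wrapper with exponential backoff. with_retries is a hypothetical helper of my own, and catching every Exception is an assumption made for brevity; in practice you would narrow it to the transient errors you actually see.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus up to 25% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.25))
    raise ValueError("attempts must be at least 1")
```

You would wrap the call from this step as with_retries(lambda: parse_receipt(path)). The sleep parameter exists only so the waits can be skipped or recorded in tests.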

Step 4: Parse and validate the response

Even with explicit instructions, models occasionally wrap the JSON in markdown fences. We strip them and parse strict JSON. If the payload is malformed, we let the exception bubble up so the batch logger can record the failure. I keep the regex minimal because the system prompt already forbids fences, but defense in depth saves you from a 3 a.m. pager.

import json
import re

def extract_json(raw: str) -> dict:
    raw = raw.strip()
    if raw.startswith("```"):
        raw = re.sub(r"^```(?:json)?\s*", "", raw)
        raw = re.sub(r"\s*```$", "", raw)
    return json.loads(raw)

def process_receipt(image_path: str) -> dict:
    raw_text = parse_receipt(image_path)
    return extract_json(raw_text)
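
Parsing only proves the text is JSON. It does not prove the fields match the contract in SYSTEM_PROMPT. Below is a minimal validation sketch that uses only the standard library and checks each field from the prompt. validate_receipt and its error messages are my own illustrative names, and the date check relies on datetime.fromisoformat, which accepts common ISO 8601 forms but not every variant.

```python
from datetime import datetime

ALLOWED_CATEGORIES = {"meals", "travel", "office", "software", "other"}


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so rule it out explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_receipt(data) -> list[str]:
    """Return a list of problems; an empty list means the record fits the schema."""
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = []

    if not isinstance(data.get("merchant_name"), str):
        problems.append("merchant_name must be a string")

    raw_date = data.get("date")
    if raw_date is not None:
        try:
            datetime.fromisoformat(raw_date)  # e.g. 2024-03-12
        except (TypeError, ValueError):
            problems.append("date must be an ISO 8601 string or null")

    if not _is_number(data.get("total_amount")):
        problems.append("total_amount must be a number")

    currency = data.get("currency")
    if currency is not None and not (
        isinstance(currency, str) and len(currency) == 3 and currency.isalpha()
    ):
        problems.append("currency must be a three-letter code or null")

    items = data.get("line_items")
    if not isinstance(items, list):
        problems.append("line_items must be an array")
    else:
        for i, item in enumerate(items):
            if not (
                isinstance(item, dict)
                and isinstance(item.get("description"), str)
                and _is_number(item.get("amount"))
            ):
                problems.append(f"line_items[{i}] needs a description and a numeric amount")

    category = data.get("category")
    if not (isinstance(category, str) and category in ALLOWED_CATEGORIES):
        problems.append("category must be one of the allowed values")

    return problems
```

One way to use it is to call validate_receipt on the result of extract_json inside process_receipt and raise a ValueError when the list is not empty, so schema violations show up as ERR lines in the batch log.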

Step 5: Batch process a folder

A real workflow processes hundreds of images. This loop sends each file to Oxlo.ai and appends the results to a JSONL file. On the Pro plan you get 1,000 requests per day, which is enough for a sizable backlog. If you need more volume, the Premium tier offers 5,000 requests per day with priority queueing. Because Oxlo.ai has no cold starts on popular models, the first image processes just as fast as the hundredth.

from pathlib import Path

INPUT_DIR = Path("./receipts")
OUTPUT_FILE = Path("./results.jsonl")

def batch_process():
    if not INPUT_DIR.exists():
        raise FileNotFoundError(f"Create {INPUT_DIR} and add JPEG images.")
    
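    # Match .jpg and .jpeg in any letter case, not only lowercase .jpg
    images = (p for p in INPUT_DIR.iterdir() if p.suffix.lower() in {".jpg", ".jpeg"})
    for img_path in sorted(images):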
        try:
            data = process_receipt(str(img_path))
            record = {"file": img_path.name, "extracted": data}
            
            with OUTPUT_FILE.open("a", encoding="utf-8") as f:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
            
            print(f"OK  {img_path.name}")
        except Exception as e:
            print(f"ERR {img_path.name}: {e}")

if __name__ == "__main__":
    batch_process()
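
Because the loop appends a record for every image on every run, restarting after an interruption sends the whole folder again and writes duplicate lines. A small sketch of resume support follows: read the names already recorded in the JSONL file and skip them. names_from_lines and already_processed are hypothetical helpers, and I am assuming the file name is a good enough key for "already done".

```python
import json
from pathlib import Path
from typing import Iterable


def names_from_lines(lines: Iterable[str]) -> set[str]:
    """Collect the "file" value from each well-formed JSONL record."""
    done = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            done.add(json.loads(line)["file"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # A half-written last line (for example after a crash) is ignored,
            # so that image is simply retried on the next run
            continue
    return done


def already_processed(output_file: Path) -> set[str]:
    """Return the file names recorded in output_file, or an empty set if it is missing."""
    if not output_file.exists():
        return set()
    with output_file.open(encoding="utf-8") as f:
        return names_from_lines(f)
```

To wire it in, compute done = already_processed(OUTPUT_FILE) once before the loop in batch_process and continue past any image whose img_path.name is in done.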

Run it

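Create a receipts directory, drop in a few JPEGs, and run the script. I recommend starting with five to ten images so you can spot-check the JSON before you scale up. The script appends to results.jsonl, so stopping it never loses records that were already written. Keep in mind that a restart sends every image in the folder again and appends duplicate records, so move finished images out of the folder or deduplicate by file name before you rerun.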

python receipt_parser.py

You should see output like this:

OK  lunch_2024-03-12.jpg
OK  uber_2024-03-15.jpg
ERR hotel_2024-03-18.jpg: Expecting property name enclosed in double quotes

And results.jsonl will contain structured records:

{"file": "lunch_2024-03-12.jpg", "extracted": {"merchant_name": "Blue Bottle Coffee", "date": "2024-03-12", "total_amount": 18.5, "currency": "USD", "line_items": [{"description": "Latte", "amount": 5.5}, {"description": "Avocado Toast", "amount": 13.0}], "category": "meals"}}

Wrap-up and next steps

The agent is now a solid foundation for any visual document pipeline. Two concrete directions to take it next:

  1. Add a confidence score to the schema and route anything below 0.9 to a human review queue.
  2. Pipe the JSONL output directly into a database or an accounting API like QuickBooks so the finance team never touches a CSV again.
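
The first idea can be sketched in a few lines. This assumes you have added a hypothetical confidence field (a number between 0 and 1) to the schema in SYSTEM_PROMPT. The field name, the route function, and the two queue labels are all illustrative. A score the model reports about itself is only a rough signal, so anything missing or out of range goes to review as well.

```python
REVIEW_THRESHOLD = 0.9  # matches the 0.9 cutoff suggested above


def route(record: dict, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "auto" for confident records and "review" for everything else."""
    extracted = record.get("extracted")
    if not isinstance(extracted, dict):
        return "review"

    confidence = extracted.get("confidence")  # hypothetical schema field
    # bool is a subclass of int, so exclude it from the numeric check
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if is_number and threshold <= confidence <= 1.0:
        return "auto"
    return "review"


# Example with made-up records shaped like the lines in results.jsonl
records = [
    {"file": "lunch.jpg", "extracted": {"total_amount": 18.5, "confidence": 0.97}},
    {"file": "hotel.jpg", "extracted": {"total_amount": 240.0, "confidence": 0.62}},
]
for rec in records:
    print(rec["file"], "->", route(rec))
```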

Oxlo.ai's OpenAI-compatible API and flat request pricing make it a natural fit for this kind of long-context vision workload. You can get started on the pricing page and have the parser running in minutes. If you adapt this for a different domain, like warehouse inventory or medical forms, the only things you need to change are the system prompt and the output schema.
