Revolutionizing Computer Vision with LLMs

Most warehouse teams still rely on clipboard audits to track inventory and safety compliance. We are going to replace that with a lightweight Python agent that ingests a JPEG, analyzes the scene, and returns structured JSON with pallet counts, forklift presence, and any visible hazards. The whole pipeline runs on Oxlo.ai using the standard OpenAI SDK, and because Oxlo.ai charges a flat rate per request instead of per token, sending a high-resolution image does not inflate the cost.

What you'll need

You will need Python 3.10 or newer, the official OpenAI SDK (pip install openai), and an Oxlo.ai API key from https://portal.oxlo.ai. I also keep Pillow installed (pip install pillow) to verify image formats locally before burning an API call on a corrupted file. Grab any warehouse or stock-room photo and save it as warehouse.jpg in your working directory. If you do not have a real warehouse handy, a cluttered garage or storage closet works fine for this tutorial because the model is looking for generic industrial objects.

Step 1: Instantiate the Oxlo.ai client

Oxlo.ai exposes a fully OpenAI-compatible endpoint at https://api.oxlo.ai/v1, so the official openai Python package works without adapters or custom wrappers. That compatibility is useful because it means existing codebases can migrate by changing two lines, the base URL and the API key. For this pipeline I chose kimi-k2.6 because it supports vision, advanced reasoning, and a 131K context window. That combination matters when the model needs to simultaneously parse a cluttered industrial scene and follow a strict JSON schema. Oxlo.ai also hosts gemma-3-27b for vision tasks, but K2.6's reasoning layer handles ambiguous lighting and partially occluded objects more reliably. Another practical benefit is that Oxlo.ai keeps popular models warm, so the first request of the day returns just as quickly as the fiftieth.

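import os

from openai import OpenAI

# Point the standard OpenAI client at the Oxlo.ai endpoint.
# os.environ[...] raises KeyError right away if the key is not set,
# instead of passing None through to the client.
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"],
)

# Vision-capable model used for every request in this tutorial.
MODEL = "kimi-k2.6"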
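Since the migration comes down to a base URL and an API key, it can help to load both from the environment in one place. The following is a minimal sketch, not part of the original script: ClientConfig, load_client_config, and the OXLO_BASE_URL override are hypothetical names chosen for illustration, and only OXLO_API_KEY and the endpoint URL come from the text above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_BASE_URL = "https://api.oxlo.ai/v1"  # endpoint quoted in the tutorial


@dataclass(frozen=True)
class ClientConfig:
    base_url: str
    api_key: str


def load_client_config(env: Mapping[str, str] = os.environ) -> ClientConfig:
    """Collect the two settings a migration touches, failing fast if the key is absent."""
    api_key = env.get("OXLO_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OXLO_API_KEY is not set; export it before running the script")
    # OXLO_BASE_URL is a hypothetical override for this sketch, not an official variable.
    base_url = env.get("OXLO_BASE_URL", DEFAULT_BASE_URL)
    return ClientConfig(base_url=base_url, api_key=api_key)
```

With this in place, the client could be built as OpenAI(base_url=cfg.base_url, api_key=cfg.api_key), where cfg = load_client_config(). Passing a plain dict instead of os.environ keeps the function easy to test.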

Step 2: Encode the image for the multimodal payload

The OpenAI-compatible chat API accepts images as base64 data URIs inside the message content array. I wrote a small helper that opens a local file, checks that its extension is JPEG or PNG, and returns the base64 string wrapped in the correct data URI scheme. Keeping this logic isolated means I can swap in a URL-based loader later without touching the inference code. I also keep the image under a few megabytes because large payloads still consume bandwidth even though Oxlo.ai's per-request pricing removes the token cost penalty. If you are batch-processing a directory of photos, encoding is cheap local work while inference is network-bound, so you can overlap the two and keep several requests in flight at once.

import base64
from pathlib import Path

# Supported file extensions and the MIME type each one maps to.
MIME_TYPES = {".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".png": "image/png"}

def encode_image(image_path: str) -> str:
    """Read a local file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_multimessage(image_path: str, text: str) -> dict:
    """Build a user message that pairs a text instruction with an inline image."""
    ext = Path(image_path).suffix.lower()
    if ext not in MIME_TYPES:
        # Reject anything that is not JPEG or PNG instead of mislabeling it.
        raise ValueError(f"Unsupported image type: {ext or 'no extension'}")
    b64 = encode_image(image_path)
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{MIME_TYPES[ext]};base64,{b64}"}
            }
        ]
    }
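A file extension can be wrong, so one way to confirm the format is to look at the file's leading bytes before spending an API call. The sketch below uses only the standard library and is an illustration rather than the tutorial's own helper: sniff_image_mime and to_data_uri are hypothetical names. The PNG and JPEG signatures are the standard magic numbers for those formats.

```python
import base64

# Leading bytes ("magic numbers") that identify each supported format.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def sniff_image_mime(data: bytes) -> str:
    """Return the MIME type implied by the first bytes, or raise ValueError."""
    if data.startswith(PNG_SIGNATURE):
        return "image/png"
    if data.startswith(JPEG_SIGNATURE):
        return "image/jpeg"
    raise ValueError("Not a JPEG or PNG image")


def to_data_uri(data: bytes) -> str:
    """Build a base64 data URI whose MIME type comes from the content, not the filename."""
    mime = sniff_image_mime(data)
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

This only inspects the header, so a truncated file with a valid signature would still pass. A full decode with Pillow is the stricter check if you have it installed.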

Step 3: Lock down the system prompt

The system prompt is the contract between the engineer and the model. I tell it exactly what objects to look for, what JSON schema to return, and how to handle uncertainty. Each key serves a downstream purpose. Pallet counts feed inventory dashboards, forklift presence triggers traffic safety audits, and the hazards list drives compliance tickets. I also forbid markdown formatting so I do not have to strip backticks from the raw output, and I cap the notes field at one sentence to prevent the model from rambling. Keeping this prompt in a dedicated constant makes it easy to tweak detection criteria without touching business logic.

SYSTEM_PROMPT = """
You are a warehouse safety auditor.
Given an image of a warehouse floor, return a single JSON object with exactly these keys:
- pallet_count: integer, number of visible pallets.
- forklift_present: boolean.
- safety_hazards: list of strings describing any visible hazards such as blocked exits, loose cables, or spilled liquids. Use an empty list if none are visible.
- notes: string, one sentence on overall scene cleanliness.
- confidence: string, either "high", "medium", or "low".

Do not include markdown formatting, explanations, or bullet points. Return only the JSON object.
"""

Step 4: Call the vision model

Now we assemble the request. I pass the system prompt and the multimodal user message, and I set response_format to JSON mode. I set temperature to 0.2 because extraction tasks benefit from consistency, not creativity. This combination of vision input plus structured output is where Oxlo.ai's flat per-request pricing really changes the economics. On token-based providers, a high-resolution image can add hundreds or thousands of tokens to the prompt. With Oxlo.ai, the cost stays the same whether the image is 100 KB or 5 MB, which makes this pattern practical for production agents that process dozens of floor photos every hour. You do not need to micro-optimize output length or worry that a detailed photo will trigger unexpected charges.

import json

def analyze_warehouse(image_path: str) -> dict:
    """Send one image to the vision model and return the parsed JSON audit."""
    user_message = build_multimessage(
        image_path,
        "Audit this warehouse snapshot and return the required JSON."
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            user_message,
        ],
        response_format={"type": "json_object"},
        temperature=0.2,
    )

    # message.content can be None, so fail with a clear error before parsing.
    raw = response.choices[0].message.content
    if raw is None:
        raise ValueError("Model returned no content")
    return json.loads(raw)
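Even with JSON mode and a prompt that forbids markdown, it is cheap to parse defensively. The following is a minimal sketch of a tolerant parser, assuming the reply is either bare JSON or JSON wrapped in a single markdown code fence. parse_model_json is a hypothetical helper, not part of the OpenAI SDK.

```python
import json
from typing import Optional

FENCE = "`" * 3  # markdown code fence marker


def parse_model_json(raw: Optional[str]) -> dict:
    """Parse the model's reply into a dict, tolerating one stray markdown fence."""
    if raw is None or not raw.strip():
        raise ValueError("Model returned an empty response")
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag),
        # then everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed
```

Because json.JSONDecodeError is a subclass of ValueError, callers can catch ValueError alone to handle empty replies, malformed JSON, and non-object payloads.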

Step 5: Parse and validate the response

LLMs can hallucinate keys or return malformed numbers, so I add a thin validation layer that checks for required fields and ensures counts are integers. If anything is missing, I raise a clear error so the calling code can decide whether to retry with a different prompt or escalate the image to a human reviewer. In production, I also log the raw response string to an object store before parsing. That audit trail is invaluable when a floor manager asks why a specific photo was flagged, and it lets you detect schema drift if you later expand the system prompt with new fields.

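REQUIRED_KEYS = {
    "pallet_count", "forklift_present", "safety_hazards", "notes", "confidence"
}

def validate_audit(result: dict) -> dict:
    """Check that the model response matches the schema in SYSTEM_PROMPT."""
    missing = REQUIRED_KEYS - result.keys()
    if missing:
        raise ValueError(f"Missing keys in model response: {sorted(missing)}")

    # bool is a subclass of int in Python, so reject True/False explicitly.
    count = result["pallet_count"]
    if isinstance(count, bool) or not isinstance(count, int) or count < 0:
        raise ValueError("pallet_count must be a non-negative integer")

    if not isinstance(result["forklift_present"], bool):
        raise ValueError("forklift_present must be a boolean")

    hazards = result["safety_hazards"]
    if not isinstance(hazards, list) or not all(isinstance(h, str) for h in hazards):
        raise ValueError("safety_hazards must be a list of strings")

    if result["confidence"] not in {"high", "medium", "low"}:
        raise ValueError("confidence must be high, medium, or low")

    return result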
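The retry-or-escalate decision described above can be sketched as a small wrapper. This is one possible shape under stated assumptions, not the author's production code: audit_with_retry, the status strings, and the default of two attempts are all hypothetical. The analyze and validate callables are passed in, so the wrapper can be exercised without a network call.

```python
from typing import Callable


def audit_with_retry(
    image_path: str,
    analyze: Callable[[str], dict],
    validate: Callable[[dict], dict],
    max_attempts: int = 2,
) -> dict:
    """Run analyze + validate up to max_attempts times, then flag for human review."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            audit = validate(analyze(image_path))
            return {"status": "ok", "attempts": attempt, "audit": audit}
        except ValueError as exc:
            # json.JSONDecodeError subclasses ValueError, so malformed JSON lands here too.
            last_error = str(exc)
    return {"status": "needs_human_review", "attempts": max_attempts, "error": last_error}
```

In the script, this would be called as audit_with_retry("warehouse.jpg", analyze_warehouse, validate_audit). Network and authentication errors are deliberately not caught, since retrying those usually calls for a different policy such as backoff.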

Run it

I saved a photo from our loading dock as warehouse.jpg and ran the script. The image shows eight wooden pallets stacked near a loading bay, a yellow forklift parked to the left, and a loose power cable snaking across the concrete. The call returned in under two seconds. Here is the complete entry point and the actual output I received.

if __name__ == "__main__":
    # analyze_warehouse returns a parsed dict, which validate_audit then checks.
    result = analyze_warehouse("warehouse.jpg")
    validated = validate_audit(result)
    print(json.dumps(validated, indent=2))

Example output:

{
  "pallet_count": 8,
  "forklift_present": true,
  "safety_hazards": [
    "loose cable near aisle 3"
  ],
  "notes": "Floor is generally tidy but cable management needs attention.",
  "confidence": "high"
}
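For the batch case mentioned in Step 2, a thread pool is one straightforward way to keep several network-bound requests in flight. The sketch below is an assumption-laden illustration: audit_batch is a hypothetical helper, and the default of four workers is an arbitrary starting point, not a documented Oxlo.ai limit. The analyze argument would be analyze_warehouse in this script.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def audit_batch(
    image_paths: Iterable[str],
    analyze: Callable[[str], dict],
    max_workers: int = 4,
) -> dict:
    """Audit several images concurrently; map each path to its result or its error."""
    paths = list(image_paths)

    def safe_analyze(path: str) -> dict:
        try:
            return {"ok": True, "audit": analyze(path)}
        except Exception as exc:  # keep one bad photo from aborting the batch
            return {"ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its own outcome.
        return dict(zip(paths, pool.map(safe_analyze, paths)))
```

Each entry records either the audit or the error message, so a corrupted photo shows up in the report instead of stopping the run.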

Next steps

This agent works well as a standalone script, but its real value appears when it is wired into a larger system. Two concrete paths forward: deploy it behind a FastAPI endpoint so warehouse staff can upload photos from a mobile browser and receive instant JSON reports, or connect it to a Slack bot that alerts safety officers whenever a hazard list is non-empty. Because Oxlo.ai offers predictable per-request pricing, you can scale either approach without worrying that high-resolution images will explode your inference bill. If you are currently on a token-based provider, the switch amounts to changing the base URL and the API key. See the exact rates at https://oxlo.ai/pricing.
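As a starting point for the alerting path, the message itself can be built by a pure function and handed to whatever chat integration you use. This is a minimal sketch: format_hazard_alert is a hypothetical helper, the wording of the alert is an assumption, and it involves no Slack API calls.

```python
from typing import Optional


def format_hazard_alert(audit: dict, image_name: str) -> Optional[str]:
    """Return a plain-text alert when hazards were found, or None when the list is empty."""
    hazards = audit.get("safety_hazards") or []
    if not hazards:
        return None
    bullet_lines = "\n".join(f"- {hazard}" for hazard in hazards)
    return (
        f"Safety hazards detected in {image_name} "
        f"(confidence: {audit.get('confidence', 'unknown')}):\n{bullet_lines}"
    )
```

Returning None for a clean audit lets the caller skip the notification with a simple truthiness check.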
