Building Speech-to-Text Systems with LLMs

Speech-to-text has moved beyond hidden Markov models and classical acoustic pipelines. Today, developers are building STT systems by combining robust audio transcription models with large language models for post-processing, structured extraction, and conversational understanding. This hybrid approach delivers higher semantic accuracy and flexible output formats, but it also introduces new infrastructure challenges. You need an inference backend that handles both audio transcription and long-context LLM reasoning without unpredictable costs or cold-start latency.

The Two-Stage Architecture of Modern STT

Modern speech-to-text pipelines are typically split into two distinct stages. The first stage is acoustic transcription, where an audio model converts raw waveforms into text. The second stage is semantic refinement, where an LLM corrects transcription errors, adds punctuation, identifies speakers, or extracts structured data into JSON.

Classical ASR systems coupled acoustic and language modeling tightly inside a single decoding pipeline. Modern STT decouples them. The acoustic model handles signal processing, while the LLM handles domain-specific language understanding. This decoupling means you can update your business logic without retraining the audio model, and you can audit each stage independently for debugging or compliance.
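To make the decoupling concrete, here is a minimal sketch of a two-stage pipeline in which each stage is a plain function. The names Transcriber, Refiner, PipelineResult, and run_pipeline are hypothetical, and the two stand-in stages only imitate what an audio model and an LLM would return. The point is that the raw transcript is kept next to the refined output, so each stage can be audited or swapped on its own.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures: each stage is just a function, so either
# one can be replaced or tested without touching the other.
Transcriber = Callable[[bytes], str]
Refiner = Callable[[str], str]


@dataclass
class PipelineResult:
    raw_transcript: str  # output of the acoustic stage, kept for auditing
    refined: str         # output of the semantic stage


def run_pipeline(audio: bytes, transcribe: Transcriber, refine: Refiner) -> PipelineResult:
    """Run the acoustic stage, then the semantic stage, keeping both outputs."""
    raw = transcribe(audio)
    return PipelineResult(raw_transcript=raw, refined=refine(raw))


# Stand-in stages for illustration only. Real ones would call an audio
# model and an LLM respectively.
def fake_transcribe(audio: bytes) -> str:
    return "lets meet tuesday at ten"


def fake_refine(text: str) -> str:
    return text.capitalize() + "."


result = run_pipeline(b"\x00\x01", fake_transcribe, fake_refine)
print(result.raw_transcript)
print(result.refined)
```

Because the stages only share a string, you can change the refinement prompt or model without re-running transcription on stored transcripts.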

The Transcription Layer: Whisper and Beyond

For the acoustic stage, OpenAI's Whisper family remains the de facto standard. Oxlo.ai hosts Whisper Large v3, Whisper Turbo, and Whisper Medium through the audio/transcriptions endpoint. These models cover a range of latency and accuracy requirements, from real-time streaming to high-fidelity archival transcription.

Whisper Large v3 remains the go-to choice for maximum accuracy across dozens of languages. Whisper Turbo trades a small amount of fidelity for significantly lower latency, making it ideal for live captioning. Whisper Medium offers a middle ground when you need to conserve compute but still require reliable output. All three are available on Oxlo.ai with no cold starts, so you can dynamically select the right model per job without worrying about warmup penalties.
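Per-job model selection can be as simple as a lookup. The sketch below is one way to encode the trade-offs described above. Only "whisper-large-v3" appears as a model ID in this article; the other two IDs and the function name pick_transcription_model are assumptions, so check the provider's model list for the exact strings.

```python
# Model IDs other than "whisper-large-v3" are assumed for illustration.
MODEL_BY_PRIORITY = {
    "accuracy": "whisper-large-v3",
    "latency": "whisper-turbo",    # assumed ID
    "balanced": "whisper-medium",  # assumed ID
}


def pick_transcription_model(live: bool, compute_constrained: bool = False) -> str:
    """Choose a transcription model for a job (hypothetical helper)."""
    if live:
        # Live captioning favors low latency over maximum fidelity.
        return MODEL_BY_PRIORITY["latency"]
    if compute_constrained:
        return MODEL_BY_PRIORITY["balanced"]
    # Offline or archival jobs default to the most accurate model.
    return MODEL_BY_PRIORITY["accuracy"]


print(pick_transcription_model(live=True))
print(pick_transcription_model(live=False))
```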

Because Oxlo.ai is fully OpenAI SDK compatible, switching your existing transcription code is a matter of changing the base URL.

import openai

client = openai.OpenAI(
    api_key="YOUR_OXLO_API_KEY",
    base_url="https://api.oxlo.ai/v1"
)

with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="text"
    )
print(transcript)

LLM Augmentation for Semantic Accuracy

Raw transcripts are rarely production-ready. They lack speaker labels, their punctuation and casing can be inconsistent, and they may contain homophone errors or hallucinated filler words. An LLM can fix this in a single pass, and it can simultaneously extract structured information such as action items, dates, or sentiment.

Beyond simple cleanup, LLMs can approximate speaker attribution from text alone by analyzing semantic breaks and pronoun references (a heuristic that does not replace acoustic diarization), or they can normalize domain jargon into a controlled vocabulary. If you operate in a regulated industry, you can route transcripts through DeepSeek R1 671B to perform complex reasoning over compliance checks, or through Qwen 3 32B for multilingual meetings. Oxlo.ai offers more than 45 models across seven categories, including general-purpose flagships like Llama 3.3 70B and long-context options like DeepSeek V4 Flash and Kimi K2.6.

Oxlo.ai's JSON mode constrains the model to emit valid JSON, which makes it easier to keep extracted entities in line with your schema when the prompt spells that schema out, while function calling lets the model trigger downstream tools such as calendar invites or CRM updates. Here is how you might pass a transcript into Llama 3.3 70B for structured extraction:

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "Extract action items as JSON."},
        # With response_format="text", the transcription call returns a plain
        # string, so pass it directly rather than reading a .text attribute.
        {"role": "user", "content": transcript}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
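For the function-calling path, the OpenAI SDK expects tools as JSON-schema descriptions and returns each tool call as a name plus a JSON string of arguments. The sketch below shows a tool definition in that format and a local dispatcher for it. The tool create_calendar_invite, its parameters, and the HANDLERS table are hypothetical, and whether a given hosted model supports tools is something to verify before relying on it.

```python
import json

# Tool description in the OpenAI "tools" format. The tool itself is made up.
CALENDAR_TOOL = {
    "type": "function",
    "function": {
        "name": "create_calendar_invite",
        "description": "Create a calendar invite for a follow-up meeting.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "date": {"type": "string", "description": "ISO 8601 date"},
            },
            "required": ["title", "date"],
        },
    },
}


def create_calendar_invite(title: str, date: str) -> str:
    # Placeholder for a real calendar integration.
    return f"invite:{title}@{date}"


HANDLERS = {"create_calendar_invite": create_calendar_invite}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the local handler for a tool call requested by the model."""
    handler = HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"unknown tool: {name}")
    # The SDK delivers tool arguments as a JSON-encoded string.
    return handler(**json.loads(arguments_json))


print(dispatch_tool_call("create_calendar_invite", '{"title": "Sync", "date": "2025-01-07"}'))
```

In a live call you would pass tools=[CALENDAR_TOOL] to client.chat.completions.create and feed each entry of response.choices[0].message.tool_calls (its function.name and function.arguments) into dispatch_tool_call.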

Handling Long-Form Audio and Context Windows

Call center logs, podcast episodes, and lecture recordings can easily generate transcripts with tens of thousands of tokens. When your LLM stage is billed per token, long transcripts become prohibitively expensive. You are forced to choose between aggressive chunking, which destroys context, and ballooning costs.

When processing long audio, most developers resort to sliding-window chunking with overlap to stay within context limits. This introduces boundary errors where sentences are split across chunks, and it raises your API costs under token-based billing because the overlapping text is billed again in every window. With Oxlo.ai, you can send longer contiguous transcripts in fewer requests. Because the platform charges per request rather than per token, consolidating chunks into a single large prompt does not increase your bill.
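For reference, sliding-window chunking looks roughly like the sketch below. It splits on whitespace-separated words as a stand-in for tokens, and the function name chunk_with_overlap is hypothetical. It also shows where the extra cost comes from: the overlapping words are sent more than once.

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_size=4, overlap=1)
sent = sum(len(c) for c in chunks)
print(chunks)
print(f"{len(words)} words in the transcript, {sent} words sent")
```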

Oxlo.ai uses request-based pricing: one flat cost per API request regardless of prompt length. Unlike token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, and Anyscale, your cost does not scale with input length. A request containing a 100,000-token transcript costs the same as a request with a 100-token greeting. For these long-context workloads, Oxlo.ai's request-based pricing can be 10-100x cheaper than token-based alternatives.
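How large the difference is depends entirely on the actual prices, so it is worth running the arithmetic for your own workload. The sketch below uses placeholder prices that are not quotes from Oxlo.ai or any other provider, and both function names are hypothetical. It also computes the break-even prompt length, below which token-based billing would be the cheaper option.

```python
def token_billed_cost(prompt_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of one request when billing scales with input tokens."""
    return prompt_tokens / 1_000_000 * usd_per_million_tokens


def break_even_tokens(usd_per_request: float, usd_per_million_tokens: float) -> float:
    """Prompt length at which per-token and per-request billing cost the same."""
    return usd_per_request / usd_per_million_tokens * 1_000_000


# Placeholder prices for illustration only, NOT real quotes.
USD_PER_MILLION_TOKENS = 0.50
USD_PER_REQUEST = 0.002

for tokens in (100, 4_000, 100_000):
    per_token = token_billed_cost(tokens, USD_PER_MILLION_TOKENS)
    print(f"{tokens:>7} tokens: per-token ${per_token:.5f}, per-request ${USD_PER_REQUEST:.5f}")

print("break-even tokens:", break_even_tokens(USD_PER_REQUEST, USD_PER_MILLION_TOKENS))
```

Substitute the published prices of the providers you are comparing. The longer your typical transcript is relative to the break-even point, the more a flat per-request price works in your favor.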

For transcripts that exceed typical context limits, models like DeepSeek V4 Flash support a 1 million token context window, while Kimi K2.6 handles 131K tokens with advanced reasoning and vision capabilities. You can process entire long-form audio sessions in a single request without breaking your budget.

Implementing the Pipeline with Oxlo.ai

A complete STT system needs only two endpoints: audio/transcriptions and chat/completions. The following Python example ties both stages together using the OpenAI SDK. It transcribes an audio file and then uses an LLM to generate a structured meeting summary.

import openai
import json

client = openai.OpenAI(
    api_key="YOUR_OXLO_API_KEY",
    base_url="https://api.oxlo.ai/v1"
)

# Stage 1: Transcribe
with open("audio.wav", "rb") as f:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
        response_format="verbose_json"
    )

# Stage 2: Structure
summary = client.chat.completions.create(
    model="qwen-3-32b",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a meeting assistant. Return a JSON object with keys: "
                "summary, action_items, and decisions."
            )
        },
        {"role": "user", "content": transcription.text}
    ],
    response_format={"type": "json_object"}
)

result = json.loads(summary.choices[0].message.content)
print(json.dumps(result, indent=2))
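The last two lines of the example trust the model to return the keys requested in the system prompt. In the OpenAI API, the json_object response format guarantees syntactically valid JSON but not any particular set of keys, so, assuming the same holds here, a small validation step is a sensible guard. The helper name parse_meeting_summary is hypothetical.

```python
import json

# Keys requested in the system prompt of the pipeline example.
REQUIRED_KEYS = {"summary", "action_items", "decisions"}


def parse_meeting_summary(raw: str) -> dict:
    """Parse the model output and check that the expected keys are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


sample = '{"summary": "Weekly sync", "action_items": ["Send notes"], "decisions": []}'
print(parse_meeting_summary(sample)["action_items"])
```

On a ValueError you might retry the request or route the transcript to manual review rather than passing partial data downstream.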

Because the Oxlo.ai API is compatible with the OpenAI SDK, you can reuse existing error handling, retry logic, and streaming parsers. If you need real-time feedback, enable streaming responses on the chat completion stage to deliver partial results as they are generated.
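Consuming a streamed chat completion comes down to iterating over chunks and concatenating their deltas. The sketch below shows that loop with a hypothetical helper, collect_stream, exercised against a fake stream so it runs without network access. The chunk shape (choices[0].delta.content) is what the OpenAI SDK yields when stream=True is set.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_delta=print) -> str:
    """Forward each text delta to on_delta and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None, for example on the final chunk
            parts.append(delta)
            on_delta(delta)
    return "".join(parts)


# Fake chunks that mimic the SDK's streaming objects, for illustration only.
def _fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [_fake_chunk("Action "), _fake_chunk(None), _fake_chunk("items")]
received = []
full_text = collect_stream(fake_stream, on_delta=received.append)
print(full_text)

# With a real client, the stream would come from:
# stream = client.chat.completions.create(
#     model="qwen-3-32b",
#     messages=[{"role": "user", "content": transcription.text}],
#     stream=True,
# )
# full_text = collect_stream(stream)
```

Note that when JSON mode is combined with streaming, the partial text is an incomplete JSON document until the stream ends, so parse it only after collect_stream returns.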

Cost Predictability at Scale

Speech-to-text workloads are unpredictable by nature. A five-minute customer call might yield 500 tokens, while a one-hour technical interview could yield 15,000. Under token-based billing, your monthly bill becomes a function of talk time and verbosity, which is difficult to forecast.

Oxlo.ai flips this model. You pay per request, so your costs scale with the number of audio files or conversations you process, not their length. The Free plan offers 60 requests per day across 16+ free models, which is enough to prototype a pipeline. The Pro plan provides 1,000 requests per day across all models, while Premium offers 5,000 requests per day with priority queue access. For high-volume deployments, the Enterprise plan includes dedicated GPUs and a guaranteed 30% cost reduction compared to your current provider. See https://oxlo.ai/pricing for current plan details.

This predictability makes Oxlo.ai especially attractive for agentic STT systems that chain multiple LLM calls per audio file. Each reasoning step, tool invocation, or validation pass is a single request with a fixed cost.
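With per-request billing, capacity planning reduces to counting calls. The sketch below estimates daily requests for a chained pipeline and compares the total against the daily limits quoted above. It assumes each transcription and each LLM call counts as one request, which you should confirm against the pricing page, and the function names are hypothetical.

```python
# Daily request limits as quoted in this article.
PLAN_DAILY_REQUESTS = {"Free": 60, "Pro": 1000, "Premium": 5000}


def requests_per_day(files_per_day: int, llm_calls_per_file: int,
                     transcription_calls_per_file: int = 1) -> int:
    """Total daily requests, assuming every API call counts as one request."""
    return files_per_day * (transcription_calls_per_file + llm_calls_per_file)


def smallest_plan(daily_requests: int):
    """Return the first listed plan whose daily limit covers the load."""
    # Dicts preserve insertion order, so plans are checked smallest first.
    for name, limit in PLAN_DAILY_REQUESTS.items():
        if daily_requests <= limit:
            return name
    return None  # beyond the listed plans


load = requests_per_day(files_per_day=200, llm_calls_per_file=3)
print(load, smallest_plan(load))
```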

Evaluation and Production Hardening

Shipping an STT pipeline requires more than low cost. You need to measure word error rate, semantic similarity, and end-to-end latency. Oxlo.ai's lack of cold starts on popular models means your p99 latency is stable, which is critical for real-time captioning and voice agents.

Word error rate against a ground-truth validation set measures the acoustic stage, but it says nothing about what the LLM stage changed. To cover that, also measure semantic similarity between the raw transcript and your LLM-refined output. Oxlo.ai offers embedding models such as BGE-Large and E5-Large through the embeddings endpoint, so you can generate vector representations of both versions and compute cosine similarity to flag outputs where the LLM stage has drifted or hallucinated. For latency-sensitive applications, monitor time-to-first-token on streaming responses and set your SLAs against the measured tail latency rather than the average.
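Both metrics are short enough to implement directly. The sketch below computes word error rate as word-level edit distance divided by reference length, and cosine similarity over two vectors. It uses simple lowercasing and whitespace splitting as the normalization, which is an assumption, and the vectors would in practice come from the embeddings endpoint rather than being written by hand.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (sub/ins/del) divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the ref prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    return dot / norm


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

A refined transcript whose similarity to the raw one falls below a threshold you choose from your validation set is a good candidate for manual review.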

For bidirectional voice applications, you can pair Oxlo.ai's transcription models with Kokoro 82M text-to-speech on the audio/speech endpoint to build complete voice bots. Use multi-turn conversations to maintain state across an interaction, and leverage vision models such as Gemma 3 27B or Kimi VL A3B if your inputs eventually include video or screen sharing alongside audio.

Putting It Together

Building a modern speech-to-text system means orchestrating audio transcription with long-context LLM reasoning. The architecture is straightforward, but your choice of inference provider determines whether the system is affordable and responsive at scale.

Oxlo.ai provides the models, the OpenAI-compatible endpoints, and the request-based pricing structure that STT pipelines need. With Whisper variants for transcription, DeepSeek and Kimi models for long-context reasoning, and flat per-request pricing that insulates you from transcript length, Oxlo.ai is a natural backend for both prototype and production speech systems. Start with the Free tier, point your SDK to https://api.oxlo.ai/v1, and build your pipeline today.
