
Speech synthesis is undergoing a fundamental architectural shift. For decades, text-to-speech systems relied on concatenative sample libraries, statistical parametric models, or deterministic neural pipelines that mapped text to mel-spectrograms. Today, large language models are beginning to treat speech not as a waveform to be reconstructed, but as a sequence of discrete tokens to be predicted. This change redefines latency trade-offs, context handling, and cost structures for developers building voice applications. Instead of cascading separate acoustic, duration, and vocoder models, a single transformer can model text and audio in a shared latent space, opening the door to zero-shot voice cloning, emotive prosody, and conversational turn-taking that were difficult to engineer into classical TTS pipelines.
From Spectrograms to Discrete Tokens
Classical neural TTS followed a well-established pattern. A text encoder processed phoneme or character sequences, an attention mechanism or duration model aligned them with time, a decoder produced mel-spectrogram frames, and a vocoder such as HiFi-GAN converted those mel-spectrograms into raw audio. Systems like Tacotron 2 and FastSpeech refined this pipeline, but each component remained specialized and brittle. Prosodic variation required explicit conditioning, and speaker adaptation demanded fine-tuning or separate speaker embeddings.
LLM-driven synthesis drops the mel-spectrogram as the intermediate representation. Modern approaches use neural audio codecs, such as SoundStream, EnCodec, or SNAC, to compress waveforms into hierarchical discrete tokens. A language model then performs next-token prediction over these audio codes, conditioned on text embeddings or other audio context. Research systems such as SpiritLM have shown that when a language model is trained on interleaved text and speech tokens, it learns duration, intonation, and co-articulation implicitly, without explicit supervision. Related systems such as Voicebox and Audiobox reach similar in-context behavior with non-autoregressive flow matching rather than next-token prediction. The result is a unified generative model in which changing the speaking style can be as simple as prefixing the context with a different audio prompt.
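To make "hierarchical discrete tokens" concrete, here is a minimal sketch of residual vector quantization, the scheme that codecs such as SoundStream and EnCodec build on. It is a toy under loud assumptions: the codebooks are random rather than learned, the "frame" is a four-dimensional list rather than a real encoder output, and the names `rvq_encode` and `rvq_decode` are illustrative, not taken from any library.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(frame, codebooks):
    """Quantize one latent frame into one code per codebook.

    Each stage quantizes the residual left by the previous stage,
    so early codes are coarse and later codes add detail.
    """
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the frame by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup: random (untrained) codebooks, purely illustrative.
rng = random.Random(0)
DIM, ENTRIES, STAGES = 4, 16, 3
codebooks = [
    [[rng.gauss(0, 1.0 / (stage + 1)) for _ in range(DIM)] for _ in range(ENTRIES)]
    for stage in range(STAGES)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)   # one integer per codebook
approx = rvq_decode(codes, codebooks)  # lossy reconstruction of the frame
print(codes, approx)
```

Each frame of audio therefore becomes a short stack of integers, one per codebook, and it is these integers that the language model predicts.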
The Mechanics of Audio Token Prediction
The process begins with tokenization. A raw audio waveform is passed through a codec encoder running at a fixed frame rate, often between 25 Hz and 75 Hz, producing a sequence of vector-quantized codes. Because audio is dense, even a short utterance generates hundreds to thousands of tokens. A ten-second clip at 50 Hz with a single codebook already yields 500 tokens. Many high-fidelity systems use multiple residual codebooks; when every code is flattened into the sequence, four codebooks turn those 500 tokens into 2,000, and deeper codebook stacks expand it further.
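The arithmetic above can be checked with a few lines of Python. This is a sketch that assumes every codebook entry occupies its own sequence position (a flattened layout); architectures that predict codebooks in parallel keep the sequence length closer to the frame count. The function name `audio_token_count` is illustrative.

```python
def audio_token_count(duration_s: float, frame_rate_hz: float, num_codebooks: int = 1) -> int:
    """Tokens produced by a codec, assuming a flattened multi-codebook layout."""
    frames = round(duration_s * frame_rate_hz)
    return frames * num_codebooks


print(audio_token_count(10, 50))      # 500: ten seconds, 50 Hz, one codebook
print(audio_token_count(10, 50, 4))   # 2000: same clip with four codebooks
print(audio_token_count(10, 75, 8))   # 6000: 75 Hz with eight codebooks
```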
The LLM attends over a concatenated context of text tokens and audio tokens. During training, the model sees interleaved sequences of transcript text and audio codes, learning to predict the next audio token given the prior text and audio. Autoregressive decoding often yields the most natural output, but it is inherently sequential, so latency grows with the length of the generated audio. To reduce that latency, researchers have explored non-autoregressive diffusion models, masked prediction, and speculative decoding. What matters for infrastructure is that the context window must accommodate not only the textual prompt, but also every audio token in the history of a conversation. For a voice agent maintaining a multi-turn dialogue, the effective prompt length can quickly exceed the context windows of smaller models, pushing developers toward larger, more expensive inference targets.
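One common way to put text and audio into a single context is to give audio codes their own id range after the text vocabulary. The sketch below shows that layout only; the vocabulary size, codebook size, and boundary markers are all hypothetical values chosen for illustration, and real systems differ in how they order or delay codebooks.

```python
TEXT_VOCAB_SIZE = 32_000  # hypothetical text vocabulary size
CODEBOOK_SIZE = 1_024     # hypothetical entries per audio codebook
NUM_CODEBOOKS = 2         # hypothetical number of residual codebooks

# Hypothetical boundary markers, placed after all text and audio ids.
AUDIO_START = TEXT_VOCAB_SIZE + NUM_CODEBOOKS * CODEBOOK_SIZE
AUDIO_END = AUDIO_START + 1


def audio_token_id(codebook_index: int, code: int) -> int:
    """Map (codebook, code) to a unique id in the shared vocabulary."""
    return TEXT_VOCAB_SIZE + codebook_index * CODEBOOK_SIZE + code


def flatten_frames(frames):
    """Flatten per-frame code lists [c0, c1, ...] into ids, frame by frame."""
    return [
        audio_token_id(cb, code)
        for frame in frames
        for cb, code in enumerate(frame)
    ]


def build_sequence(turns):
    """Concatenate (text_ids, audio_frames) turns into one flat sequence."""
    seq = []
    for text_ids, frames in turns:
        seq.extend(text_ids)
        seq.append(AUDIO_START)
        seq.extend(flatten_frames(frames))
        seq.append(AUDIO_END)
    return seq


# One turn: three text tokens followed by two audio frames of two codes each.
turns = [([5, 17, 42], [[3, 900], [7, 12]])]
seq = build_sequence(turns)
print(seq)
# [5, 17, 42, 34048, 32003, 33924, 32007, 33036, 34049]
```

Every additional turn appends to the same sequence, which is why conversation history translates directly into context length.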
Context Windows and the Cost of Audio
Audio token sequences are among the longest inputs in modern multimodal AI. A five-minute customer-service conversation, when tokenized as audio, can consume tens of thousands of tokens before the LLM even begins generating a response. Under token-based pricing schemes, input cost grows in proportion to every second of audio context. For agentic voice workflows that retain full acoustic history across multiple turns, the history is resent on each turn, so total spend grows faster than conversation length and bills become hard to predict.
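To see how quickly retained audio history can outgrow a context window, the following sketch accumulates audio tokens turn by turn and reports the first turn that no longer fits. The frame rate, codebook count, turn lengths, and window sizes are illustrative assumptions, and `first_overflowing_turn` is a hypothetical helper name.

```python
def first_overflowing_turn(turn_durations_s, frame_rate_hz, num_codebooks, context_window):
    """Return the 1-based turn at which cumulative audio tokens exceed the window.

    Returns None if the whole conversation fits.
    """
    total_tokens = 0
    for turn_index, duration_s in enumerate(turn_durations_s, start=1):
        total_tokens += round(duration_s * frame_rate_hz) * num_codebooks
        if total_tokens > context_window:
            return turn_index
    return None


# A five-minute call modeled as 20 turns of 15 seconds each.
durations = [15] * 20

print(first_overflowing_turn(durations, 50, 1, 8_192))    # 11
print(first_overflowing_turn(durations, 50, 4, 32_768))   # 11
print(first_overflowing_turn(durations, 50, 1, 32_768))   # None
```

With a single codebook the full call is 15,000 tokens; with four flattened codebooks it is 60,000, which matches the "tens of thousands" range described above.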
Oxlo.ai approaches this differently. As a developer-first inference platform, Oxlo.ai uses request-based pricing: one flat cost per API request regardless of prompt length. For long-context speech synthesis and agentic audio workloads, this can be 10-100x cheaper than token-based alternatives because cost does not scale with input length. Whether you are sending a short TTS prompt or a multi-turn voice conversation with thousands of audio tokens, the price per request remains flat. Developers can explore complex audio LLM prototypes without worrying that every additional second of context will inflate the invoice. For detailed plan information, see https://oxlo.ai/pricing.
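The difference in how the two pricing shapes scale can be sketched numerically. The prices below are made-up placeholders, not Oxlo.ai's or any other provider's actual rates, so the output shows only the growth pattern (token-priced input grows roughly quadratically with turns when history is resent, request-priced cost grows linearly), not a real price comparison.

```python
def cumulative_input_tokens(tokens_per_turn: int, num_turns: int) -> int:
    """Total input tokens when every turn resends the full history so far."""
    return sum(tokens_per_turn * turn for turn in range(1, num_turns + 1))


def token_priced_total(tokens_per_turn, num_turns, usd_per_1k_tokens):
    """Conversation cost under per-token input pricing."""
    return cumulative_input_tokens(tokens_per_turn, num_turns) / 1000 * usd_per_1k_tokens


def request_priced_total(num_turns, usd_per_request, requests_per_turn=1):
    """Conversation cost under flat per-request pricing."""
    return num_turns * requests_per_turn * usd_per_request


# Placeholder prices for illustration only (NOT real rates).
USD_PER_1K_TOKENS = 0.001
USD_PER_REQUEST = 0.005
TOKENS_PER_TURN = 750  # 15 seconds of audio at 50 Hz, one codebook

for turns in (5, 10, 20):
    by_token = token_priced_total(TOKENS_PER_TURN, turns, USD_PER_1K_TOKENS)
    by_request = request_priced_total(turns, USD_PER_REQUEST)
    print(f"{turns:>2} turns: token-priced ${by_token:.4f}, request-priced ${by_request:.4f}")
```

Which model is cheaper for a given workload depends entirely on the real rates and on how much history each request carries, so the published pricing should be substituted before drawing conclusions.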
Inference Infrastructure for Real-Time Voice
Conversational agents demand low time-to-first-token (TTFT) and streaming delivery. A user speaking to a voice assistant expects sub-second response latency, which means the inference stack cannot tolerate cold starts or queue stalls. Oxlo.ai offers streaming responses and no cold starts on popular models, giving voice applications consistent TTFT. This matters for both current pipeline architectures and future end-to-end audio LLMs.
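Time-to-first-token is straightforward to measure on any streaming response. The helper below is a provider-agnostic sketch: `measure_ttft` is a hypothetical name, it takes a callable that starts the stream so that request setup is included in the timing, and the demo uses a fake generator rather than a real API call.

```python
import time


def measure_ttft(make_stream, clock=time.perf_counter):
    """Start a stream and time its first chunk and its completion.

    make_stream: zero-argument callable returning an iterable of chunks.
    Returns (seconds_to_first_chunk or None, total_seconds, chunks).
    """
    start = clock()
    first_chunk_s = None
    chunks = []
    for chunk in make_stream():
        if first_chunk_s is None:
            first_chunk_s = clock() - start
        chunks.append(chunk)
    total_s = clock() - start
    return first_chunk_s, total_s, chunks


def fake_stream(num_chunks=3, delay_s=0.01):
    """Stand-in for a streaming response; sleeps to simulate generation."""
    for i in range(num_chunks):
        time.sleep(delay_s)
        yield f"chunk-{i}"


first_chunk_s, total_s, chunks = measure_ttft(fake_stream)
print(f"first chunk after {first_chunk_s:.3f}s, done after {total_s:.3f}s, {len(chunks)} chunks")
```

With the OpenAI SDK, the callable could wrap a chat completion created with `stream=True`; running the measurement repeatedly and looking at the spread, not just the mean, is what reveals cold starts and queue stalls.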
While the industry transitions toward native speech models, production workloads today still rely on robust ASR and TTS endpoints. Oxlo.ai provides both: Whisper Large v3, Turbo, and Medium for transcription, and Kokoro 82M for fast, high-quality text-to-speech. These models are accessible through the same OpenAI-compatible API that powers the platform’s chat and reasoning models, so a voice agent can be built without managing multiple providers. The base URL is https://api.oxlo.ai/v1, and the platform is compatible with the OpenAI SDKs for Python and Node.js, as well as plain cURL requests.
Building Voice Pipelines with Oxlo.ai
A practical voice agent typically chains three operations: transcribe user audio, generate a textual response, and synthesize reply speech. Because Oxlo.ai exposes audio transcriptions, chat completions, and audio speech endpoints through a single OpenAI-compatible schema, you can implement this pipeline in a few lines of Python.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

# 1. Transcribe incoming audio
with open("user_message.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
    )

# 2. Generate a response with an LLM
chat = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Synthesize speech with Kokoro
speech = client.audio.speech.create(
    model="kokoro-82m",
    voice="default",  # choose from available voices
    input=chat.choices[0].message.content,
)

with open("reply.wav", "wb") as f:
    f.write(speech.content)
```
This example uses Llama 3.3 70B for reasoning, but you can substitute Qwen 3 32B for multilingual agent workflows, DeepSeek R1 671B MoE for complex reasoning, or Kimi K2.6 for advanced coding and vision tasks. All models share the same authentication and endpoint structure. Oxlo.ai’s request-based pricing means that even if you pass a long system prompt or maintain a lengthy conversation history to preserve conversational context, the cost per turn stays predictable. The Free plan offers 60 requests per day and access to 16+ free models, including a 7-day full-access trial, while Pro and Premium plans provide 1,000 and 5,000 requests per day respectively, with priority queue access at the Premium tier.
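For a multi-turn agent, the same three calls can be wrapped in a function that keeps the message history between turns. The sketch below is one possible arrangement, not a prescribed pattern: `run_turn` and `make_fake_client` are hypothetical helpers, the model identifiers are the ones used in the example above, and the demo runs against an offline stand-in that mimics only the three SDK calls involved, so it can be executed without credentials.

```python
import os
import tempfile
from types import SimpleNamespace


def run_turn(client, history, audio_path, out_path,
             asr_model="whisper-large-v3", chat_model="llama-3.3-70b",
             tts_model="kokoro-82m", voice="default"):
    """Transcribe one user utterance, reply with full history, synthesize the reply."""
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model=asr_model, file=f)
    history.append({"role": "user", "content": transcript.text})

    chat = client.chat.completions.create(model=chat_model, messages=history)
    reply = chat.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    speech = client.audio.speech.create(model=tts_model, voice=voice, input=reply)
    with open(out_path, "wb") as f:
        f.write(speech.content)
    return reply


def make_fake_client():
    """Offline stand-in (assumption): mimics only the three calls used above."""
    def transcribe(model, file):
        return SimpleNamespace(text="hello")

    def complete(model, messages):
        user_turns = sum(m["role"] == "user" for m in messages)
        message = SimpleNamespace(content=f"reply #{user_turns}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    def speak(model, voice, input):
        return SimpleNamespace(content=input.encode("utf-8"))

    return SimpleNamespace(
        audio=SimpleNamespace(
            transcriptions=SimpleNamespace(create=transcribe),
            speech=SimpleNamespace(create=speak),
        ),
        chat=SimpleNamespace(completions=SimpleNamespace(create=complete)),
    )


history = [{"role": "system", "content": "You are a concise voice assistant."}]
with tempfile.TemporaryDirectory() as tmp:
    wav_in = os.path.join(tmp, "user_message.wav")
    with open(wav_in, "wb") as f:
        f.write(b"placeholder bytes, not real audio")
    fake = make_fake_client()
    first_reply = run_turn(fake, history, wav_in, os.path.join(tmp, "reply1.wav"))
    second_reply = run_turn(fake, history, wav_in, os.path.join(tmp, "reply2.wav"))

print(first_reply, second_reply, len(history))
```

Passing a real `OpenAI` client configured with the Oxlo.ai base URL in place of the fake would send the growing `history` list on every turn, which is the long-context pattern discussed above.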
The Road to End-to-End Speech LLMs
Today’s best practice is a cascade: ASR, then LLM, then TTS. Each stage introduces latency and error accumulation. End-to-end speech LLMs promise to collapse this stack into a single model that accepts audio tokens and emits audio tokens, reasoning over text and sound in a unified representation. Early research systems have shown that such models can handle paralinguistic signals, including laughter, sighs, and interruptions, without explicit classifiers. They can also perform zero-shot voice conversion and emotional style transfer by manipulating tokens in context.
However, end-to-end audio models place extreme pressure on inference infrastructure. They require massive context windows, efficient attention over long audio sequences, and low-latency streaming decoding. Oxlo.ai is architected for this trajectory. The platform’s flat per-request pricing removes the financial penalty for long audio contexts, and its OpenAI SDK compatibility means that when new audio-native models become available, they can be dropped into existing applications with minimal code changes. With 45+ models across LLMs, code, vision, audio, and embeddings, Oxlo.ai lets developers experiment across modalities without switching providers.
Evaluating Quality and Latency
Benchmarking LLM speech synthesis requires looking beyond traditional mean opinion scores. Objective metrics include character error rate for intelligibility (typically computed by transcribing the synthesized audio with an ASR model and comparing the result against the input text), speaker-embedding cosine similarity for voice consistency, and real-time factor for generation speed. Subjective evaluation should probe conversational naturalness, appropriate prosody, and handling of rare proper nouns. Autoregressive audio LLMs often score higher on naturalness but incur higher latency than frame-based neural TTS, though parallel decoding techniques are narrowing this gap.
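The three objective metrics named above have simple definitions, sketched here in plain Python. The sketch assumes the ASR transcript and the speaker embeddings have already been produced by models of your choice; it only shows the final computations, and the function names are illustrative.

```python
import math


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (character level)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means audio is generated faster than it plays back."""
    return generation_seconds / audio_seconds


print(character_error_rate("hello world", "hello word"))  # one deletion over 11 chars
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))          # 1.0
print(real_time_factor(0.5, 2.0))                         # 0.25
```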
For production deployments, consistency matters as much as peak quality. Oxlo.ai guarantees no cold starts on popular models, which means voice agents avoid the erratic latency spikes that break user trust. Streaming responses allow audio playback to begin before the full sequence is generated, a critical feature for interactive systems. As the field moves from research demos to production voice agents, these infrastructure properties separate experimental prototypes from reliable services.
The transition from pipeline-based speech synthesis to unified LLM architectures is already underway in research and early production systems. For developers, the priority is to build on infrastructure that supports both today’s cascaded audio stacks and tomorrow’s token-based audio models. Oxlo.ai offers a developer-first platform with request-based pricing, OpenAI SDK compatibility, and dedicated audio endpoints for transcription and speech. Whether you are shipping a voice agent today or prototyping an end-to-end audio LLM, Oxlo.ai provides the predictable costs and low-latency inference required for production voice workloads. Start building at https://oxlo.ai/pricing.

