
Integrating LLM into Chatbots: A Step-by-Step Guide


Building a production chatbot around a large language model has shifted from a research problem to an engineering integration task. The core challenge is no longer training a model from scratch or fine-tuning one. Instead, you need to wire a capable foundation model into a reliable backend, manage conversation memory across turns, expose tools for agentic behavior, and keep costs predictable as user sessions grow. This guide walks through a practical, code-first approach to integrating an LLM into a chatbot. We will cover architecture, session state, function calling, deployment patterns, and cost control, with concrete examples you can run behind FastAPI, Express, or any stateless HTTP framework.

Architecture Overview

A maintainable chatbot splits into four distinct layers. The presentation layer is whatever interface your users touch: a web widget, a mobile app, or a messaging platform like Slack or Discord. The orchestration layer handles authentication, rate limiting, request validation, and routing. The memory layer persists conversation history and any user-specific context so the model can reference prior turns. Finally, the LLM backend generates responses, streams them to the client, and optionally emits tool calls.

You should keep your LLM client behind a thin abstraction or adapter. This insulates your business logic from provider-specific SDKs and lets you swap between general-purpose, reasoning, or coding models without rewriting prompts. Oxlo.ai fits naturally here because it is fully OpenAI-compatible. You can route requests to Llama 3.3 70B for general chat, DeepSeek R1 671B MoE for deep reasoning, Qwen 3 32B for multilingual agent workflows, or Kimi K2.6 for vision and coding tasks, all through identical SDK calls.
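
One way to sketch the adapter described above is shown here. The `ChatBackend` protocol, `ChatService` class, `ROUTES` table, and `EchoBackend` stand-in are hypothetical names introduced for illustration, and the reasoning and coding model IDs are placeholders rather than confirmed Oxlo.ai identifiers.

```python
from typing import Protocol

Message = dict[str, str]


class ChatBackend(Protocol):
    """Anything that can turn a message list into a reply string."""

    def complete(self, messages: list[Message], model: str) -> str: ...


# Hypothetical task-to-model table; replace the IDs with the ones your provider lists.
ROUTES = {
    "chat": "llama-3.3-70b",
    "reasoning": "example-reasoning-model",
    "coding": "example-coding-model",
}


class ChatService:
    """Business logic depends on this class, never on a provider SDK."""

    def __init__(self, backend: ChatBackend, routes: dict[str, str] = ROUTES):
        self._backend = backend
        self._routes = routes

    def reply(self, messages: list[Message], task: str = "chat") -> str:
        # Unknown tasks fall back to the general chat model.
        model = self._routes.get(task, self._routes["chat"])
        return self._backend.complete(messages, model)


class EchoBackend:
    """Stand-in backend for local testing; a real one would wrap the SDK client."""

    def complete(self, messages: list[Message], model: str) -> str:
        return f"[{model}] {messages[-1]['content']}"


service = ChatService(EchoBackend())
answer = service.reply([{"role": "user", "content": "hi"}], task="coding")
```

Swapping providers or models then means writing another class with a `complete` method or editing the routing table, while prompts and handlers stay untouched.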

Choosing a Model Provider

Your provider choice shapes latency, model availability, and pricing structure. Most inference platforms bill per token. Under token-based pricing, every system prompt, retrieval-augmented document, and prior conversation turn increases cost. For chatbots, this is a structural problem. Each new user message resends the entire history, so the meter spins faster as sessions lengthen.

Oxlo.ai uses request-based pricing. You pay one flat cost per API request regardless of prompt length. For chatbots that carry long histories or run agentic loops, this can be significantly cheaper than token-based alternatives. The platform offers 45+ open-source and proprietary models across seven categories. General-purpose chat is well served by Llama 3.3 70B or GLM 5. Complex reasoning benefits from DeepSeek R1 671B MoE, Kimi K2 Thinking, or DeepSeek V4 Flash with its one-million-token context. Coding assistants can use Qwen 3 Coder 30B, MiniMax M2.5, or DeepSeek V3.2. Vision and audio models are also available if your chatbot needs to process images or voice.

Because Oxlo.ai is fully OpenAI SDK compatible, you can use the official Python or Node.js client libraries by changing only the base URL. There are no cold starts on popular models, so your chatbot responds immediately even after idle periods, which is critical for user experience.

Setting Up the API Client

The client setup requires a single configuration change. Instantiate the OpenAI SDK with the Oxlo.ai base URL and your API key, then begin streaming completions.

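import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    # Placeholder; load the real key from your environment or a secret store
    api_key="YOUR_OXLO_API_KEY",
)

def chat_turn(messages, model="llama-3.3-70b"):
    """Send the message history and return the complete assistant reply."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    parts = []
    for chunk in response:
        # Some chunks may arrive without choices; skip them
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)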

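Streaming is essential for chatbots. It reduces perceived latency and lets you render tokens as they arrive rather than waiting for the entire response. The function above accepts a message history list, consumes the stream, and returns the complete assistant reply, which you then append back into the session store. To render tokens as they arrive, forward each chunk to the client inside the loop instead of only accumulating it. If you need structured output, you can add response_format={"type": "json_object"} to request JSON mode, which Oxlo.ai supports alongside standard text generation. With the OpenAI API, JSON mode also expects the prompt itself to ask for JSON, so say so in the system message.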
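
To forward tokens as they arrive, the accumulating loop can be turned into a generator. This is a minimal sketch that assumes chunks shaped like the OpenAI SDK's streaming chunks. `iter_tokens` and `fake_chunk` are illustrative names, not part of any SDK, and the `SimpleNamespace` objects only stand in for real chunks so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_tokens(chunks: Iterable) -> Iterator[str]:
    """Yield text fragments from a stream of chat-completion chunks."""
    for chunk in chunks:
        # Skip chunks that carry no choices or an empty delta.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            yield content


def fake_chunk(text):
    """Build a stand-in chunk (hypothetical helper for this sketch)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [fake_chunk("Hel"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("lo")]

parts = []
for token in iter_tokens(stream):
    parts.append(token)  # in a real handler, send each token to the client here
reply = "".join(parts)
```

With the real client, you would pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `stream`, and still join the parts at the end so the full reply can be stored in the session history.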

Managing Conversation State

LLMs are stateless: the model remembers nothing between requests. Every request must include the full conversation history, system instructions, and any retrieved context. In production you will persist this in Redis, Postgres, or a similar external store, but the in-memory pattern below translates directly.

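# session_id -> list of messages; process-local, so it is lost on restart
sessions = {}

def get_history(session_id):
    # Create the history with a system prompt on first access
    return sessions.setdefault(session_id, [
        {"role": "system", "content": "You are a helpful assistant."}
    ])

def run_turn(session_id, user_message):
    history = get_history(session_id)
    history.append({"role": "user", "content": user_message})
    try:
        assistant_reply = chat_turn(history)
    except Exception:
        # Roll back the user message so a retry does not duplicate it
        history.pop()
        raise
    history.append({"role": "assistant", "content": assistant_reply})
    return assistant_reply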

For long sessions, you must manage context window limits. Implement a sliding window that drops the oldest user-assistant pairs while keeping the system prompt, or use a summarization step where a smaller model condenses early turns into a single system message. On Oxlo.ai, you can route summarization to fast, efficient models like DeepSeek V3.2 or Qwen 3 32B, while reserving larger models like GPT-OSS 120B or Kimi K2.6 for the main user-facing generation. This tiered approach keeps latency low and costs flat.
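
A sliding window can be sketched as follows. `trim_history` and its `max_turns` parameter are hypothetical names, and the sketch makes one simplifying assumption: all system messages are moved to the front and always kept. It counts turns by user messages, so a pending user message without a reply still counts as a turn.

```python
def trim_history(history, max_turns):
    """Return a copy with system messages plus the last max_turns user turns."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Positions of user messages mark where each turn starts.
    user_positions = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_positions) <= max_turns:
        return system + rest
    # Cut at the start of the oldest turn we keep, so the window never
    # begins with an orphaned assistant or tool message.
    cut = user_positions[-max_turns]
    return system + rest[cut:]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
trimmed = trim_history(history, max_turns=2)
```

Because the function returns a new list, the stored history can stay complete for auditing while only the trimmed copy is sent to the model.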

Handling Tool Use and Function Calling

Production chatbots rarely stop at text generation. They query APIs, execute code, search databases, or call internal microservices. Oxlo.ai supports function calling and tool use through the standard OpenAI schema, so you can define tools as JSON and let the model decide when to invoke them.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=history,
    tools=tools,
    tool_choice="auto",
)

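message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        # tool_call.function.name identifies the tool to run and
        # tool_call.function.arguments holds its arguments as a JSON string.
        # Execute the function, then append the result to history as a
        # message with role "tool" and tool_call_id set to tool_call.id
        pass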

The flow is straightforward. Send the conversation and tool definitions. If the model returns one or more tool calls, append the assistant message that contains them to the history, execute each call on your server, append each result as a new message with role tool and the corresponding tool_call_id, and send the updated history back to the LLM for the final answer. Repeat until the model replies without requesting a tool. This loop works identically across Oxlo.ai models that expose tool-use capabilities, including agentic models like MiniMax M2.5 and GLM 5.
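
The execution step of that loop might look like the sketch below. `run_tool_calls`, `TOOL_REGISTRY`, and the body of `get_weather` are hypothetical, and the `SimpleNamespace` object only mimics the shape of an SDK tool call (`id`, `function.name`, `function.arguments`) so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> dict:
    """Hypothetical tool body; a real one would call a weather API."""
    return {"location": location, "forecast": "unknown"}


# Maps tool names from the schema to the Python callables that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and return the role=tool messages to append."""
    results = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            payload = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON-encoded string.
                args = json.loads(call.function.arguments)
                payload = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                # Report bad arguments to the model instead of crashing the turn.
                payload = {"error": str(exc)}
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(payload),
        })
    return results


fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
tool_messages = run_tool_calls([fake_call])
```

Returning errors as tool results, rather than raising, gives the model a chance to correct its arguments on the next pass. Only tools listed in the registry can run, which keeps the model from invoking arbitrary functions.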

Deploying and Scaling

Package the logic in a stateless server. A FastAPI or Express instance should treat each request as independent and fetch session history from an external store. This design lets you scale horizontally behind a load balancer without worrying about sticky sessions or in-memory replication.
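
A stateless handler could be sketched like this. `InMemoryStore` is a stand-in for Redis or Postgres that serializes to JSON the way an external store would. `handle_message` and the `generate` callable are hypothetical names: `generate` represents whatever function calls the LLM, and the lambda used here is a fake so the example runs offline.

```python
import json
from typing import Callable

Message = dict[str, str]
SYSTEM_PROMPT = {"role": "system", "content": "You are a helpful assistant."}


class InMemoryStore:
    """Stand-in for an external store; values are kept as JSON strings."""

    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def load(self, session_id: str) -> list[Message]:
        raw = self._rows.get(session_id)
        # New sessions start with just the system prompt.
        return json.loads(raw) if raw is not None else [dict(SYSTEM_PROMPT)]

    def save(self, session_id: str, history: list[Message]) -> None:
        self._rows[session_id] = json.dumps(history)


def handle_message(store, generate: Callable[[list[Message]], str],
                   session_id: str, user_message: str) -> str:
    """Process one request; all state lives in the store, none in the worker."""
    history = store.load(session_id)
    history.append({"role": "user", "content": user_message})
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    # Save only after a successful reply so failed turns leave no partial state.
    store.save(session_id, history)
    return reply


store = InMemoryStore()
fake_generate = lambda history: "reply %d" % sum(1 for m in history if m["role"] == "user")
first = handle_message(store, fake_generate, "s1", "hi")
second = handle_message(store, fake_generate, "s1", "again")
```

Any worker can serve any request because the history is reloaded each time. Two simultaneous requests for the same session can still overwrite each other, so a real store would likely need a per-session lock or a version check.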

Use async route handlers so that waiting on the LLM does not block your worker threads. If you use Python, openai.AsyncOpenAI pairs cleanly with FastAPI path operations. Set reasonable timeouts and implement retry logic with exponential backoff for transient failures.
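
Retry with exponential backoff can be sketched with the standard library alone. `with_retries` and its parameters are hypothetical, the set of exceptions treated as transient is an assumption to adapt to your client library, and the `flaky` coroutine only simulates a failing call.

```python
import asyncio
import random

# Assumed set of transient errors; adjust to the exceptions your client raises.
TRANSIENT = (asyncio.TimeoutError, TimeoutError, ConnectionError)


async def with_retries(call, *, attempts=4, timeout=30.0, base_delay=0.5,
                       max_delay=8.0, retry_on=TRANSIENT, sleep=asyncio.sleep):
    """Await call() with a per-attempt timeout, retrying transient failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the delay each attempt, cap it, and add jitter so
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await sleep(delay * random.uniform(0.5, 1.0))


calls = {"count": 0}
delays = []


async def flaky():
    """Fails twice with a transient error, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient")
    return "ok"


async def record_sleep(seconds):
    delays.append(seconds)  # record instead of waiting, to keep the demo fast


result = asyncio.run(with_retries(flaky, sleep=record_sleep))
```

The OpenAI Python SDK also accepts its own `max_retries` and `timeout` client options. If you rely on those, avoid stacking both layers, or the effective retry count multiplies.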

Latency matters in conversational interfaces. Oxlo.ai eliminates cold starts on popular models, so your first request after a quiet period is as fast as any other. That consistency makes it easier to meet latency SLAs without over-provisioning dedicated capacity.

Cost Optimization Strategies

Chatbots are structurally expensive under token-based billing. Every turn resends the system prompt, any few-shot examples, the full conversation history, and retrieved documents. If messages are of similar size, the input token count of each request grows roughly linearly with the number of turns, so the cumulative input billed over a session grows roughly quadratically. The longer the chat, the more you pay for context you have already transmitted.
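
A short calculation makes the growth concrete. The function name and every number here are illustrative assumptions (a 200-token system prompt and 100 tokens per message), not measurements of any real workload or price list.

```python
def cumulative_input_tokens(turns, system_tokens, tokens_per_message):
    """Total input tokens sent over a session when each request resends history."""
    total = 0
    for turn in range(1, turns + 1):
        # Request number `turn` carries the system prompt, `turn` user
        # messages, and `turn - 1` earlier assistant replies.
        messages_sent = 2 * turn - 1
        total += system_tokens + messages_sent * tokens_per_message
    return total


ten_turns = cumulative_input_tokens(10, system_tokens=200, tokens_per_message=100)
twenty_turns = cumulative_input_tokens(20, system_tokens=200, tokens_per_message=100)
```

Under these assumptions a 10-turn session sends 12,000 input tokens in total and a 20-turn session sends 44,000, so doubling the session length more than triples the billed input.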

Oxlo.ai changes the equation with flat per-request pricing. A request costs the same whether it contains a short greeting or thousands of tokens of history and retrieval context. That predictability makes budgeting simple and can be far cheaper for long-context and agentic workloads. You can also tier your models by task. Route simple intents and summarization to fast, low-cost-per-request models like DeepSeek V3.2 or Qwen 3 32B. Escalate complex reasoning, coding, or deep analysis to DeepSeek R1 671B MoE, Kimi K2 Thinking, or DeepSeek V4 Flash. Because the price is per request, you know the cost of each request before you send it. A user turn that triggers tool calls spans several requests, so budget for the number of round trips rather than the length of the prompt. For exact plan details, see the Oxlo.ai pricing page.

Conclusion

Integrating an LLM into a chatbot is now a matter of clean architecture and the right provider. Abstract your client, manage state explicitly, handle tool calls through standard schemas, and choose a backend that keeps costs predictable as conversations lengthen. Oxlo.ai offers the breadth of models, OpenAI-compatible SDK, and request-based pricing that make it a genuinely strong fit for production chatbots. Start with the patterns above, point your client to https://api.oxlo.ai/v1, and iterate on behavior rather than infrastructure.
