Building Test Preparation Tools with LLMs

I built a lightweight test preparation agent that generates practice questions, scores answers, and explains mistakes in real time. It is useful for students cramming for exams and for developers embedding tutoring logic into edtech apps. In this tutorial I will walk through the exact code I shipped, using Oxlo.ai as the inference backend and the OpenAI SDK as the client.

What you'll need

You will need Python 3.10 or newer, the OpenAI SDK, and an Oxlo.ai API key.

pip install openai

Grab your key from https://portal.oxlo.ai. I keep mine in an environment variable so it never ends up in source code or version control. Because Oxlo.ai is fully OpenAI-compatible, you do not need to learn a new SDK. You only change the base URL and API key.

Step 1: Set up the Oxlo.ai client

I start by importing the SDK and pointing it at Oxlo.ai. I use llama-3.3-70b as the default model because it handles general instruction following reliably. You can later swap the model string in each create() call to qwen-3-32b, kimi-k2.6, or deepseek-v3.2 without changing any other code, because Oxlo.ai exposes all of them through the same endpoint.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY")
)
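One detail the client setup leaves implicit: os.environ.get returns None when OXLO_API_KEY is unset, and the model name is a literal string. Here is a minimal sketch of a settings helper, assuming you would rather fail at startup with a clear message. The names require_env, resolve_model, and the TUTOR_MODEL variable are hypothetical and not part of the OpenAI SDK or Oxlo.ai.

```python
import os

DEFAULT_MODEL = "llama-3.3-70b"  # the default used throughout this tutorial


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the tutor.")
    return value


def resolve_model() -> str:
    """Read the model name from TUTOR_MODEL (a hypothetical variable), else use the default."""
    return os.environ.get("TUTOR_MODEL", "").strip() or DEFAULT_MODEL
```

With this in place you could pass api_key=require_env("OXLO_API_KEY") to the client and model=resolve_model() to each request, so a model swap becomes a one-line environment change.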

Step 2: Write the system prompt

The system prompt is the entire curriculum for the agent. I define three distinct modes, output rules, and length limits so the model stays consistent across generate, evaluate, and explain calls. I keep the prompt verbose because Oxlo.ai uses request-based pricing, which means cost does not scale with input length. Adding a long rubric or a multi-turn history does not inflate the per-call cost, so you can iterate on prompt quality without token anxiety. See the exact plan details at https://oxlo.ai/pricing.

SYSTEM_PROMPT = """You are a concise test preparation tutor. You have three jobs:
1. Generate a practice question based on the topic and difficulty level (1-10).
2. Evaluate a student's answer and return valid JSON with fields: correct (boolean), score (0-100), feedback (string).
3. If the answer is wrong, explain the correct reasoning in one short paragraph.

Rules:
- Questions must be under 50 words.
- Feedback must point to a specific concept, not generic praise.
- Evaluation responses must contain only the JSON object, with no markdown fences.
"""

Step 3: Build the question generator

I create a helper that sends the topic and difficulty to the model and returns the question text. I set temperature to 0.7 so each study session feels fresh, and I cap max_tokens at 150 to keep responses short. If you want harder questions, you can append extra instructions like "include an edge case" to the user message without worrying about token cost.

def generate_question(topic: str, difficulty: int) -> str:
    user_msg = f"Generate one practice question about {topic} at difficulty {difficulty}/10."
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.7,
        max_tokens=150,
    )
    return response.choices[0].message.content.strip()
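The system prompt asks for questions under 50 words, but a model can still overshoot. The sketch below shows one way to check the limit and retry, assuming a simple whitespace word count is good enough. The helpers within_word_limit and generate_with_retry are hypothetical names, and the generator is passed in as a callable so the check works without a network call.

```python
from typing import Callable


def within_word_limit(text: str, limit: int = 50) -> bool:
    """True when the text has fewer than `limit` whitespace-separated words."""
    return len(text.split()) < limit


def generate_with_retry(generate: Callable[[], str], attempts: int = 3) -> str:
    """Call `generate` until the result fits the word limit or attempts run out.

    Returns the last candidate even if it is still too long, so the caller
    always gets a question back.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    candidate = ""
    for _ in range(attempts):
        candidate = generate()
        if within_word_limit(candidate):
            break
    return candidate
```

You would call it as generate_with_retry(lambda: generate_question("Python data structures", 6)). Keep in mind that every retry is another request.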

Step 4: Build the answer evaluator

Scoring should be consistent from run to run, so I drop temperature to 0.2. That reduces variation but does not make the output strictly deterministic. The model returns JSON that I parse into a Python dict. I strip any accidental markdown fences because some models wrap JSON in triple backticks when they are not explicitly forced into JSON mode. In production you should wrap the json.loads call in a try/except block, but for clarity I keep it direct here.

import json

def evaluate_answer(question: str, answer: str) -> dict:
    user_msg = (
        f"Question: {question}\n"
        f"Student answer: {answer}\n"
        "Evaluate the answer and respond with valid JSON only using the format: "
        '{"correct": bool, "score": int, "feedback": string}'
    )
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
        temperature=0.2,
        max_tokens=300,
    )
    raw = response.choices[0].message.content.strip()
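    if raw.startswith("```"):
        # Strip only the fence markers and an optional "json" language tag,
        # so the word "json" inside the feedback text is left untouched.
        raw = raw.strip("`").removeprefix("json").strip()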
    return json.loads(raw)
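The production hardening mentioned above (a try/except around json.loads) could look roughly like this. It is a sketch, not the code I shipped: parse_evaluation is a hypothetical helper, and the "score 0, not correct" fallback for unparseable replies is an assumption you may want to replace with a retry.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the literal out of the source
REQUIRED_KEYS = {"correct", "score", "feedback"}


def parse_evaluation(raw: str) -> dict:
    """Parse the model's evaluation reply, tolerating fences and invalid JSON."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)...
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # ...and everything from the closing fence onward, if there is one.
        text = text.rsplit(FENCE, 1)[0].strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= set(data):
        # Assumed fallback: treat an unparseable reply as an ungraded answer.
        return {"correct": False, "score": 0, "feedback": "Could not parse the evaluation."}
    return data
```

To use it, replace the last three lines of evaluate_answer with return parse_evaluation(raw).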

Step 5: Wire the interactive session loop

The loop ties the two helpers together. It prints the question, collects input, scores the answer, and requests an explanation only when the answer is wrong. In a production app you would replace the input() call with an API endpoint, but the logic stays identical. If you want the tutor to remember earlier mistakes, append each turn to the messages list and pass the full array on the next request. Because Oxlo.ai charges per request rather than per token, that growing context remains cost-predictable, although it is still bounded by the model's context window.

def run_session(topic: str, difficulty: int):
    print(f"Starting session: {topic} (difficulty {difficulty})")
    question = generate_question(topic, difficulty)
    print(f"\nQuestion: {question}\n")

    user_answer = input("Your answer: ")

    result = evaluate_answer(question, user_answer)
    print(f"\nScore: {result['score']}/100")
    print(f"Feedback: {result['feedback']}")

    if not result['correct']:
        print("\n--- Explanation ---")
        explain_msg = f"Question: {question}\nExplain the correct answer clearly."
        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": explain_msg},
            ],
            temperature=0.5,
            max_tokens=250,
        )
        print(response.choices[0].message.content.strip())
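The "remember earlier mistakes" idea from this step can be sketched as a small history object that builds the messages array for each request. This is an illustration under assumptions: SessionHistory is a hypothetical class, and the max_turns cap of 20 is an arbitrary guess meant to keep the context bounded, not a limit documented by Oxlo.ai.

```python
from dataclasses import dataclass, field


@dataclass
class SessionHistory:
    """Accumulates chat turns so later requests can see earlier mistakes."""

    system_prompt: str
    turns: list[dict] = field(default_factory=list)
    max_turns: int = 20  # assumed cap, chosen arbitrarily for this sketch

    def add(self, role: str, content: str) -> None:
        """Record one turn and keep only the most recent `max_turns` turns."""
        self.turns.append({"role": role, "content": content})
        self.turns = self.turns[-self.max_turns:]

    def messages(self, next_user_msg: str) -> list[dict]:
        """Build the full array to pass as `messages` on the next request."""
        return [
            {"role": "system", "content": self.system_prompt},
            *self.turns,
            {"role": "user", "content": next_user_msg},
        ]
```

In run_session you would create one SessionHistory(SYSTEM_PROMPT), call add("user", ...) and add("assistant", ...) after each exchange, and pass history.messages(explain_msg) instead of the two-element list.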

Run it

Save the script as tutor.py and run it. Here is a real session I recorded against Oxlo.ai using the Python data structures topic at difficulty 6.

if __name__ == "__main__":
    run_session("Python data structures", 6)

Terminal output:

Starting session: Python data structures (difficulty 6)

Question: In Python, what is the time complexity of inserting an element at the beginning of a list, and why?

Your answer: O(1) because lists are arrays

Score: 15/100
Feedback: Incorrect. Inserting at the beginning of a Python list requires shifting all existing elements, so it is not constant time.

--- Explanation ---
Inserting at index 0 of a Python list is O(n) because the underlying dynamic array must shift every existing element one position to the right to make room. If you need O(1) insertion at both ends, use collections.deque instead.

Next steps

Add a spaced repetition scheduler that reintroduces low-scoring topics after a delay. You can store the result dicts in SQLite, filter by a score threshold such as 70, and queue those topics for review in tomorrow's session.
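As a starting point for that scheduler, here is a minimal sketch using the standard library sqlite3 module. The table layout, the function names, and the one-day delay are all assumptions made for illustration. The rule it implements is: a topic is due when its most recent score is below the threshold and that attempt is at least delay_days old.

```python
import sqlite3
from datetime import date, timedelta


def init_db(conn: sqlite3.Connection) -> None:
    """Create the results table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "topic TEXT NOT NULL, score INTEGER NOT NULL, answered_on TEXT NOT NULL)"
    )


def record_result(conn: sqlite3.Connection, topic: str, result: dict, answered_on: date) -> None:
    """Store the score from one evaluate_answer result dict."""
    conn.execute(
        "INSERT INTO results (topic, score, answered_on) VALUES (?, ?, ?)",
        (topic, int(result["score"]), answered_on.isoformat()),
    )
    conn.commit()


def topics_due(conn: sqlite3.Connection, today: date,
               threshold: int = 70, delay_days: int = 1) -> list[str]:
    """Topics whose latest score is below threshold and at least delay_days old."""
    cutoff = (today - timedelta(days=delay_days)).isoformat()
    rows = conn.execute(
        """
        SELECT r.topic FROM results r
        JOIN (SELECT topic, MAX(rowid) AS last_id FROM results GROUP BY topic) latest
          ON r.rowid = latest.last_id
        WHERE r.score < ? AND r.answered_on <= ?
        ORDER BY r.score ASC
        """,
        (threshold, cutoff),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling record_result(conn, topic, result, date.today()) at the end of run_session and topics_due(conn, date.today()) at the start of the next one would give you tomorrow's review queue, lowest scores first.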

If you are building for a multilingual audience, swap the model string to qwen-3-32b or kimi-k2.6. Both are available on Oxlo.ai under the same per-request pricing, so switching models does not change how you count costs. You can also try deepseek-v3.2 if you want to test against a strong coding and reasoning model on the free tier before scaling up.
