Building Environmental Science Tools with LLMs: A Tutorial

We're building an environmental compliance analyzer that ingests water quality sensor data and flags EPA Clean Water Act violations. It helps field technicians and environmental engineers turn raw sensor logs into structured compliance reports without maintaining complex rule engines. You can run this on Oxlo.ai with any of their reasoning models, and the flat request pricing keeps costs predictable even when you feed in long parameter histories.

What you'll need

You need Python 3.10 or newer, the OpenAI SDK, and an API key from Oxlo.ai. Sign up at https://portal.oxlo.ai and grab your key from the dashboard. Install the SDK with pip. I also assume you have a basic understanding of JSON and water quality parameters, though the prompt handles the regulatory logic for you.

pip install openai

Step 1: Configure the Oxlo.ai client

I always start by initializing the client as a drop-in replacement for the OpenAI SDK. Oxlo.ai exposes a fully compatible endpoint at https://api.oxlo.ai/v1, so the only changes are the base URL and your Oxlo.ai API key. This means you can use existing Python tooling without vendor lock-in. I import json as well because the agent will return structured reports.

from openai import OpenAI
import json

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"  # Get this from https://portal.oxlo.ai
)

Step 2: Define the environmental science system prompt

The system prompt anchors the model in the correct regulatory domain. I constrain it to EPA water quality criteria, require a regulatory reference for each finding, and force JSON output so I can pipe the results into downstream dashboards or SCADA systems. Spelling out the limit values in the prompt reduces the chance that the model invents them, which matters when a false negative could mean a missed discharge violation. I store the prompt as a module-level constant so I can version it alongside the code.

SYSTEM_PROMPT = """You are a senior environmental compliance analyst specializing in the Clean Water Act.
You receive water quality sensor readings and determine compliance status against EPA National Recommended Water Quality Criteria.

For each reading, evaluate:
- pH: acceptable range 6.5 to 8.5
- Dissolved Oxygen: minimum 5.0 mg/L for warm-water aquatic life
- Turbidity: maximum 10 NTU above background
- Temperature: cannot exceed 32 degrees C

Respond ONLY in valid JSON with this structure:
{
  "site_id": "string",
  "overall_status": "compliant" or "violation",
  "findings": [
    {
      "parameter": "string",
      "value": number,
      "unit": "string",
      "limit": "string",
      "status": "pass" or "fail",
      "regulatory_reference": "string"
    }
  ],
  "recommended_actions": ["string"]
}

Do not include markdown formatting or explanations outside the JSON."""
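One way to keep the limit values reviewable is to hold them in a small mapping and render the bullet list from it, so a change to a limit shows up as a one-line diff. This is a minimal sketch, not part of the original script: the names LIMITS and render_limit_lines are hypothetical, and the rule text is copied from the prompt above.

```python
# Hypothetical structure: each entry mirrors one bullet in SYSTEM_PROMPT.
LIMITS = {
    "pH": "acceptable range 6.5 to 8.5",
    "Dissolved Oxygen": "minimum 5.0 mg/L for warm-water aquatic life",
    "Turbidity": "maximum 10 NTU above background",
    "Temperature": "cannot exceed 32 degrees C",
}


def render_limit_lines(limits):
    """Render one '- name: rule' bullet per parameter, in insertion order."""
    return "\n".join(f"- {name}: {rule}" for name, rule in limits.items())


limit_block = render_limit_lines(LIMITS)
print(limit_block)
```

The rendered block could then be substituted into the prompt text in place of the hand-written bullets.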

Step 3: Format sensor data for the model

Field sensors output messy CSVs with inconsistent units and missing timestamps. I wrote a small formatter that normalizes a dictionary of readings into a consistent text block. This keeps the LLM prompt deterministic and easy to debug if a particular reading gets misclassified. Separating data cleaning from the LLM call also means I can swap in live MQTT streams later without touching the inference layer.

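def format_sensor_payload(site_id, readings):
    """Render a site's readings as one consistent text block for the prompt."""
    lines = [f"Site ID: {site_id}", "Sensor Readings:"]
    for param, data in readings.items():
        # Missing timestamps are common in field logs, so fall back to a marker
        timestamp = data.get("timestamp") or "unknown"
        lines.append(f"- {param}: {data['value']} {data['unit']} (recorded {timestamp})")
    return "\n".join(lines)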

# Example field data
sample_readings = {
    "pH": {"value": 8.9, "unit": "standard units", "timestamp": "2024-05-14T09:00:00Z"},
    "Dissolved Oxygen": {"value": 4.2, "unit": "mg/L", "timestamp": "2024-05-14T09:00:00Z"},
    "Turbidity": {"value": 14.3, "unit": "NTU", "timestamp": "2024-05-14T09:00:00Z"},
    "Temperature": {"value": 29.5, "unit": "degrees C", "timestamp": "2024-05-14T09:00:00Z"}
}

user_message = format_sensor_payload("WQ-Station-042", sample_readings)
print(user_message)
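The cleaning step that runs before format_sensor_payload might look like the sketch below. It assumes a hypothetical CSV export with parameter, value, unit, and timestamp columns; the column names, the Fahrenheit row, and the normalize_rows helper are illustrative assumptions, not part of the original script.

```python
import csv
import io

# Hypothetical raw export: column names and the Fahrenheit row are assumptions.
RAW_CSV = """parameter,value,unit,timestamp
pH,8.9,standard units,2024-05-14T09:00:00Z
Dissolved Oxygen,4.2,mg/L,
Temperature,85.1,degrees F,2024-05-14T09:00:00Z
"""


def normalize_rows(csv_text):
    """Convert raw CSV rows into the readings dict the formatter expects."""
    readings = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row["value"])
        unit = row["unit"].strip()
        # Convert Fahrenheit to Celsius so every temperature shares one unit
        if unit == "degrees F":
            value = round((value - 32) * 5 / 9, 1)
            unit = "degrees C"
        readings[row["parameter"].strip()] = {
            "value": value,
            "unit": unit,
            # An empty timestamp cell becomes an explicit marker
            "timestamp": row["timestamp"].strip() or "unknown",
        }
    return readings


clean_readings = normalize_rows(RAW_CSV)
print(clean_readings["Temperature"])
```

The resulting dictionary has the same shape as sample_readings, so it can be passed to format_sensor_payload unchanged.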

Step 4: Build the analysis function

Now I wire the formatted payload to Oxlo.ai. I pass response_format={"type": "json_object"} to request a JSON-only reply, and I picked qwen-3-32b because it handles structured generation and multilingual reasoning well. That matters when I eventually expand this to international sites with bilingual reporting requirements. I set temperature to 0.1 because regulatory analysis should be as repeatable as possible, not creative.

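def analyze_water_quality(site_id, readings):
    """Send one site's readings to the model and return the parsed report."""
    user_message = format_sensor_payload(site_id, readings)

    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        response_format={"type": "json_object"},  # ask for a JSON-only reply
        temperature=0.1,  # low temperature for repeatable output
    )

    # json.loads raises json.JSONDecodeError if the reply is not valid JSON
    return json.loads(response.choices[0].message.content)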
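Even with a JSON-only reply requested, it may be worth checking the shape of the report before it reaches a dashboard. The sketch below is one possible guard, assuming the top-level keys from the system prompt; parse_report and REQUIRED_KEYS are hypothetical names, not part of the original script.

```python
import json

# Top-level keys the system prompt asks the model to return
REQUIRED_KEYS = {"site_id", "overall_status", "findings", "recommended_actions"}


def parse_report(raw_text):
    """Parse the model's reply and check its top-level shape before use."""
    try:
        report = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model reply was not valid JSON: {exc}") from exc
    if not isinstance(report, dict):
        raise ValueError("Model reply was JSON but not an object")
    missing = REQUIRED_KEYS - set(report)
    if missing:
        raise ValueError(f"Report is missing keys: {sorted(missing)}")
    if report["overall_status"] not in ("compliant", "violation"):
        raise ValueError(f"Unexpected overall_status: {report['overall_status']!r}")
    return report


# Usage with a hand-written reply standing in for the model output
checked = parse_report(
    '{"site_id": "WQ-Station-017", "overall_status": "compliant", '
    '"findings": [], "recommended_actions": []}'
)
print(checked["overall_status"])
```

Raising a single exception type keeps the calling code simple: a caller can catch ValueError and decide whether to retry or flag the site for manual review.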

Step 5: Batch process multiple monitoring sites

In production, I never analyze just one site. I wrap the call in a loop and collect results for an entire watershed. Because Oxlo.ai uses request-based pricing, I do not have to worry about blowing up costs when I append long historical context or detailed parameter lists to each prompt. A token-based provider would charge for every sensor reading in the context window, but here the flat per-request rate makes batch environmental surveys predictable. I dump the aggregated reports to a JSON file that the compliance team can ingest into their LIMS.

sites = [
    ("WQ-Station-042", {
        "pH": {"value": 8.9, "unit": "standard units", "timestamp": "2024-05-14T09:00:00Z"},
        "Dissolved Oxygen": {"value": 4.2, "unit": "mg/L", "timestamp": "2024-05-14T09:00:00Z"},
        "Turbidity": {"value": 14.3, "unit": "NTU", "timestamp": "2024-05-14T09:00:00Z"},
        "Temperature": {"value": 29.5, "unit": "degrees C", "timestamp": "2024-05-14T09:00:00Z"}
    }),
    ("WQ-Station-017", {
        "pH": {"value": 7.2, "unit": "standard units", "timestamp": "2024-05-14T09:15:00Z"},
        "Dissolved Oxygen": {"value": 6.8, "unit": "mg/L", "timestamp": "2024-05-14T09:15:00Z"},
        "Turbidity": {"value": 3.1, "unit": "NTU", "timestamp": "2024-05-14T09:15:00Z"},
        "Temperature": {"value": 22.0, "unit": "degrees C", "timestamp": "2024-05-14T09:15:00Z"}
    })
]

reports = []
for site_id, readings in sites:
    report = analyze_water_quality(site_id, readings)
    reports.append(report)
    print(f"Processed {site_id}: {report['overall_status']}")

# Save to file for the compliance team
with open("compliance_reports.json", "w") as f:
    json.dump(reports, f, indent=2)
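The loop above stops at the first exception, so one failed request would drop every later site from the file. A minimal sketch of a more forgiving batch runner follows; run_batch, its max_attempts parameter, and the fake_analyze stand-in are hypothetical and exist only to illustrate the idea without calling the API.

```python
def run_batch(sites, analyze, max_attempts=2):
    """Analyze each site, retry failures, and continue if a site keeps failing."""
    reports, failures = [], []
    for site_id, readings in sites:
        for attempt in range(1, max_attempts + 1):
            try:
                reports.append(analyze(site_id, readings))
                break
            except Exception as exc:  # broad on purpose: network, JSON, key errors
                if attempt == max_attempts:
                    failures.append({"site_id": site_id, "error": str(exc)})
    return reports, failures


def fake_analyze(site_id, readings):
    """Stand-in for analyze_water_quality that fails for one site."""
    if site_id == "WQ-Station-BAD":
        raise RuntimeError("simulated timeout")
    return {"site_id": site_id, "overall_status": "compliant"}


demo_reports, demo_failures = run_batch(
    [("WQ-Station-017", {}), ("WQ-Station-BAD", {})], fake_analyze
)
print(len(demo_reports), demo_failures)
```

In the real script, analyze_water_quality would be passed in place of fake_analyze, and the failures list could be written next to compliance_reports.json so skipped sites stay visible.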

Run it

Execute the script from your terminal. You should see a status line for each site and a compliance_reports.json file on disk. Here is the output I got for the non-compliant station. Notice how the model cites the specific regulatory references and suggests concrete remediation steps.

{
  "site_id": "WQ-Station-042",
  "overall_status": "violation",
  "findings": [
    {
      "parameter": "pH",
      "value": 8.9,
      "unit": "standard units",
      "limit": "6.5 - 8.5",
      "status": "fail",
      "regulatory_reference": "40 CFR 131.36"
    },
    {
      "parameter": "Dissolved Oxygen",
      "value": 4.2,
      "unit": "mg/L",
      "limit": "minimum 5.0 mg/L",
      "status": "fail",
      "regulatory_reference": "EPA NRWQC 2016"
    },
    {
      "parameter": "Turbidity",
      "value": 14.3,
      "unit": "NTU",
      "limit": "10 NTU above background",
      "status": "fail",
      "regulatory_reference": "40 CFR 131.36"
    },
    {
      "parameter": "Temperature",
      "value": 29.5,
      "unit": "degrees C",
      "limit": "32 degrees C maximum",
      "status": "pass",
      "regulatory_reference": "EPA NRWQC 2016"
    }
  ],
  "recommended_actions": [
    "Investigate alkalinity source causing elevated pH",
    "Deploy aeration system to increase dissolved oxygen",
    "Conduct upstream sediment control inspection"
  ]
}
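Because the numeric limits are already written into the system prompt, the model's pass and fail calls can be cross-checked locally. The sketch below assumes the three limits that need no extra context; turbidity is skipped because the payload carries no background value. The names expected_status and find_disagreements are hypothetical, and sample_report is hand-written with one deliberate error.

```python
def expected_status(parameter, value):
    """Return 'pass' or 'fail' per the prompt's limits, or None if unchecked."""
    if parameter == "pH":
        return "pass" if 6.5 <= value <= 8.5 else "fail"
    if parameter == "Dissolved Oxygen":
        return "pass" if value >= 5.0 else "fail"
    if parameter == "Temperature":
        return "pass" if value <= 32.0 else "fail"
    return None  # e.g. turbidity, which depends on a background reading


def find_disagreements(report):
    """List findings where the model's status differs from the local check."""
    mismatches = []
    for finding in report["findings"]:
        expected = expected_status(finding["parameter"], finding["value"])
        if expected is not None and expected != finding["status"]:
            mismatches.append((finding["parameter"], finding["status"], expected))
    return mismatches


# Hand-written report; the dissolved oxygen status is wrong on purpose
sample_report = {"findings": [
    {"parameter": "pH", "value": 8.9, "status": "fail"},
    {"parameter": "Dissolved Oxygen", "value": 4.2, "status": "pass"},
    {"parameter": "Turbidity", "value": 14.3, "status": "fail"},
]}
print(find_disagreements(sample_report))
# prints [('Dissolved Oxygen', 'pass', 'fail')]
```

Any disagreement is a signal to route the report to a human reviewer rather than to trust either side automatically.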

Wrap-up and next steps

This agent replaces a brittle rule engine with a reasoning layer that can cite regulations, handle edge cases, and suggest remediation. Two concrete ways to extend it: first, wire it to a real MQTT or LoRaWAN sensor stream so it evaluates readings as they arrive and triggers alerts. Second, add vision support by uploading photos of chemical test strips or analog gauge clusters using Oxlo.ai's kimi-k2.6 or gemma-3-27b-it vision capabilities, then merge the extracted values with the sensor payload before analysis. You can explore request-based pricing for long-context environmental reports at https://oxlo.ai/pricing.
