Data desk · LA inspections
Cocina
Three approaches to the same question. Same model, same data — very different answers.
Ask a question about LA restaurant health inspections and watch what happens when you give an LLM nothing, give it raw data, and give it data plus a structured harness — metric definitions, validation rules, and an interpretation framework.
The model matters, but the structure around it matters more. A companion to *I Spent Over $200 Teaching a Model What "Clean" Means*.
- No Context: LLM only, vibes
- Raw Data: LLM + CSV dump
- Structured Harness: LLM + data + pipeline
How It Works
1. LLM: no data, no harness
LLM gets your question with no data, no metrics, no validation — training data impressions and confident vibes.
2. LLM + Data: context stuffed
LLM + Data gets every matching raw inspection record dumped into the context window as CSV: the actual rows, not pre-computed averages, and the same records the harness's aggregates were computed from. It has to do its own math. Same model, same data, but no metric definitions, no validation rules, no interpretation framework. This is the "just stuff it in the prompt" approach; a minimal sketch of it appears right after the step list.
3. LLM + Data + Harness: structured pipeline
LLM + Data + Harness runs a 6-stage pipeline. First, it figures out what you're actually asking — average scores, grade distributions, trends, or something else. Then it builds a query plan and looks up pre-computed answers. Instead of asking the LLM to do math, we calculated averages, distributions, and trends for every cuisine and neighborhood combination ahead of time. The LLM never touches raw data — it gets the answer sheet.
Before responding, the harness checks whether it has enough data to be trustworthy. Three inspections for a combination? It says so, instead of pretending a tiny sample is reliable. Then it generates a narrative with comparisons and caveats, and a verifier checks that the narrative actually contains the key numbers from the data. Four of six stages are deterministic TypeScript with zero API cost. The LLM only touches the edges. A condensed sketch of the lookup, validation, and verification stages appears below, after the context-stuffing sketch.
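To make the "context stuffed" approach concrete, here is a minimal sketch. The row shape, field names, and `buildStuffedPrompt` helper are illustrative assumptions rather than the project's actual code; the point is that the model receives raw CSV rows and is left to do all the arithmetic itself.

```typescript
// Illustrative sketch of approach 2. These names are assumptions, not the
// project's actual schema or API. Real CSV escaping is omitted for brevity.
interface InspectionRow {
  facility: string;
  cuisine: string;
  neighborhood: string;
  date: string;   // inspection date, ISO format
  score: number;  // numeric inspection score
  grade: string;  // letter grade, e.g. "A"
}

// Serialize every matching record as CSV and hand the math to the model:
// no metric definitions, no validation rules, no interpretation framework.
function buildStuffedPrompt(question: string, rows: InspectionRow[]): string {
  const header = "facility,cuisine,neighborhood,date,score,grade";
  const csv = rows
    .map(r => [r.facility, r.cuisine, r.neighborhood, r.date, r.score, r.grade].join(","))
    .join("\n");
  return [
    "Answer the question using only the inspection records below.",
    `Question: ${question}`,
    "Records (CSV):",
    header,
    csv,
  ].join("\n");
}
```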
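For contrast, here is a condensed sketch of the three harness stages described in step 3: the pre-computed lookup, the sample-size check, and the narrative verifier. The type names, key format, and `MIN_SAMPLE` threshold are assumptions for illustration; the real pipeline has six stages and richer metric definitions.

```typescript
// Condensed sketch of three harness stages. Shapes, names, and the
// MIN_SAMPLE threshold are illustrative assumptions, not the real pipeline.
interface MetricAggregate {
  cuisine: string;
  neighborhood: string;
  inspectionCount: number;
  avgScore: number;
  gradeDistribution: Record<string, number>; // e.g. { A: 0.91, B: 0.07, C: 0.02 }
}

const MIN_SAMPLE = 30; // below this, the answer carries a caveat instead of false confidence

// Lookup stage: fetch a pre-computed answer; the LLM never does the math.
function lookupAggregate(
  index: Map<string, MetricAggregate>,
  cuisine: string,
  neighborhood: string,
): MetricAggregate | undefined {
  return index.get(`${cuisine}|${neighborhood}`);
}

// Validation stage: flag tiny samples rather than reporting them as reliable.
function validate(agg: MetricAggregate): { ok: boolean; caveat?: string } {
  if (agg.inspectionCount < MIN_SAMPLE) {
    return {
      ok: false,
      caveat: `Only ${agg.inspectionCount} inspections for this combination; treat with caution.`,
    };
  }
  return { ok: true };
}

// Verification stage: the generated narrative must actually contain the key
// numbers from the data it claims to describe.
function verifyNarrative(narrative: string, agg: MetricAggregate): boolean {
  const keyNumbers = [agg.avgScore.toFixed(1), String(agg.inspectionCount)];
  return keyNumbers.every(n => narrative.includes(n));
}
```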
The harness doesn't always win. Ask something vague and watch it refuse. Ask about inspection frequency and watch the raw LLM outperform it. The point isn't perfection — it's knowing when to answer and when not to.
Data: LA County Dept. of Public Health · 106,694 inspections · 2023–2025 · geocoded against LA Times boundary polygons. Pre-computed into 3,600+ metric aggregates across 34 cuisine types and 122 neighborhoods.
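As a rough illustration of that offline step, the sketch below groups inspections by cuisine and neighborhood in a single pass and keeps a running count and mean. The field names and key format are assumptions, not the dataset's actual schema.

```typescript
type Inspection = { cuisine: string; neighborhood: string; score: number; grade: string };

type Aggregate = {
  cuisine: string;
  neighborhood: string;
  inspectionCount: number;
  avgScore: number;
  gradeCounts: Record<string, number>;
};

// One pass over the raw rows builds the cuisine x neighborhood index the
// harness queries at runtime.
function buildAggregates(rows: Inspection[]): Map<string, Aggregate> {
  const index = new Map<string, Aggregate>();
  for (const row of rows) {
    const key = `${row.cuisine}|${row.neighborhood}`;
    const agg: Aggregate = index.get(key) ?? {
      cuisine: row.cuisine,
      neighborhood: row.neighborhood,
      inspectionCount: 0,
      avgScore: 0,
      gradeCounts: {},
    };
    // Incremental mean keeps this a single pass over ~100k inspections.
    agg.avgScore = (agg.avgScore * agg.inspectionCount + row.score) / (agg.inspectionCount + 1);
    agg.inspectionCount += 1;
    agg.gradeCounts[row.grade] = (agg.gradeCounts[row.grade] ?? 0) + 1;
    index.set(key, agg);
  }
  return index; // only combinations that actually occur get an entry
}
```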
About This Project
Evaluation Results
34 hand-curated queries against ground truth extracted directly from the dataset — numerical accuracy, rankings, behavioral edge cases, cuisine-specific lookups, and boundary-behavior tests. Each query has a verifiable correct answer.
| Approach | Accuracy | Pass / Total |
|---|---|---|
| LLM (raw) | 26% | 9 / 34 |
| LLM + Data (stuffed) | 47% | 16 / 34 |
| LLM + Data + Harness | 100% | 34 / 34 |
The full eval suite, with ground truth, queries, expected values, and results, covers all 34 queries across those five categories.
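As a sketch of how a suite like this can be scored, the loop below compares each answer's first number against a ground-truth value within a tolerance. The `EvalCase` shape, the tolerance field, and the naive number extraction are assumptions for illustration; the real suite also covers rankings and behavioral cases that a single-number check cannot capture.

```typescript
// Illustrative eval-loop sketch, not the project's actual eval harness.
interface EvalCase {
  question: string;
  expectedValue: number; // ground truth extracted from the dataset
  tolerance: number;     // how close a numeric answer must be to count as a pass
}

// Naive extraction: take the first number that appears in the answer text.
function extractFirstNumber(answer: string): number | null {
  const match = answer.match(/-?\d+(\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
}

// Run one approach (raw LLM, stuffed, or harness) over every case and count passes.
async function runEval(
  cases: EvalCase[],
  ask: (question: string) => Promise<string>,
): Promise<{ passed: number; total: number }> {
  let passed = 0;
  for (const c of cases) {
    const answer = await ask(c.question);
    const value = extractFirstNumber(answer);
    if (value !== null && Math.abs(value - c.expectedValue) <= c.tolerance) {
      passed += 1;
    }
  }
  return { passed, total: cases.length };
}
```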
Who's Kazan?
Kazan 🌋 is my AI assistant — an OpenClaw agent running Claude Opus 4.6, connected to my tools, files, and workflows. I co-developed this project with Kazan — from brainstorming the concept, to researching the dataset, to writing the code, to iterating on the UI based on my feedback. I brought the product thinking, the metric design, and the editorial judgment. Kazan brought the code generation, data processing, and the ability to ship at 2 AM while I slept.
If a model starts generating its own metric definitions and validation rules unprompted, I'll revisit this thesis.