I Spent Over $200 Teaching a Model What 'Clean' Means


Over a few weekends, I built a little project to ask questions about Los Angeles restaurant health inspections using OpenAI's Platform API. The dataset isn't crazy — just over 100k records. Just to be safe, I set a $5 daily budget cap on my demo app because I figured that was generous. It's a toy project — how many queries could possibly hit my cap?

Four, that's how many. Each one packed around 106,000 inspection records into the model's context window as raw CSV. With GPT 5.4, each query cost around $1.40. The fifth query bounced, forcing me to raise my own cap just to keep working on the thing.

Meanwhile, I'd built a second approach to the same problem — a structured pipeline that never sends raw data to the model at all. Same questions, same dataset. That's where I found savings — less than a penny per query. The $5 that choked on four context-packed queries would have covered about 1,600 of these.

That gap is what this post is about. Not which approach is "better" — but what you're actually trading when you choose one over the other.

What I built and why

The question I hear constantly in analytics engineering circles — in conference talks, in the blogs I read, in the way practitioners talk about AI — is the same: when someone asks a data question and an LLM answers it, how do you know the answer is right? The models keep getting better at sounding right. The part where you can actually verify the output is still a mess.

That's both the cost question and the correctness question. Packing a context window isn't just expensive — it also means you're trusting the model to do math, interpret definitions, and decide what "good" means, all at once, with no audit trail. A harness — structured code around a model that steers it toward specific outcomes — gives you a handle on both problems.

Cocina is my attempt to explore that. The app takes 106,694 LA County restaurant health inspections — spanning two years of scores, grades, violations, and facility types across 122 neighborhoods — and lets you ask questions about them in natural language.

Try it yourself: acwx.net/apps/cocina

Cocina — comparing Mexican and Korean restaurant scores

To test the idea, I ran the same question through three approaches side by side.

  • The raw model: I literally just ask it a question as if the model had been trained on this data. "Which neighborhood has the highest-rated restaurants?" You'll get an answer back — but it's relying entirely on what the model has already seen. No one builds production systems this way — I include it as a baseline to show what happens with zero context.

  • The data-stuffed model: The lazy-but-good-enough approach. Take all 106K records, download them, and mash them into the context window. From here, you're relying on the model to do its own math — computing averages, looking for outliers. The interesting thing is that with frontier models, it does pretty well, which I didn't totally expect. The catch: on GPT 5.4, it costs about $1.40 per query. Ask enough questions and the cost adds up fast. You'll soon read about how testing this approach led to a $216 bill from OpenAI.

  • The harnessed model: Steer the model with software. Rather than ask a raw question and hope for the best, or stuff all the records into the context window, here you write software to compute and do a lot of the work ahead of time — then only use the model to figure out what code to run and narrate the result back. Because most of the logic lives in actual software, you're not packing the context window, so queries are cheap whether you're using a frontier model or a distilled one. And because you're returning software-computed values instead of leaving it to the model's interpretation, you always get the same math back.
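To put rough numbers on the gap between the second and third approach, here's the back-of-envelope math. The tokens-per-row figure and per-token price are assumptions for illustration, not real GPT 5.4 pricing:

```typescript
// Back-of-envelope cost comparison. Both constants below are ASSUMED
// placeholder values, not actual model pricing.
const RECORDS = 106_694;
const TOKENS_PER_CSV_ROW = 10; // rough guess for a short CSV line
const INPUT_PRICE_PER_M = 1.25; // $ per million input tokens (assumed)

// Context-packed query: every record rides along as input tokens.
const stuffedTokens = RECORDS * TOKENS_PER_CSV_ROW;
const stuffedCost = (stuffedTokens / 1_000_000) * INPUT_PRICE_PER_M;

// Harnessed query: a short question in, a short narration out.
const harnessedTokens = 1_500;
const harnessedCost = (harnessedTokens / 1_000_000) * INPUT_PRICE_PER_M;

console.log(`stuffed:   ~$${stuffedCost.toFixed(2)} per query`);
console.log(`harnessed: ~$${harnessedCost.toFixed(4)} per query`);
```

Under these assumptions a context-packed query lands in the ~$1.30 range while a harnessed one rounds to a fraction of a cent — the same order-of-magnitude gap I saw on my bill.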

There are plenty of other approaches to this problem — RAG being the obvious one. I'm focused on harness design specifically because it's closest to the work I actually do: defining metrics, building eval suites, making analytical logic auditable. That's the part I wanted to explore.

I'm not trying to crown a winner here. I want to know what you're actually giving up when you pick one of these approaches over another.

What "harness" means in practice

I like Martin Fowler's definition:

A well-built outer harness serves two goals: it increases the probability that the agent gets it right in the first place, and it provides a feedback loop that self-corrects as many issues as possible before they even reach human eyes. Ultimately it should reduce the review toil and increase the system quality, all with the added benefit of fewer wasted tokens along the way.

In practice, that means structured code wrapped around a model, steering it toward specific outcomes. So I set out to build one. What would it actually mean to build a harness that classifies and delivers restaurant inspection results? After some tinkering, I landed on a five-stage pipeline. Two stages involve the model. Three are just code.

  1. The intent classifier reads your question and maps it to a structured query. "What's the average health score in Koreatown?" becomes a specific metric, dimension, and filter. This is the only part of the pipeline where the model makes a judgment call — deciding how your question maps to a specific metric.

  2. The query planner generates an exact lookup from the predefined metric dictionary. This and the next two stages are all code — no model involved.

  3. The metric resolver fetches pre-computed aggregates. I've already pre-aggregated all the metrics I want based on the dataset, so this step just finds the right set of results that are already calculated. This is also where the harness trades flexibility for consistency — because it only knows about pre-aggregated metrics, it can't answer questions about specific restaurants. That tradeoff comes back later.

  4. The validator checks sample sizes and data sufficiency — is this data trustworthy enough to return?

  5. The narrator turns the structured result back into readable prose — adding context about what the numbers mean and what the data doesn't cover.

The model only touches the edges — translating a question on the way in and narrating an answer on the way out. Everything in between is just code.
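The five stages can be sketched roughly like this. The two model stages (classifier and narrator) are stubbed with deterministic placeholders here so the data flow is visible — a real implementation would call the LLM at those two points. All names and values are illustrative, not Cocina's actual internals:

```typescript
type StructuredQuery = { metric: string; dimension: string; filter: string };
type MetricResult = { value: number; sampleSize: number };

// Stage 1 (model): map a natural-language question to a structured query.
// Stubbed; the real version would ask the LLM for constrained JSON output.
function classifyIntent(question: string): StructuredQuery {
  return { metric: "health_score_avg", dimension: "neighborhood", filter: "Koreatown" };
}

// Stage 2 (code): turn the structured query into an exact lookup key.
function planQuery(q: StructuredQuery): string {
  return `${q.metric}:${q.dimension}:${q.filter}`;
}

// Stage 3 (code): fetch a pre-computed aggregate. Values here are made up.
const AGGREGATES: Record<string, MetricResult> = {
  "health_score_avg:neighborhood:Koreatown": { value: 92.4, sampleSize: 340 },
};
function resolveMetric(key: string): MetricResult | undefined {
  return AGGREGATES[key];
}

// Stage 4 (code): is there enough data to answer confidently?
function validate(r: MetricResult, minSampleSize = 5): boolean {
  return r.sampleSize >= minSampleSize;
}

// Stage 5 (model): narrate the structured result. Stubbed as a template.
function narrate(q: StructuredQuery, r: MetricResult): string {
  return `Average health score in ${q.filter}: ${r.value} (n=${r.sampleSize}).`;
}

function answer(question: string): string {
  const q = classifyIntent(question);
  const result = resolveMetric(planQuery(q));
  if (!result || !validate(result)) return "Not enough data to answer confidently.";
  return narrate(q, result);
}
```

The shape is the point: the model never sees a number it computed itself — every value the narrator touches came out of the lookup table.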

Crafting a definition layer

When someone asks "how clean are Echo Park's restaurants?" — what does "clean" even mean? Average health score? Percentage of A grades? Violation rate? Repeat offender frequency? These are all legitimate answers, and they tell different stories.

I had to pick. I defined a set of metrics — health score averages, grade distributions, violation rates, facility counts — and encoded them as the system's vocabulary. When you ask a question, the classifier maps it to one of these metrics. The math is pre-computed against the full dataset. The validator checks whether there's enough data to answer confidently.

// lib/cocina/metrics.ts — The Semantic Layer
// This is the heart of the harness. Each metric encodes domain knowledge
// about what "good analytics" means for restaurant health data.
import type { MetricDefinition } from "./types.ts";

export const METRICS: Record<string, MetricDefinition> = {
  health_score_avg: {
    id: "health_score_avg",
    name: "Average Health Score",
    description: "Mean inspection score (0-100) for a group of facilities",
    formula: "SUM(score) / COUNT(inspections)",
    dimensions: ["cuisine_type", "neighborhood", "year", "facility_type"],
    validation: { min_sample_size: 5 },
    unit: "points",
    interpretation: {
      "90-100": "Excellent — minimal violations",
      "80-89": "Good — minor issues",
      "70-79": "Fair — notable concerns",
      "below_70": "Poor — significant violations",
    },
  },
  ...
}

Those definitions live in version-controlled, tested code — not in a system prompt that silently changes behavior when someone edits it. Any change to a metric definition is traceable, reviewable, and caught by the eval suite before it ships. That matters for analytics — and for my pocketbook, since definitions baked into code don't cost tokens on every request. Even better, because this is code, it's testable — each metric definition has corresponding assertions in the eval suite that verify the math against ground truth from the raw dataset.

Ultimately, this is the part I think survives even if token costs drop to zero. What counts as a clean restaurant is subjective — it's a business decision, not a model decision. How you think about average health scores versus grade distributions versus repeat offender rates is the core question of any analytics problem. That judgment has to live somewhere, and someone has to make the call. You could put it in a system prompt, but then how do you audit it over time? There's no way to test it, and honestly, nobody else can even see it. Putting it in code gives you what code is good at — it's auditable, it's testable, and the rest of your team can actually weigh in on it.

The result: I can use a tiny model for the actual LLM work. The model isn't computing averages or filtering datasets — it's reading a question and narrating an answer. I could use OpenAI's GPT 5.4 nano model to get similar results, more quickly, at a fraction of the cost.

Building a harness is not for the weak (or broke)

My OpenAI dashboard for March: $216.58 💀. 55 million tokens. 2,887 API requests.

OpenAI dashboard — March spend

I burned through enough credits across a few weekends and roughly 20 hours of engineering time that OpenAI auto-promoted me from Tier 1 to Tier 3 — then Tier 3 to Tier 4 the following day.

OpenAI tier upgrade email — Tier 1 to Tier 3

Most of that was context-packed queries during development — every time I tested a question with all 106K records in context, the meter was running. The harnessed queries barely registered. I also tried caching requests, but I broke the caching code more than once and had to bust and rebuild the cache each time — burning more tokens and spending more credits with every rebuild.

There are no silver bullets in harness engineering

I wanted to make sure this thing actually works, so I wrote 25 targeted queries — each with ground truth computed from independent SQL aggregations against the raw CSV, locked as test fixtures before the pipeline existed — to stress the specific behaviors I care about: numerical precision, ranking accuracy, and edge-case handling.

These aren't meant to be exhaustive. They're smoke tests for the core use cases I built — the questions I'd be embarrassed to get wrong. If the harness can't nail these, nothing else matters.
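One of those fixtures, roughly sketched. The expected value would come from an independent SQL aggregation against the raw CSV, locked in before the pipeline existed — the names, numbers, and `runPipeline` entry point here are illustrative stand-ins:

```typescript
// A single eval case: question, ground-truth number, allowed tolerance.
type EvalCase = { question: string; expected: number; tolerance: number };

const FIXTURES: EvalCase[] = [
  // expected came from an independent query like:
  // SELECT AVG(score) FROM inspections WHERE neighborhood = 'Koreatown'
  { question: "What's the average health score in Koreatown?", expected: 92.4, tolerance: 0.05 },
];

// Hypothetical pipeline entry point; returns the number the harness computed.
function runPipeline(question: string): number {
  return 92.4; // stub standing in for the real five-stage pipeline
}

let passed = 0;
for (const c of FIXTURES) {
  const got = runPipeline(c.question);
  if (Math.abs(got - c.expected) > c.tolerance) {
    throw new Error(`FAIL: ${c.question} → ${got}, expected ${c.expected}`);
  }
  passed++;
}
console.log(`${passed}/${FIXTURES.length} numeric evals passed`);
```

Because the ground truth was computed outside the pipeline, a passing eval means the harness's math agrees with the raw data — not just with itself.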

The numbers passed: average scores, facility counts, grade distributions went 14/14. Rankings — cleanest neighborhoods, head-to-head comparisons — went 5/5. Pre-computed aggregates do the work here, which means there's no interpretation to get wrong. Behavioral queries (rejecting off-topic questions, deflecting recommendation requests) went 5/6. (Slightly better than Chipotle's support bot.)

But the more interesting story was the failure I found outside the eval suite. While testing, I asked for the best Jamaican restaurant in Leimert Park — one of the few neighborhoods in LA with Caribbean food. The harnessed result said there weren't any. This was obviously wrong. Then I asked for the worst-rated restaurant in that neighborhood — and got back a Jamaican restaurant.

And right there was the limitation of my harness. This is a consequence of pre-aggregation — the harness only knows about metrics I've defined, not individual restaurant records. A harness could be designed to handle both aggregates and row-level lookups, but that's more engineering. I chose consistency over flexibility, and this is what it costs. In an actual analytics product, this would be a massive trust buster.
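For what it's worth, the fix isn't mysterious — the classifier could route between the aggregate path and a row-level path. A hypothetical sketch of that routing (the regex stands in for a model decision, and the data is made up — this is exactly the engineering I chose to skip):

```typescript
type QueryKind = "aggregate" | "row_lookup";

// In a real harness the model would make this routing call; a crude
// keyword heuristic stands in here.
function route(question: string): QueryKind {
  return /\b(best|worst|which restaurant)\b/i.test(question) ? "row_lookup" : "aggregate";
}

type Inspection = { name: string; neighborhood: string; cuisine: string; score: number };

// Made-up rows. A row_lookup path would scan indexed raw records like
// these instead of the pre-aggregated metric dictionary.
const ROWS: Inspection[] = [
  { name: "Example Jamaican Cafe", neighborhood: "Leimert Park", cuisine: "Jamaican", score: 81 },
];

function rowLookup(neighborhood: string, cuisine: string): Inspection[] {
  return ROWS.filter((r) => r.neighborhood === neighborhood && r.cuisine === cuisine);
}
```

With routing like this, "worst Jamaican restaurant in Leimert Park" would hit the row path and find the record the aggregate path can't see — at the cost of maintaining a second query shape and everything that comes with it.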

That's the tradeoff you're making. The data-stuffed approach, for all its cost problems, found specific low-scoring restaurants by name that the harness couldn't — because it was looking at raw records, not aggregates. When a harnessed query fails, you can trace exactly why through debuggable code — and for anyone building analytics tools where the methodology needs to be defensible, that debuggability is how you earn trust. When a context-stuffed query gives you a wrong number, figuring out why is harder. (Tools like Braintrust and LangChain help with eval and observability for the stuffed approach — but having testable code you can debug locally is a different kind of confidence.)

Should you build a harness?

For exploration — ad-hoc questions where you want the model to surprise you — stuffing the context might genuinely be the better tool. Depending on how you value engineering time, you could stomach the cost and focus your efforts on something else.

But if you need the same question answered the same way every time — and you need to show your work when someone asks why the number changed — a harness gives you that.

This is not the best harness in the world, and I'd genuinely love to hear if there's an obvious flaw in my approach. It's the one I built while learning the trade-offs. I spent a few weekends and $216 on it, and I'm still finding gaps.

If models get cheap enough and reliable enough that packing a million records into context costs a fraction of a cent and returns consistent, auditable results with honest caveats — the economics argument disappears. Every generation is cheaper and smarter than the last. The last six months have shown that. I might be overengineering for a problem that solves itself in 18 months.

But the definitions layer isn't going away. Somebody has to decide what "clean" means, which metrics matter, and how to interpret the results. That logic has to live somewhere you can see it, test it, and argue about it. I'd rather that place be code than a system prompt.