
Technical Strategy

Evals: how to know if your AI product is actually working

Most founders ship an AI feature, watch it work in a demo, and call it done. Then users find the edge cases. Here’s how to build a feedback loop that catches problems before your users do.

There’s a specific kind of panic I’ve watched founders experience a few weeks after launch. The demo looked great. Early users seemed happy. Then someone posts a screenshot on Twitter showing the AI generated something confidently wrong — and now you’re doing damage control on a bug you had no visibility into.

The root cause is almost always the same: they shipped with no evaluation layer. Just a prompt, an API call, and a response rendered directly to the user. No checks. No logging. No way to know the difference between “this output was great” and “this output was plausible but wrong.”

This is the part of AI product development that doesn’t get enough attention. Everyone talks about prompts, models, and architecture. Almost nobody talks about evals — and yet evals are what determine whether your AI product is actually reliable or just lucky in demos.

What “evals” actually means

Evals is short for evaluations. In the AI world it means: systematic ways of measuring whether your model’s outputs are good. That definition sounds obvious, but the execution is the hard part.

For traditional software, testing is well-understood. You write a function, you write a test, you assert that input X produces output Y. The outputs are deterministic. Either the test passes or it doesn’t.

AI outputs aren’t deterministic. Ask the same question twice and you might get two different answers, both of which are technically correct, or one of which is subtly wrong in a way that’s hard to detect automatically. So you can’t just assert equality. You need a different framework for what “correct” means — and that framework has to be built intentionally, because it won’t emerge on its own.
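To make the difference concrete, here’s a minimal sketch in Python. The `word_count` function stands in for ordinary deterministic code; `check_summary` is a hypothetical property check for an LLM-generated summary, asserting characteristics of the output rather than an exact string.

```python
# Traditional software: deterministic, so an exact assertion works.
def word_count(text: str) -> int:
    return len(text.split())

assert word_count("ship and hope") == 3  # passes or fails, nothing in between


# LLM output: non-deterministic, so assert properties of the output instead.
def check_summary(source: str, summary: str) -> list[str]:
    """Return the list of failed checks rather than a single pass/fail."""
    failures = []
    if len(summary) > len(source):  # a summary should compress its source
        failures.append("summary longer than source")
    if "as an ai language model" in summary.lower():  # boilerplate the product should never show
        failures.append("contains boilerplate disclaimer")
    return failures
```

Notice that the second check returns a list of failures rather than a boolean: when an output can be wrong in several ways at once, you want to know which way.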

The four levels of eval maturity

I think about eval maturity in four levels, from zero to three. Most early-stage products are at zero. Getting to level one is what I push for before launch; levels two and three come later.

Level 0: Ship and hope. No logging, no evaluation, no feedback loop. You’ll find out something’s wrong when a user tells you — or doesn’t. This is where most V1 products live, and it’s the most dangerous place to be.

Level 1: Log everything, evaluate something. Every LLM call is logged with its input, output, and any relevant context. You’ve defined a handful of simple automated checks — does the output meet a minimum length? Does it avoid certain phrases? Does it parse as valid JSON if it’s supposed to be structured? You review failures manually. You’re building intuition about where the model breaks. (There’s a sketch of this setup in code a little further down.)

Level 2: A golden dataset and regression tests. You’ve collected real examples where you know what a good output looks like — this is your golden dataset. Every time you change a prompt or update a model, you run your outputs against this dataset and check for regressions. You catch problems before they reach users.

Level 3: LLM-as-judge and continuous eval. You’re using another LLM (or a fine-tuned classifier) to evaluate outputs at scale. You have dashboards, alerts, and automated tests running on every deploy. This is where mature AI teams operate.

For a V1, I want level 1 before launch and a path to level 2 within the first month of real users. Level 3 comes when you have enough volume to make it worth building.
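In practice, level 1 is a small amount of code. Here’s a minimal sketch, assuming a local JSONL file as the log store and leaving the actual model call out of scope; the checks and thresholds are illustrative, not recommendations.

```python
import json
import time
import uuid

LOG_PATH = "llm_calls.jsonl"  # hypothetical local log store; swap in your real database


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def run_checks(output: str, expects_json: bool = False) -> dict:
    """A handful of cheap, automated level-1 checks. Illustrative thresholds only."""
    checks = {
        "min_length": len(output) >= 50,
        "no_forbidden_phrases": "as an ai language model" not in output.lower(),
    }
    if expects_json:
        checks["valid_json"] = is_valid_json(output)
    return checks


def log_call(prompt: str, output: str, context: dict, expects_json: bool = False) -> str:
    """Append every LLM call, its context, and its check results to the log. Returns the record id."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt": prompt,
        "output": output,
        "context": context,  # user id, feature name, prompt version, model, etc.
        "checks": run_checks(output, expects_json),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["id"]
```

Reviewing failures is then just filtering the log for records where a check came back false. The specific checks matter less than the fact that every call leaves a trace you can go back to.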

A minimal eval pipeline: every output is logged, failures are reviewed, and reviewed examples become a regression suite that runs on each deploy


What to actually evaluate

This is where a lot of teams get stuck. They know they should evaluate something, but they don’t know what. The answer depends heavily on your specific product, but there’s a set of properties worth checking for almost any AI output.

Correctness. Is the output factually accurate or logically sound? This is hardest to automate — but even partial checks help. If you’re generating summaries, does the output contain all the key entities from the source? If you’re answering questions, does the answer appear in the context you provided? (A sketch of this kind of partial check follows the list.)

Format compliance. If you asked for JSON, is it valid JSON? If you asked for bullet points, are there bullet points? Format failures are easy to detect programmatically and surprisingly common.

Tone and voice. Does the output match your product’s voice? For a professional tool, is it appropriately formal? For a consumer app, is it natural and conversational? Hard to automate fully, but LLM-as-judge works reasonably well here.

Safety and policy compliance. Did the output avoid things it shouldn’t say? This one’s especially important for products in sensitive domains — healthcare, legal, finance, anything where a wrong answer has real consequences.

Latency and cost. Not about quality, but worth tracking. A prompt that works but costs three times what you budgeted will bite you at scale.
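To show how a couple of these turn into code, here’s a sketch of a partial correctness check for summaries (what fraction of the key source entities made it into the output?) and a tone check delegated to an LLM-as-judge. The `call_llm` argument is a placeholder for whatever function wraps your model API, and the judge prompt is just one way to phrase it.

```python
def entities_covered(source_entities: list[str], summary: str) -> float:
    """Partial correctness check: what fraction of key source entities appear in the summary?"""
    if not source_entities:
        return 1.0
    summary_lower = summary.lower()
    hits = sum(1 for entity in source_entities if entity.lower() in summary_lower)
    return hits / len(source_entities)


JUDGE_PROMPT = """You are reviewing output from a professional writing tool.
Rate the following text for tone on a scale of 1 to 5, where 5 means
appropriately formal and 1 means far too casual. Reply with the number only.

Text:
{output}"""


def judge_tone(output: str, call_llm) -> int:
    """LLM-as-judge sketch: call_llm is whatever function wraps your model API."""
    reply = call_llm(JUDGE_PROMPT.format(output=output))
    # Assumes the judge follows the "number only" instruction; real code
    # should handle replies that don't parse.
    return int(reply.strip())
```

Entity coverage won’t catch every factual error, but it reliably flags summaries that silently dropped something important, which is a common failure mode.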

The golden dataset: your most valuable asset

If I had to pick one thing to get right early, it’s the golden dataset. This is a curated set of inputs paired with outputs you’ve judged to be good. It’s your ground truth.

You don’t need many to start. Thirty to fifty examples covers a lot of ground for a V1. The key is that they represent the real range of what users will actually send — including the weird edge cases, the short inputs, the inputs with typos, the ones where the right answer is “I don’t know.”

Where do these come from? Some from your own testing before launch. More from real user interactions once you’re live — you review the logs, mark good examples, mark failures, and over time the dataset grows. It’s a flywheel. More users means more data, which means a better eval suite, which means fewer regressions when you change things.

The practical implication: you need logging from day one. Not optional. Every call, every output, stored somewhere reviewable. Without that, you’re flying blind and you can’t build the dataset that makes everything else work.
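The golden dataset itself doesn’t need special tooling; a JSONL file in your repo is enough. Here’s a minimal sketch, assuming the log format from the earlier example: each entry pairs an input with either a reference output or the properties a good output must satisfy, plus a note explaining why the example is in the set.

```python
# golden.jsonl -- one reviewed example per line, for instance:
# {"input": "Summarise this update: ...",
#  "must_contain": ["Q3 revenue", "churn"],
#  "note": "real user report; short input with a typo"}

import json


def load_golden(path: str = "golden.jsonl") -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def promote_to_golden(log_record: dict, note: str, path: str = "golden.jsonl") -> None:
    """Turn a reviewed log entry into a golden example."""
    example = {
        "input": log_record["prompt"],
        "reference_output": log_record["output"],  # or replace with a corrected version
        "note": note,
    }
    with open(path, "a") as f:
        f.write(json.dumps(example) + "\n")
```

The note field looks optional. It isn’t: months later, it’s the only thing that tells you why a strange-looking example deserves to stay in the set.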

The prompt change problem

Here’s a scenario I’ve seen play out more than once: you tweak a prompt to fix one thing, and it breaks something else. A small change in wording shifts the model’s behavior in ways you didn’t anticipate. Without a regression suite, you won’t catch it until a user does.

Prompts are code. They should be version-controlled (they’re in your repo, right?), and changes to them should trigger an eval run. This doesn’t need to be elaborate — even running your fifty golden examples through the new prompt and comparing outputs manually takes an hour and catches real problems.

Once you have the infrastructure, you can automate this as part of CI. Before you have that, doing it manually is still dramatically better than not doing it at all. The goal is to make “how do we know this prompt change didn’t break anything” a question you can actually answer.
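The regression run itself can be a short script: generate an output for every golden example with the candidate prompt, apply the same checks, and collect anything that fails. A sketch, reusing the hypothetical helpers from the earlier examples (`load_golden`, `run_checks`, `call_llm`):

```python
def run_regression(candidate_prompt: str, call_llm) -> list[dict]:
    """Run every golden example through the candidate prompt and collect anything that fails."""
    failures = []
    for example in load_golden():
        # Assumes the prompt template has an {input} slot.
        output = call_llm(candidate_prompt.format(input=example["input"]))
        checks = run_checks(output)
        missing = [
            phrase for phrase in example.get("must_contain", [])
            if phrase.lower() not in output.lower()
        ]
        if missing or not all(checks.values()):
            failures.append({
                "input": example["input"],
                "output": output,
                "missing": missing,
                "failed_checks": [name for name, ok in checks.items() if not ok],
            })
    return failures


# Gate the deploy (or just eyeball the list) on the result:
# failures = run_regression(NEW_PROMPT, call_llm)
# assert not failures, f"{len(failures)} golden examples regressed"
```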

A note on user feedback as an eval signal

Don’t underestimate the value of simple thumbs-up / thumbs-down on your AI outputs. It feels almost too basic to bother with. It isn’t. Real users telling you which outputs they found useful is signal you can’t generate synthetically, and it maps directly to the thing you actually care about: does this output help the person who asked for it?

I’ve seen products where the automated evals looked fine but the thumbs-down rate was quietly climbing. Users were marking things wrong that passed all the technical checks — the output was grammatically correct, format-compliant, and on topic, but it wasn’t actually useful to them. That feedback loop only exists if you build it.
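Wiring that feedback into the same log is what makes it usable later. A minimal sketch: the thumbs widget posts the id you got back when the call was logged, plus a verdict, and reviewing thumbs-down records against the original outputs is how you find the failures your automated checks missed.

```python
import json
import time

FEEDBACK_PATH = "feedback.jsonl"  # hypothetical store; join on the call id from the LLM log


def record_feedback(call_id: str, verdict: str, comment: str = "") -> None:
    """verdict is 'up' or 'down'; call_id is the id returned when the call was logged."""
    entry = {"call_id": call_id, "verdict": verdict, "comment": comment, "ts": time.time()}
    with open(FEEDBACK_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```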

So: log your outputs. Add a feedback mechanism. Build a golden dataset from both. Run regressions before deploying prompt changes. None of this is technically complex. It’s just discipline. And the founders who have it ship more confidently, iterate faster, and avoid the late-night panics.

Founder Takeaway

Before you ship any AI feature, answer two questions: where are outputs being logged, and what happens when one is bad? If you can’t answer both, you have a blind spot that will eventually bite you.

Start minimal: log every LLM call, write three to five automated checks for the most obvious failure modes (format, forbidden phrases, basic sanity), and add a thumbs-down button so users can flag problems. That’s your eval foundation. As you accumulate real failures, turn them into a golden dataset and start running regressions before prompt changes. You don’t need a sophisticated platform for this — a spreadsheet and a script is enough to start. The habit matters more than the tooling.