How to evaluate an AI feature before you ship it

Most AI feature launches we see skip the evaluation step entirely. The team builds the feature, the demos go well, the launch ships, and the feature begins quietly hallucinating at customers. Within a quarter, the team has a backlog of "the AI said something weird in this case" tickets, no systematic way to know whether the model is improving or degrading, and a vague sense of unease about whether to invest more or pull back. The cause is the same in every case: nobody set up evals.

An eval is just a test suite for a non-deterministic system. You write down: "Here are 100 representative inputs. For each one, here's what a good output looks like, or here are the things a good output should/shouldn't contain." You run the model on all 100 inputs. You measure how often the output meets your criteria. That number is your baseline. Every change you make — a new model, a prompt tweak, a different retrieval setup — gets compared to the baseline. If the number goes up, you've improved. If it goes down, you've regressed.
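As a minimal sketch of that loop, assuming the feature can be called as a plain function from prompt to text, the whole thing fits in a few dozen lines. The names here (`EvalCase`, `passes`, `pass_rate`, `fake_model`) and the sample cases are hypothetical, chosen for illustration; the substring and word-count checks stand in for whatever criteria fit your feature.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    """One representative input plus the checks a good output must satisfy."""
    prompt: str
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)
    max_words: Optional[int] = None


def passes(case: EvalCase, output: str) -> bool:
    """Return True only if the output meets every criterion on the case."""
    text = output.lower()
    if any(s.lower() not in text for s in case.must_contain):
        return False
    if any(s.lower() in text for s in case.must_not_contain):
        return False
    if case.max_words is not None and len(output.split()) > case.max_words:
        return False
    return True


def pass_rate(cases: List[EvalCase], model: Callable[[str], str]) -> float:
    """Run the model on every case and return the fraction that pass."""
    results = [passes(case, model(case.prompt)) for case in cases]
    return sum(results) / len(results)


# Stand-in for the real feature (an assumption); swap in the actual model call.
def fake_model(prompt: str) -> str:
    if "refund" in prompt:
        return "Refunds are available within 30 days of purchase."
    return "I am not sure."


cases = [
    EvalCase("What is the refund window?", must_contain=["30 days"], max_words=200),
    EvalCase("Can I get a refund after a year?", must_not_contain=["guaranteed"]),
    EvalCase("What is your shipping policy?", must_contain=["shipping"]),
]

baseline = pass_rate(cases, fake_model)
print(f"pass rate: {baseline:.0%}")  # 2 of 3 cases pass with the stand-in model
```

Because the model is passed in as a function, comparing a new prompt or a different model means calling `pass_rate` again with the same cases and a different callable.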

The criteria don't all have to be quantitative. Some can be: "the output should not exceed 200 words." Some have to be judgmental: "the output should answer the question asked." For the judgmental ones, you can use another LLM as a judge, with a clear rubric such as "is this answer accurate based on the source document? grade 1 to 5." The LLM judge has its own biases and noise, which is why you manually spot-check a sample of its grades, but it scales the eval to hundreds of examples in a way human review can't.
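One way to wire up the judge and the spot-check, sketched under assumptions: the judge is any function from a prompt string to a reply string, and `RUBRIC`, `judge_grade`, `spot_check_sample`, and `fake_judge` are hypothetical names rather than any library's API. The stand-in judge exists only so the example runs; in practice it would be a call to whichever LLM you use for grading.

```python
import random
import re
from typing import Callable, Dict, List, Optional

# Rubric wording taken from the example above; the template layout is an assumption.
RUBRIC = (
    "Is this answer accurate based on the source document? "
    "Reply with a single grade from 1 to 5.\n\n"
    "Source document:\n{source}\n\nAnswer:\n{answer}\n\nGrade:"
)


def judge_grade(judge: Callable[[str], str], source: str, answer: str) -> Optional[int]:
    """Ask the judge for a grade and parse the first standalone digit from 1 to 5."""
    reply = judge(RUBRIC.format(source=source, answer=answer))
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else None  # None means the reply was unparseable


def spot_check_sample(graded: List[Dict], k: int, seed: int = 0) -> List[Dict]:
    """Pick a reproducible random sample of graded items for manual review."""
    rng = random.Random(seed)
    return rng.sample(graded, min(k, len(graded)))


# Stand-in judge (an assumption): looks only at the text after "Answer:".
def fake_judge(prompt: str) -> str:
    answer_part = prompt.rsplit("Answer:", 1)[1]
    return "Grade: 4" if "30 days" in answer_part else "Grade: 1"


pairs = [
    {"source": "Refunds are accepted within 30 days.",
     "answer": "You have 30 days to request a refund."},
    {"source": "Refunds are accepted within 30 days.",
     "answer": "Refunds are never accepted."},
]

graded = [{**p, "grade": judge_grade(fake_judge, p["source"], p["answer"])} for p in pairs]
for item in spot_check_sample(graded, k=1):
    print(item["grade"], "|", item["answer"])  # a human reviews these by hand
```

Returning `None` for an unparseable reply, instead of guessing a grade, keeps judge failures visible so they can be counted separately from low scores.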

The bar isn't that the eval is perfect. The bar is that the eval exists. A 70%-accurate eval that runs in 5 minutes on every change is far more useful than a perfect eval that nobody runs. Start with 20 examples. Add more as you find failure modes the eval missed. Iterate on the rubric. By the time the feature ships, you should have a number that represents the feature's quality, and a way to know, the next time you change something, whether you made it better or worse.
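The "better or worse" check can be as small as a stored number and a comparison. The sketch below assumes the score is a pass rate between 0 and 1 kept in a JSON file; `compare_to_baseline`, the file name, and the `tolerance` parameter are hypothetical. The tolerance is an added assumption to absorb run-to-run noise: with 20 examples, a single flipped case moves the score by 5 points.

```python
import json
import tempfile
from pathlib import Path


def compare_to_baseline(score: float, baseline_path: Path, tolerance: float = 0.0) -> str:
    """Compare a new score to the stored baseline.

    Returns "new" (no baseline existed, so this score is recorded),
    "regressed", "improved", or "same". The stored baseline is only
    written on the first run; updating it later is left to a human.
    """
    if not baseline_path.exists():
        baseline_path.write_text(json.dumps({"score": score}))
        return "new"
    baseline = json.loads(baseline_path.read_text())["score"]
    if score < baseline - tolerance:
        return "regressed"
    if score > baseline + tolerance:
        return "improved"
    return "same"


# Usage with a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_baseline.json"
    first = compare_to_baseline(0.70, path)                   # records 0.70
    second = compare_to_baseline(0.65, path, tolerance=0.02)  # below 0.68
    third = compare_to_baseline(0.75, path, tolerance=0.02)   # above 0.72
    print(first, second, third)  # prints "new regressed improved"
```

Run on every change, a "regressed" result is the signal to look at which examples flipped before merging.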
