Why Evals Are a Design Problem

Most teams build their LLM system first, then write evals to verify it. This is exactly backwards.

Evals are not a testing layer. They are a design constraint. The moment you treat them as upstream rather than downstream, your entire approach to building LLM systems changes.

What an eval actually is

An eval is a formalized answer to the question: how would I know if this worked?

That question is not a testing question. It is a specification question. If you cannot answer it before you build, you do not understand the problem well enough to build a solution.

This is true for all software, but it bites harder with LLMs because the output space is enormous. There is no type error. There is no failed assertion at the call site. The system produces something, and you have to decide whether that something is good.

Without an eval, "good" is just vibes.

The structure of your eval shapes your system

Here is a pattern I have seen repeatedly: a team builds an agent, ships it, gets complaints, and then tries to write evals. They discover the evals are hard to write because the system's outputs are hard to measure. So they approximate, or they give up, or they write evals that measure proxies rather than the thing they actually care about.

The evals are hard to write because the system was not designed to be evaluated.

When you write the eval first, you are forced to ask: what is the unit of output I care about? How do I isolate it? How do I verify it without reading every response myself? Answering these questions produces structural insights that feed directly back into your system design.

You end up with cleaner output schemas. More explicit state. Better-defined intermediate steps. Not because you planned it that way, but because the eval required it.

What this means in practice

Before you write any LLM code, write one eval case. Not a suite. Just one. The most important thing the system should be able to do. Write it in whatever form you can: a test function, a checklist, a prompt that grades the output.

Then ask: can I run this evaluation automatically? If not, why not?

The gap between "I have a sense of whether this is working" and "I can run this evaluation automatically" is where most of the hard design work lives. Close that gap before you build anything else.

The eval is not a safety net you add after the fact. It is the contract you write before the work begins.