Most teams ship AI features without knowing whether they got better or worse after the last release. That's not a technology problem. It's a product discipline problem.
After this page, you'll understand:
- The difference between offline evals (golden sets, regression suites) and online evals (implicit signals, LLM-as-judge)
- Why evals are harder than the model work — and the specific places teams fail
- How to build a release gate that prevents AI regressions from reaching users
- What a PM owns in the eval pipeline vs. what engineering owns
If you have shipped one AI feature, you have probably had this experience: the team makes a change — a new model, a tweaked prompt, a different retrieval strategy — and declares success because the demo looks better. Three weeks later, a user files a support ticket. You pull the logs. The feature has quietly gotten worse on a class of inputs nobody was testing.
Evals are the engineering discipline that prevents this. They are also the thing that most PM teams outsource to engineers, ignore entirely, or conflate with A/B testing. This page covers what an eval actually is, the two categories that matter, and what a PM must own even if an engineer builds the pipeline.
Why this is the hard part
The technical challenge of AI products is not picking a model or writing a prompt. Those decisions are reversible and improvable. The hard part is knowing whether you improved or regressed across the full distribution of real inputs — and doing that fast enough to ship with confidence.
Consider what you are actually testing. A traditional feature either works or it doesn't. A form field that validates email addresses either accepts "notanemail" or it doesn't. AI outputs are probabilistic, open-ended, and context-dependent. There is no single right answer. Quality is a distribution, not a Boolean. And the distribution of real user inputs is always messier than your dev environment suggested.
This is compounded by the evaluation bootstrap problem: to build a good eval set, you need to know what "good" looks like. To know what "good" looks like, you need to have seen a lot of outputs. Most teams start with neither.
The PM's job is to supply the judgment that defines "good" before engineering builds the measurement infrastructure around it. If you cannot articulate what a high-quality output looks like, and what a poor-quality output looks like, in terms specific enough that a stranger could rate it consistently, you cannot build evals. You can only build vibes.
Offline evals: what they are and how to build them
An offline eval runs against a fixed dataset before deployment. Think of it as a test suite, but for probabilistic outputs.
The core artifact is a golden set: a collection of (input, expected output, quality rubric) triples that represent the real distribution of your use case. "Golden" means human-verified — at some point, a person (usually you or a domain expert) looked at these inputs and outputs and said "yes, this is what good looks like."
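As a minimal sketch, one golden-set entry could be represented in Python like this. The field names, the extra `bucket` and `verified_by` fields, and the sample refund text are illustrative assumptions, not the schema of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenEntry:
    """One human-verified (input, expected output, rubric) triple."""
    input_text: str
    expected_output: str
    # Rubric: quality dimension name -> what a rater should look for.
    rubric: dict[str, str]
    # Hypothetical extras, handy for slicing results later.
    bucket: str = "general"
    verified_by: str = "unknown"


# Invented example entry for a support-reply feature.
golden_set = [
    GoldenEntry(
        input_text="Where is my refund?",
        expected_output="Refunds post within 5 business days. Check status under Orders.",
        rubric={
            "factual_accuracy": "States only what the refund policy says.",
            "action_completeness": "Tells the user what to do next.",
        },
        bucket="short_query",
        verified_by="pm",
    ),
]
```

Keeping the rubric next to each entry means a rater, human or automated, always sees the criteria that were used when the entry was verified.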
Building a golden set:
- Collect real inputs. Pull 200-500 actual user queries or documents from production (if you have a live product) or synthesize them to cover known edge cases (if pre-launch). Real inputs are almost always more diverse and surprising than synthetic ones.
- Define quality dimensions explicitly. Vague criteria ("is it helpful?") produce inconsistent ratings. Concrete criteria produce consistent ones. For a customer support draft-generation feature, your dimensions might be: (a) factual accuracy, (b) appropriate tone, (c) action completeness (does it tell the user what to do?), (d) no hallucinated policy. Each dimension is rated 1-3.
- Score 50-100 examples with human labels. This is labor-intensive, but there is no substitute for the first pass. You are training your intuition about what quality looks like, not just the eval pipeline. Consider this product research, not overhead.
- Write automated checks for what you can. Some dimensions can be tested programmatically: does the output contain a link? Is it under 300 words? Does it mention a specific required disclaimer? Automated checks are cheap, fast, and reproducible. Human rating is expensive and slow. Push everything you can to automated checks, and reserve human rating for the dimensions that genuinely require judgment.
- Add regression cases as you find failures. Every production failure or near-miss should be converted into a golden set entry. The eval grows as you ship and learn.
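The three automated checks named in the list above (link present, under 300 words, required disclaimer) can be sketched in a few lines of Python. The disclaimer text and the function names are assumptions made for illustration.

```python
import re

REQUIRED_DISCLAIMER = "This is not legal advice."  # hypothetical disclaimer text
MAX_WORDS = 300


def contains_link(output: str) -> bool:
    """True if the output contains an http(s) URL."""
    return re.search(r"https?://\S+", output) is not None


def under_word_limit(output: str, limit: int = MAX_WORDS) -> bool:
    """True if the whitespace-separated word count is strictly below the limit."""
    return len(output.split()) < limit


def has_disclaimer(output: str, disclaimer: str = REQUIRED_DISCLAIMER) -> bool:
    """Case-insensitive check for the required disclaimer sentence."""
    return disclaimer.lower() in output.lower()


def run_checks(output: str) -> dict[str, bool]:
    """Run every programmatic check and report pass/fail per check."""
    return {
        "contains_link": contains_link(output),
        "under_word_limit": under_word_limit(output),
        "has_disclaimer": has_disclaimer(output),
    }
```

Checks like these only cover the mechanical dimensions. Tone and factual accuracy still need a rubric and a rater.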
A regression suite is the golden set run in CI (continuous integration) on every model or prompt change. If your regression score drops below a threshold, the PR does not merge. This is the mechanism that prevents "the demo looks better" from masking a degradation on your long-tail inputs.
Run the full regression suite on every prompt, model, or retrieval change — not just the happy-path examples. Improvement on the top cases while regressing on edge cases is not improvement; it is a tradeoff you have not been shown.
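A rough sketch of a bucket-level regression check, assuming each eval result has already been reduced to a `(bucket, score)` pair with scores between 0 and 1. The bucket names, the 0.03 tolerance, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

TOLERANCE = 0.03  # hypothetical: largest acceptable drop in a bucket's mean score


def mean_by_bucket(results):
    """results: iterable of (bucket, score) pairs -> {bucket: mean score}."""
    grouped = defaultdict(list)
    for bucket, score in results:
        grouped[bucket].append(score)
    return {bucket: mean(scores) for bucket, scores in grouped.items()}


def regressed_buckets(baseline, candidate, tolerance=TOLERANCE):
    """Return {bucket: drop} for every bucket that fell by more than the tolerance."""
    base = mean_by_bucket(baseline)
    cand = mean_by_bucket(candidate)
    failures = {}
    for bucket, base_score in base.items():
        # A bucket missing from the candidate run is treated as a score of 0.
        drop = base_score - cand.get(bucket, 0.0)
        if drop > tolerance:
            failures[bucket] = round(drop, 4)
    return failures


# Made-up scores: the candidate is better on "top" and worse on "short_query".
baseline_run = [("top", 0.8), ("top", 0.9), ("short_query", 0.7), ("short_query", 0.7)]
candidate_run = [("top", 0.95), ("top", 0.95), ("short_query", 0.6), ("short_query", 0.5)]
failures = regressed_buckets(baseline_run, candidate_run)
```

In CI, a non-empty `failures` dict would translate into a failing check. Comparing per bucket rather than on one overall average is what keeps a gain on the top cases from hiding a loss elsewhere.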
Pre-launch review for a new prompt version. PM and ML engineer reviewing eval results.
ML Engineer: “The new prompt improves our top-10 golden set examples by 18%. It's clearly better.”
PM: “What happened to the full 400-entry regression suite?”
ML Engineer: “Uh. I only ran it on the top examples — those are the representative ones.”
PM: “The top examples are the ones that were already working well. What I care about is whether we regressed on the edge cases. Pull the full regression report.”
ML Engineer: “...it's 12% worse on the 'short query' bucket. The model is getting confused by two-word queries.”
PM: “That's our most common query type. This doesn't ship.”
The PM had built the regression suite two months ago after a similar incident. The ML engineer had started testing around it. The discipline held.
Improvement on the happy path is not improvement overall. The regression suite is what catches the 'better on top cases, worse everywhere else' failure mode.
Online evals: what they are and how to read them
Online evals measure quality in production, in real time, on real user behavior. They are complementary to offline evals, not a replacement.
The three main categories:
Explicit signals. Thumbs up/down, star ratings, "Was this helpful?" These are high-signal but low-volume. Users rarely rate — typical explicit feedback rates run 1-3% of queries. Don't design your eval pipeline to depend on explicit signals as the primary metric.
Implicit signals. What users do after the AI output, not what they say about it. Did the user copy the text? Edit it heavily or lightly? Proceed with the suggested action? Abandon the flow? These signals are high-volume and low-noise when designed carefully. Cursor, for example, tracks whether generated code is accepted, deleted, or modified — that is a cleaner quality signal than "did the user like the code?"
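One way to turn "edited heavily or lightly" into a number is to compare the generated text with what the user finally kept. The sketch below uses Python's standard `difflib`; the 0.8 cutoff and the label names are assumptions, not established thresholds.

```python
from difflib import SequenceMatcher


def retention_ratio(generated: str, final: str) -> float:
    """Similarity in [0, 1] between the model's draft and the text the user kept."""
    return SequenceMatcher(None, generated, final).ratio()


def classify_edit(generated: str, final: str, light_edit_floor: float = 0.8) -> str:
    """Bucket the user's response to a draft. The threshold is hypothetical."""
    if not final.strip():
        return "abandoned"
    ratio = retention_ratio(generated, final)
    if ratio == 1.0:
        return "accepted"
    if ratio >= light_edit_floor:
        return "light_edit"
    return "heavy_edit"
```

The share of outputs landing in each label, tracked over time, gives a volume signal that does not depend on users choosing to rate anything.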
LLM-as-judge. Run a second LLM over a sample of your production outputs and ask it to score quality against your rubric. This scales where human rating doesn't, costs much less than human annotation, and is surprisingly effective for well-defined quality dimensions. It is not a substitute for human judgment on novel failure modes: a judge model that shares an architecture and training distribution with the evaluated model tends to share its blind spots. The practical rule: use LLM-as-judge for monitoring at scale, human judgment for triage and rubric definition.
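A minimal sketch of the judge loop, assuming a 1-3 scale per rubric dimension. The model call is passed in as a plain function (`call_judge`, a hypothetical name) so the example does not depend on any specific vendor API; the rubric text is invented.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric: dimension name -> what a top score looks like.
RUBRIC = {
    "factual_accuracy": "Every claim is supported by the provided policy text.",
    "tone": "Polite, direct, and free of blame.",
}


def build_judge_prompt(user_input: str, output: str, dimension: str, criterion: str) -> str:
    """Assemble a single-dimension rating prompt for the judge model."""
    return (
        f"You are rating an AI-generated reply on one dimension: {dimension}.\n"
        f"A top score means: {criterion}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Reply to rate:\n{output}\n\n"
        "Answer with a single digit: 1 (poor), 2 (acceptable), or 3 (good)."
    )


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-3 in the judge's reply, or None."""
    match = re.search(r"\b([1-3])\b", reply)
    return int(match.group(1)) if match else None


def judge_output(user_input: str, output: str,
                 call_judge: Callable[[str], str],
                 rubric: dict[str, str] = RUBRIC) -> dict[str, Optional[int]]:
    """Score one production output on every rubric dimension."""
    scores = {}
    for dimension, criterion in rubric.items():
        prompt = build_judge_prompt(user_input, output, dimension, criterion)
        scores[dimension] = parse_score(call_judge(prompt))
    return scores
```

Scoring one dimension per call keeps each judgment narrow. A `None` score marks a reply the parser could not read, which is worth counting rather than silently dropping.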
What to monitor on day one:
- Thumbs-down rate (or whatever explicit feedback you surface)
- Refusal rate (model refusing to answer — may indicate prompt engineering issues or policy friction)
- Empty / truncated output rate (a symptom of context window issues or formatting problems)
- Task completion rate (if your feature has a clear downstream action: did the user complete it after seeing the AI output?)
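The four day-one metrics can be computed from a simple interaction log. This is a sketch under assumed field names (`refused`, `truncated`, `thumbs`, `task_completed`); a real logging schema will differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One logged AI response. All field names are hypothetical."""
    output: str
    refused: bool = False
    truncated: bool = False
    thumbs: Optional[str] = None  # "up", "down", or None if the user did not rate
    task_completed: bool = False


def day_one_metrics(logs: list[Interaction]) -> dict[str, float]:
    """Compute the four day-one rates; returns {} for an empty log."""
    total = len(logs)
    if total == 0:
        return {}
    rated = [x for x in logs if x.thumbs is not None]
    thumbs_down = sum(x.thumbs == "down" for x in rated)
    return {
        # Thumbs-down rate is measured among rated interactions only.
        "thumbs_down_rate": thumbs_down / len(rated) if rated else 0.0,
        "refusal_rate": sum(x.refused for x in logs) / total,
        "empty_or_truncated_rate": sum(
            (not x.output.strip()) or x.truncated for x in logs
        ) / total,
        "task_completion_rate": sum(x.task_completed for x in logs) / total,
    }
```

Whether thumbs-down rate is divided by rated interactions or by all interactions is a design choice; the version here uses rated interactions so that low rating volume does not mask dissatisfaction.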
The release gate: connecting evals to shipping decisions
An eval system without a release gate is a measurement system with no feedback loop. The gate is what makes evals actually change behavior.
The minimum viable release gate:
- Regression suite runs on every PR that touches a prompt, model, or retrieval config
- Regression score must be ≥ baseline (or within a defined tolerance, e.g., ≤ 3% degradation on any quality dimension)
- PM reviews the eval delta report before approving the PR — not just "all green," but "what improved and what degraded and do I accept that tradeoff?"
The third step is the one engineers often skip. "All green" means "did not regress below threshold." It does not mean "is an improvement worth shipping." The PM's job at the release gate is to read the eval delta as a tradeoff: this change improved [quality dimension A] and degraded [quality dimension B]. Is that a trade worth making for users?
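To make the difference between "all green" and "worth shipping" concrete, here is a sketch of an eval delta report. It assumes mean rubric scores per quality dimension and a relative tolerance of 3%; the function name, report keys, and sample scores are invented for illustration.

```python
def eval_delta(baseline: dict[str, float], candidate: dict[str, float],
               tolerance: float = 0.03) -> dict:
    """Compare per-dimension scores; changes are relative to the baseline score."""
    report = {"improved": {}, "degraded": {}, "blocking": {}}
    for dimension, base in baseline.items():
        change = (candidate[dimension] - base) / base  # assumes base > 0
        if change > 0:
            report["improved"][dimension] = change
        elif change < 0:
            report["degraded"][dimension] = change
            if -change > tolerance:
                # Degraded by more than the tolerance: the gate fails.
                report["blocking"][dimension] = change
    report["all_green"] = not report["blocking"]
    return report


# Made-up mean rubric scores on a 1-3 scale.
baseline = {"factual_accuracy": 2.0, "tone": 2.5}
candidate = {"factual_accuracy": 2.2, "tone": 2.45}
report = eval_delta(baseline, candidate)
```

In this example `report["all_green"]` is true, yet `report["degraded"]` still lists tone with a drop of about 2%. That non-empty `degraded` entry under a passing gate is the tradeoff the PM is expected to read and accept or reject.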
Your AI writing assistant has a model change queued for release (GPT-4o → GPT-4o-mini). The regression suite shows that the overall quality score improved by 4% (the new model follows instructions more consistently), but 'tone accuracy' dropped 9% on formal-document inputs. Users who write formal documents are 30% of your user base. The engineering team wants to ship.
The call: Do you approve the release? What is your reasoning and what do you do next?
What a PM owns in the eval pipeline
This is the most important table on this page.
| What a PM owns | What engineering owns |
|---|---|
| Defining what "good" looks like for each quality dimension | Building the eval runner and CI integration |
| Seeding and reviewing the golden set (at least the first 100 entries) | Scaling the pipeline, automating scoring |
| Setting the regression thresholds (what degradation is acceptable) | Monitoring infrastructure and alerting |
| Reading the eval delta at release time and making the ship/hold call | Storing, visualizing, and querying eval results |
| Deciding which online signals to instrument and why | Implementing the tracking events |
| Triaging production failures into golden set regression cases | Engineering the LLM-as-judge pipeline |
The reason most AI eval systems are weak is that PMs treat all of the left column as "engineering work." It isn't. The judgment work — what is good, what threshold matters, what tradeoff is worth making — is irreducibly a product decision. An engineer can build the most sophisticated eval pipeline in the world, and it will produce garbage outputs if nobody articulated what quality means.
Defining what "good" looks like, seeding the golden set, setting regression thresholds, and making the ship/hold call at release time — these are PM work, not engineering work. Outsourcing them produces a technically sophisticated pipeline that measures nothing that matters.
What to do this week
- Write down your quality dimensions for one AI feature. Not "good quality," but actual, specific, ratable dimensions. Test by asking: could a contractor I've never met rate an output consistently on this dimension using this rubric? If no, it's not concrete enough.
- Collect 50 real user inputs. Pull them from logs, user research, or production. If pre-launch, write them yourself, but make them specific and varied, not idealized. Set them aside as the seed for your golden set.
- Find your current implicit signals. What does the user do after they see your AI output? Is that tracked? If not, identify one event to instrument this sprint.
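For the first item, one way to test whether a rubric is concrete enough is to have two people rate the same outputs independently and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, a standard chance-corrected agreement statistic that the page itself does not prescribe; the ratings are made up.

```python
from collections import Counter


def percent_agreement(rater_a: list[int], rater_b: list[int]) -> float:
    """Share of items on which both raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Invented 1-3 ratings from two raters on the same eight outputs.
pm_scores = [1, 2, 3, 3, 2, 1, 3, 2]
contractor_scores = [1, 2, 3, 2, 2, 1, 3, 3]
agreement = percent_agreement(pm_scores, contractor_scores)
kappa = cohens_kappa(pm_scores, contractor_scores)
```

Low agreement on a dimension usually points at the rubric wording rather than at the raters, which makes it a useful prompt to rewrite that dimension before scaling up labeling.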
Where to go next
- Experimentation: the statistical foundation under any A/B test
- Data-Informed Decisions: how to read an eval result honestly
- Safety and Auditability: what to do when the model is wrong in consequential ways
- RAG Architecture: building retrieval systems that evals can actually test