Evaluation.

8 min read · pairs with Eval Lab

A rubric turns “good” from a feeling into a spec. Once it's on the page, you can argue with it — and so can the rest of your team.

Why this is the foundation

Once you have a rubric, every other design move gets cheaper. Comparing two prompts? Run both against the rubric and look at the deltas. Comparing models? Same. Convincing a stakeholder that v3 is better than v2? Show them the scorecard. The rubric is what turns “trust me, it's better” into something they can read.
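Here is a minimal sketch of "run both against the rubric and look at the deltas." The scores, the criterion names, and the `v2`/`v3` labels are made-up placeholders, not output from Eval Lab; the only idea it shows is averaging each criterion across cases and subtracting.

```python
from statistics import mean

# Hypothetical scores: one dict per test case, each criterion scored 1-5.
scores_v2 = [
    {"clarity": 4, "tone": 4, "actionability": 3},
    {"clarity": 3, "tone": 4, "actionability": 2},
]
scores_v3 = [
    {"clarity": 4, "tone": 2, "actionability": 4},
    {"clarity": 5, "tone": 2, "actionability": 4},
]

def criterion_means(cases):
    """Average each criterion across all scored cases."""
    criteria = cases[0].keys()
    return {c: mean(case[c] for case in cases) for c in criteria}

def deltas(before, after):
    """Per-criterion change in mean score from one version to the next."""
    b, a = criterion_means(before), criterion_means(after)
    return {c: a[c] - b[c] for c in b}

scorecard = deltas(scores_v2, scores_v3)
for criterion, change in scorecard.items():
    print(f"{criterion:>14}: {change:+.1f}")
```

With these invented numbers, clarity and actionability rise while tone falls by two points. That is the kind of scorecard a stakeholder can read without taking your word for it.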

Three things to watch for when you write one:

Criteria that aren't criteria. “Good” isn't a criterion. Neither is “sounds AI-generated.” Replace each with the specific quality you mean.
Overlap. If two criteria almost always move together, you've written one criterion twice. Merge them.
Missing the boring middle. Most outputs aren't terrible or excellent; they're 3s. A 1-5 scale captures the middle. Pass/fail doesn't.
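One way to check for overlap, sketched under assumptions: if you have scores for the same outputs on every criterion, a high correlation between two columns suggests they are measuring the same thing. The `readability` criterion, the scores, and the 0.9 threshold below are all hypothetical choices for illustration, not a rule from this lesson.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column tells us nothing about overlap
    return cov / (sx * sy)

# Hypothetical 1-5 scores for six outputs, one list per criterion.
scores = {
    "clarity":     [5, 4, 2, 3, 5, 1],
    "readability": [5, 4, 2, 3, 4, 1],
    "tone":        [3, 4, 4, 2, 3, 4],
}

def overlapping(scores, threshold=0.9):
    """Pairs of criteria whose scores move together at or above the threshold."""
    return [
        (a, b)
        for a, b in combinations(scores, 2)
        if pearson(scores[a], scores[b]) >= threshold
    ]

print(overlapping(scores))
```

A flagged pair is a prompt to reread the two descriptions, not proof that they are the same criterion. Six outputs is a small sample, so treat the number as a hint.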

The familiar move

You've run a usability study. You wrote task scenarios, decided what success looked like before the sessions started, and graded each participant against a rubric. Maybe it was numeric (completion in under 90 seconds), maybe it was thematic (did they need to back out and start over?). Either way, the rubric existed before the data, which is the whole reason it works.

Evaluating a language model is the same move, applied to a different surface. You're not asking “was the output good?”, which is too vague to ever settle. You're asking “was it clear, concise, on-tone, actionable?” One question, four dials, each scorable on its own.

The lesson

Without a rubric, every output gets graded against a moving target. The reader squints, says “hmm, not quite,” and rewrites the prompt without knowing what they were trying to improve. With a rubric, the squinting becomes a specific claim: “tone dropped from 4 to 2 when I added the formality rule.” That's the difference between tweaking and designing.

The rubric is the artifact. The scores are the receipt.

A small example

Vague rubric

Rubric

"Write good empty-state copy."

What you can say

“Was it clear? I guess? Did it match the brand voice? Sort of. Actionable? Maybe.”

Read

Every grader scores it differently because the criteria are still in their heads. You can't iterate against this.

Specific rubric

Rubric

Clarity — easy on first read, no ambiguity. Tone — warm without sliding into chirpy. Actionability — user knows what to do next. Each scored 1-5.

What you can say

“Clarity 5. Tone 3 (a hair too enthusiastic). Actionability 4. Total 12/15.”

Read

The 3 on tone is the design surface. You know exactly where to push next.

Same output, two rubrics, different conversations. The first ends in a shrug. The second ends in a list of things to try.
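The specific rubric above is small enough to write down as data. The sketch below is one possible representation, not the format Eval Lab uses: the `Criterion` class and the `total` and `weakest` helpers are names invented for this example. The criteria, descriptions, and scores come from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    description: str

# The three criteria from the specific rubric, each scored 1-5.
RUBRIC = [
    Criterion("clarity", "Easy on first read, no ambiguity."),
    Criterion("tone", "Warm without sliding into chirpy."),
    Criterion("actionability", "User knows what to do next."),
]
SCALE_MIN, SCALE_MAX = 1, 5

def total(scores):
    """Check one grader's scores against the rubric, then sum them."""
    expected = {c.name for c in RUBRIC}
    if set(scores) != expected:
        raise ValueError(f"scores must cover exactly {sorted(expected)}")
    for name, value in scores.items():
        if not SCALE_MIN <= value <= SCALE_MAX:
            raise ValueError(f"{name} is outside {SCALE_MIN}-{SCALE_MAX}")
    return sum(scores.values()), SCALE_MAX * len(RUBRIC)

def weakest(scores):
    """The lowest-scoring criterion: the place to push next."""
    return min(scores, key=scores.get)

scores = {"clarity": 5, "tone": 3, "actionability": 4}
got, possible = total(scores)
print(f"Total {got}/{possible}, push on: {weakest(scores)}")
# prints "Total 12/15, push on: tone"
```

The validation step is the point. A score for “good” is rejected because “good” is not in the rubric, which is the vague-rubric failure caught by code instead of by argument.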

Try it in the playground

Run the Eval Lab and move one criterion.


What to take into the playground

Start with the seeded rubric. Run the cases without changing anything. Score honestly. You're calibrating.
Pick the criterion where the model scored worst across cases. Tweak the system prompt to push that one number up.
Rerun. If the target criterion moved up but another moved down, you found a real trade-off — write a note about it. That's the artifact.
If every score went up, the rubric is probably too easy. Tighten a criterion description until you can score honestly again.
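The steps above amount to a small decision procedure, sketched here with invented per-criterion averages. The `read_rerun` function and its labels are hypothetical names for this example. In practice you make this read by eye; the code only makes the branches explicit.

```python
def read_rerun(before, after, target):
    """Classify a rerun by how each criterion's mean score moved."""
    changes = {c: after[c] - before[c] for c in before}
    if all(change > 0 for change in changes.values()):
        return "too-easy"        # everything rose: tighten a description
    if changes[target] > 0 and any(change < 0 for change in changes.values()):
        return "trade-off"       # target rose, something else fell: write a note
    if changes[target] > 0:
        return "clean-win"       # target rose, nothing else fell
    return "no-improvement"      # target did not move up: try another tweak

# Hypothetical mean scores per criterion across the seeded cases.
before = {"clarity": 4.0, "tone": 2.5, "actionability": 3.5}

# Step 2: pick the criterion where the model scored worst.
worst = min(before, key=before.get)

# Step 3: rerun after tweaking the system prompt (numbers invented).
after = {"clarity": 3.5, "tone": 3.5, "actionability": 3.5}

print(worst, read_rerun(before, after, worst))
```

Here tone rises a full point while clarity slips by half a point, so the rerun reads as a trade-off worth writing down.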

Next up

07 Multi-turn flows