LLM evaluation that does not hurt
A lightweight rubric I use to grade LLM features before users do, with examples for reasoning and tool-heavy prompts.
Evaluation does not need a research team. It needs a repeatable rubric and a way to run it on every change. My baseline starts with 25 golden paths that mirror the top user journeys and a handful of deliberately messy edge cases.
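The golden paths above can be written down as plain data. This is a minimal sketch; the `Scenario` shape, field names, and the sample journeys are illustrative, not from the original post.

```typescript
// Hypothetical sketch: a golden-path scenario pairs a real user journey
// with the constraints we expect the model to satisfy.
type Scenario = {
  id: string
  prompt: string
  mustCite: boolean     // grounding constraint
  maxTokens: number     // budget constraint
  expectedTool?: string // tool-selection constraint, if any
}

const goldenPaths: Scenario[] = [
  {
    id: "refund-lookup",
    prompt: "Why was my refund delayed?",
    mustCite: true,
    maxTokens: 400,
    expectedTool: "orders_api",
  },
  // ...more journeys, plus deliberately messy edge cases:
  { id: "empty-input", prompt: "", mustCite: false, maxTokens: 100 },
]
```

Keeping scenarios as data rather than test code makes it cheap to add a new journey every time a user surprises you.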
Scoring that is actually helpful
- Reasoning: checks for step-by-step thinking and tool selection
- Grounding: penalizes hallucinated citations or missing context
- Guardrails: red-teams prompts for safety failures and prompt injection
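One way to make those three axes actionable is to aggregate pass rates per capability rather than into a single opaque number. A sketch, with illustrative names; the capability taxonomy matches the list above but the code is an assumption:

```typescript
// Each check is tagged with the capability it exercises.
type Capability = "reasoning" | "grounding" | "guardrails"
type Check = { capability: Capability; pass: boolean }

// Aggregate pass rates per capability so a regression shows up on the
// axis where it happened, not buried in an overall average.
function scoreByCapability(checks: Check[]): Record<Capability, number> {
  const totals: Record<Capability, { pass: number; all: number }> = {
    reasoning: { pass: 0, all: 0 },
    grounding: { pass: 0, all: 0 },
    guardrails: { pass: 0, all: 0 },
  }
  for (const c of checks) {
    totals[c.capability].all += 1
    if (c.pass) totals[c.capability].pass += 1
  }
  const out = {} as Record<Capability, number>
  for (const cap of Object.keys(totals) as Capability[]) {
    const t = totals[cap]
    // No checks for a capability counts as a pass, not a failure.
    out[cap] = t.all === 0 ? 1 : t.pass / t.all
  }
  return out
}
```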
Each scenario has an expected shape rather than a single exact answer. I assert on structure (did we cite?), constraints (did we stay under budget?), and quality (did we use the right tool?).
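The three kinds of assertion can be sketched as one shape-checking function. The `ModelOutput` fields and failure labels here are assumptions for illustration:

```typescript
// Shape-based grading: assert on structure, constraints, and quality
// instead of matching one exact answer string.
type ModelOutput = {
  text: string
  citations: string[]
  tool?: string
  tokens: number
}

function checkShape(
  out: ModelOutput,
  budget: number,
  expectedTool?: string,
): string[] {
  const failures: string[] = []
  // Structure: did we cite?
  if (out.citations.length === 0) failures.push("structure: no citation")
  // Constraint: did we stay under budget?
  if (out.tokens > budget) failures.push("constraint: over token budget")
  // Quality: did we use the right tool?
  if (expectedTool && out.tool !== expectedTool) failures.push("quality: wrong tool")
  return failures
}
```

An empty failure list means the answer passed, even if its wording differs from any reference answer.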
```typescript
type EvalResult = {
  scenario: string  // which golden path ran
  score: number     // rubric score for this run
  reasoning: string // the grader's explanation of the score
  traceId: string   // links back to the full model trace
}
```

Evals run in CI on every prompt or tool change, and again nightly against samples of production traffic. The output is a dashboard that shows regressions by capability, so product can decide to ship, fix, or roll back.
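The ship/fix/roll-back call can be reduced to a diff against the last known-good run. A minimal sketch; the `regressions` helper and the 0.05 tolerance are assumptions, not part of the original rubric:

```typescript
type EvalResult = {
  scenario: string
  score: number
  reasoning: string
  traceId: string
}

// Compare the current run against a baseline run and return the
// scenarios whose score dropped by more than the tolerance.
function regressions(
  current: EvalResult[],
  baseline: EvalResult[],
  tolerance = 0.05, // assumed threshold; tune per product
): string[] {
  const base = new Map(baseline.map(r => [r.scenario, r.score]))
  return current
    .filter(r => (base.get(r.scenario) ?? 0) - r.score > tolerance)
    .map(r => r.scenario)
}
```

A non-empty result fails the CI check, which is usually enough signal to block a ship without anyone staring at the dashboard.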
The prompts, runbooks, and rollout steps referenced here are included with the full rubric.