LLM evaluation that does not hurt
A lightweight rubric I use to grade LLM features before users do, with examples for reasoning and tool-heavy prompts.
Evaluation does not need a research team. It needs a repeatable rubric and a way to run it on every change. My baseline starts with 25 golden paths that mirror the top user journeys and a handful of deliberately messy edge cases.
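The golden paths above can be written down as plain data. This is a minimal sketch; the `Scenario` shape, field names, and the sample journeys are illustrative, not from the original post.

```typescript
// Hypothetical sketch: a golden-path scenario pairs a real user journey
// with the constraints we expect the model to satisfy.
type Scenario = {
  id: string
  prompt: string
  mustCite: boolean     // grounding constraint
  maxTokens: number     // budget constraint
  expectedTool?: string // tool-selection constraint, if any
}

const goldenPaths: Scenario[] = [
  {
    id: "refund-lookup",
    prompt: "Why was my refund delayed?",
    mustCite: true,
    maxTokens: 400,
    expectedTool: "orders_api",
  },
  // ...more journeys, plus deliberately messy edge cases:
  { id: "empty-input", prompt: "", mustCite: false, maxTokens: 100 },
]
```

Keeping scenarios as data rather than test code makes it cheap to add a new journey every time a user surprises you.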
Scoring that is actually helpful
- Reasoning: checks for step-by-step thinking and tool selection
- Grounding: penalizes hallucinated citations or missing context
- Guardrails: red-teams prompts for safety failures and prompt injection
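One way to make those three axes actionable is to aggregate pass rates per capability rather than into a single opaque number. A sketch, with illustrative names; the capability taxonomy matches the list above but the code is an assumption:

```typescript
// Each check is tagged with the capability it exercises.
type Capability = "reasoning" | "grounding" | "guardrails"
type Check = { capability: Capability; pass: boolean }

// Aggregate pass rates per capability so a regression shows up on the
// axis where it happened, not buried in an overall average.
function scoreByCapability(checks: Check[]): Record<Capability, number> {
  const totals: Record<Capability, { pass: number; all: number }> = {
    reasoning: { pass: 0, all: 0 },
    grounding: { pass: 0, all: 0 },
    guardrails: { pass: 0, all: 0 },
  }
  for (const c of checks) {
    totals[c.capability].all += 1
    if (c.pass) totals[c.capability].pass += 1
  }
  const out = {} as Record<Capability, number>
  for (const cap of Object.keys(totals) as Capability[]) {
    const t = totals[cap]
    // No checks for a capability counts as a pass, not a failure.
    out[cap] = t.all === 0 ? 1 : t.pass / t.all
  }
  return out
}
```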
Each scenario has an expected shape rather than a single exact answer. I assert on structure (did we cite?), constraints (did we stay under budget?), and quality (did we use the right tool?).
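The three kinds of assertion can be sketched as one shape-checking function. The `ModelOutput` fields and failure labels here are assumptions for illustration:

```typescript
// Shape-based grading: assert on structure, constraints, and quality
// instead of matching one exact answer string.
type ModelOutput = {
  text: string
  citations: string[]
  tool?: string
  tokens: number
}

function checkShape(
  out: ModelOutput,
  budget: number,
  expectedTool?: string,
): string[] {
  const failures: string[] = []
  // Structure: did we cite?
  if (out.citations.length === 0) failures.push("structure: no citation")
  // Constraint: did we stay under budget?
  if (out.tokens > budget) failures.push("constraint: over token budget")
  // Quality: did we use the right tool?
  if (expectedTool && out.tool !== expectedTool) failures.push("quality: wrong tool")
  return failures
}
```

An empty failure list means the answer passed, even if its wording differs from any reference answer.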
```typescript
type EvalResult = {
  scenario: string  // which golden path ran
  score: number     // rubric score for this run
  reasoning: string // the grader's explanation of the score
  traceId: string   // links back to the full model trace
}
```

Evals run in CI on every prompt or tool change, and again nightly against samples of production traffic. The output is a dashboard that shows regressions by capability, so product can decide to ship, fix, or roll back.
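The ship/fix/roll-back call can be reduced to a diff against the last known-good run. A minimal sketch; the `regressions` helper and the 0.05 tolerance are assumptions, not part of the original rubric:

```typescript
type EvalResult = {
  scenario: string
  score: number
  reasoning: string
  traceId: string
}

// Compare the current run against a baseline run and return the
// scenarios whose score dropped by more than the tolerance.
function regressions(
  current: EvalResult[],
  baseline: EvalResult[],
  tolerance = 0.05, // assumed threshold; tune per product
): string[] {
  const base = new Map(baseline.map(r => [r.scenario, r.score]))
  return current
    .filter(r => (base.get(r.scenario) ?? 0) - r.score > tolerance)
    .map(r => r.scenario)
}
```

A non-empty result fails the CI check, which is usually enough signal to block a ship without anyone staring at the dashboard.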
The prompts, runbooks, and rollout steps referenced here are included with the full rubric.