AI/ML · August 22, 2024 · 9 min read
Featured

LLM evaluation that does not hurt

A lightweight rubric I use to grade LLM features before users do, with examples for reasoning and tool-heavy prompts.

llm
evaluation
quality
langchain
observability

Evaluation does not need a research team. It needs a repeatable rubric and a way to run it on every change. My baseline starts with 25 golden paths that mirror the top user journeys and a handful of deliberately messy edge cases.

Scoring that is actually helpful

  • Reasoning: checks for step-by-step thinking and tool selection
  • Grounding: penalizes hallucinated citations or missing context
  • Guardrails: red-teams prompts for safety and prompt-injection resistance
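The three dimensions above can be encoded as a small weighted rubric. This is a sketch, not the exact harness; the weights and field names are illustrative:

```typescript
// Hypothetical rubric: per-capability weights (must sum to 1), names illustrative.
type Capability = "reasoning" | "grounding" | "guardrails"

const rubric: Record<Capability, { weight: number; description: string }> = {
  reasoning: { weight: 0.4, description: "step-by-step thinking and tool selection" },
  grounding: { weight: 0.4, description: "citations present and supported by context" },
  guardrails: { weight: 0.2, description: "resists prompt injection and unsafe requests" },
}

// Weighted total for per-capability scores in [0, 1].
function totalScore(scores: Record<Capability, number>): number {
  return (Object.keys(rubric) as Capability[]).reduce(
    (sum, cap) => sum + rubric[cap].weight * scores[cap],
    0,
  )
}
```

Keeping weights explicit makes "which capability regressed" a one-line diff instead of a debate.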

Each scenario has an expected shape rather than a single exact answer. I assert on structure (did we cite?), constraints (did we stay under budget?), and quality (did we use the right tool?).

type EvalResult = {
  scenario: string
  score: number
  reasoning: string
  traceId: string
}
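A grader against an expected shape might look like the sketch below. The scenario fields and checks are illustrative, assuming the structure/constraint/quality assertions described above:

```typescript
type EvalResult = {
  scenario: string
  score: number
  reasoning: string
  traceId: string
}

// Hypothetical "expected shape" for a scenario: assertions, not an exact answer.
type Scenario = {
  name: string
  mustCite: boolean
  maxTokens: number
  expectedTool?: string
}

type ModelOutput = {
  text: string
  citations: string[]
  tokensUsed: number
  toolUsed?: string
}

function grade(s: Scenario, out: ModelOutput, traceId: string): EvalResult {
  // Each check maps to one rubric question: did we cite? stay under budget? pick the right tool?
  const checks: Array<[string, boolean]> = [
    ["cited sources", !s.mustCite || out.citations.length > 0],
    ["stayed under budget", out.tokensUsed <= s.maxTokens],
    ["used the right tool", !s.expectedTool || out.toolUsed === s.expectedTool],
  ]
  const failed = checks.filter(([, ok]) => !ok).map(([name]) => name)
  return {
    scenario: s.name,
    score: (checks.length - failed.length) / checks.length,
    reasoning: failed.length ? `failed: ${failed.join(", ")}` : "all checks passed",
    traceId,
  }
}
```

The `reasoning` string names the failed checks, so a dip in the dashboard points straight at the broken capability.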

Evals run in CI on every prompt or tool change and again nightly with production traffic samples. The output is a dashboard that shows regressions by capability so product can decide to ship, fix, or roll back.
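The ship/fix/roll-back decision reduces to comparing per-capability scores against a baseline. A minimal gate, assuming a tolerance you tune per capability:

```typescript
// Hypothetical regression gate: capability name → mean score for a run.
type CapabilityScores = Record<string, number>

// Returns the capabilities whose current score dropped below
// baseline minus tolerance; an empty result means safe to ship.
function regressions(
  baseline: CapabilityScores,
  current: CapabilityScores,
  tolerance = 0.05,
): string[] {
  return Object.keys(baseline).filter(
    (cap) => (current[cap] ?? 0) < baseline[cap] - tolerance,
  )
}
```

In CI, a non-empty result fails the build; nightly, it pages whoever owns that capability.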

Good evals make you confident enough to move fast, and humble enough to stop when the graph dips.
Key takeaways

Highlights you can reuse:

  • Golden sets beat vibes: start with 25 scenarios, not 250
  • Scorecards over black boxes: show which capability failed
  • Ship evals with CI and dashboards, not spreadsheets
Downloadable template
Copy the checklist and adapt it to your stack.

Includes prompts, runbooks, and rollout steps referenced here.
