AI Engineering
Eval & Testing for AI
"If you can't measure it, you can't improve it." Evals are the #1 gap in most AI projects — teams ship prompts based on vibes instead of data. This pod covers how to systematically test and compare AI outputs so you know when a change makes things better or worse. This is Month 5 material in your SE-to-AI roadmap.
2
Minutes
5
Concepts
+45
XP
1
Types of Evals
Accuracy (Classification / Extraction)

Compare model output against a golden label. Metrics: precision, recall, F1.

Input: "The server is down and users can't log in"
Expected: {"category": "bug", "severity": "critical"}
Actual:   {"category": "bug", "severity": "critical"}  ✅
Quality (LLM-as-Judge)

Use a stronger model (Opus) to grade a weaker model's (Sonnet) output against a rubric:

  • Define 3-5 criteria with clear scoring (1-5 scale)
  • Provide the rubric, the input, and the output to the judge
  • Aggregate scores across your test set
Safety (Red-Teaming)

Probe the model with adversarial inputs — prompt injection, jailbreaks, PII leaks. Essential before any user-facing deployment.

Latency & Cost

Track tokens consumed, time-to-first-token, total response time. Critical for production — a prompt that's 20% better but 3x more expensive might not be worth it.