Treat evals as “unit tests + launch gates.” Every release must automatically run human/auto evals and jailbreak suites; no ship if below gate. Production must have observability and rollback.
Human evaluations
- A/B pairwise with statistical significance (p < 0.05)
- Rubrics for accuracy/completeness/safety/tone
- Task success rate: percentage of actionable answers
Automated checks
- Policy classifiers and content filters
- Benchmarks: BLEU/ROUGE/EM/F1 when references exist
- Tool unit tests: parameter validation, error paths, timeouts
Red‑teaming and guardrails
Cover jailbreak, hallucination, prompt injection, privacy leaks; failed samples become test cases. Gate example: jailbreak pass‑rate ≥ 99% / harmful rate < 0.05%.
Production monitoring
Track latency, refusal/harmful rate, thumbs up/down, top failed prompts, tool error rate. Auto‑degrade/rollback on thresholds. Feed incidents back into new tests and policy updates.
Release flow
- Offline regression (human/auto/jailbreak) → pass gates.
- Canary 1–5% → stabilize metrics.
- Ramp up → define rollback thresholds and bypasses.