AI Patterns introduced eval — a few hand-checked examples. Eval at scale is the same idea, more cases, automated pass rate. The shape:
```python
eval_cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

passes = 0
for query, expected in eval_cases:
    answer = pipeline(query)
    if expected.lower() in answer.lower():
        passes += 1

rate = passes / len(eval_cases)
print(f"pass rate: {rate:.0%}")
assert rate >= 0.66, f"regression — {passes}/{len(eval_cases)} passed"
```

Substring matching is good enough?
For factual recall — yes, surprisingly often. "Did the answer mention Paris?" is a useful signal. For style / tone / multi-fact answers, you'd need richer checks (regex, structured-output validation, or a second LLM as judge — though we don't use llm_judge in this curriculum).
And we run this on every prompt change?
Yes. The eval suite is the seatbelt — it catches regressions before they ship. CI runs the eval and blocks the merge if the pass rate drops.
```python
def run_eval(pipeline, cases, threshold=0.66):
    passes = 0
    failures = []
    for query, expected in cases:
        answer = pipeline(query).lower()
        if expected.lower() in answer:
            passes += 1
        else:
            failures.append((query, expected, answer))
    rate = passes / len(cases)
    print(f"{passes}/{len(cases)} passed ({rate:.0%})")
    for q, exp, ans in failures:
        print(f"  FAIL: {q!r} expected {exp!r} in answer")
    if rate < threshold:
        raise AssertionError(f"eval below threshold: {rate:.0%} < {threshold:.0%}")
    return rate
```

Each case is a `(query, expected_substring)` pair. The simplest matcher: is the expected substring in the answer? Cheap to write, cheap to check. It captures "the model answered correctly" for fact-recall queries.
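To see the gate in action, here is a minimal usage sketch. `fake_pipeline` is a stand-in for whatever function wraps your LLM call; it is not part of the lesson's code.

```python
# A stand-in pipeline for demonstration — real code would call your LLM here.
def fake_pipeline(query: str) -> str:
    canned = {
        "What's the capital of France?": "The capital of France is Paris.",
        "What's the largest ocean?": "That would be the Pacific Ocean.",
        "Where is Mount Everest?": "Mount Everest sits in the Himalayas.",
    }
    return canned.get(query, "I don't know.")

cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

# Prints "3/3 passed (100%)" and returns 1.0; a drop below the
# threshold would raise AssertionError instead.
run_eval(fake_pipeline, cases)
```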
When substring isn't enough (a few of these are sketched after the table):
| Matcher | When |
|---|---|
| Regex | Date format, ID format, structured output |
| `json.loads` + key check | The answer is JSON; check a field is present |
| Set membership | Answer is one of an allowed set |
| Multi-substring (all must appear) | Multi-fact answers |
| Pydantic validation | Typed structured output |
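To make the table concrete, here are minimal matcher sketches. The function names and arguments are illustrative, not part of the lesson's code; each returns True/False so it can slot into `run_eval`'s pass/fail loop with a small change to the case format.

```python
import json
import re

# Regex matcher: the answer must contain an ISO-style date (illustrative pattern).
def matches_regex(answer: str, pattern: str = r"\d{4}-\d{2}-\d{2}") -> bool:
    return re.search(pattern, answer) is not None

# JSON matcher: the answer must parse as a JSON object with a required key.
def has_json_key(answer: str, key: str) -> bool:
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and key in parsed

# Set membership: the normalized answer must be one of an allowed set.
def in_allowed_set(answer: str, allowed: set[str]) -> bool:
    return answer.strip().lower() in allowed

# Multi-substring: every expected fact must appear somewhere in the answer.
def all_present(answer: str, expected_parts: list[str]) -> bool:
    lowered = answer.lower()
    return all(part.lower() in lowered for part in expected_parts)
```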
The assertion `rate >= threshold` is what makes the suite block on regression. Without it, the eval is just a printout — a human has to notice. With it, CI or a build pipeline catches the regression for you.
A 100% threshold is brittle (a single LLM flake fails the suite); a 50% threshold lets real bugs through. 66–80% is typical: strict enough to catch regressions, lenient enough to absorb LLM stochasticity.
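One way to wire this into CI is a plain pytest test that calls `run_eval`; because `run_eval` raises `AssertionError` below the threshold, a regression fails the test and the merge is blocked. The module paths below are assumptions, not the curriculum's layout.

```python
# test_eval.py — a sketch of a CI gate; the import paths are hypothetical
# and stand in for wherever your run_eval helper and pipeline live.
from myproject.evals import run_eval
from myproject.rag import pipeline

EVAL_CASES = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

def test_eval_suite():
    # run_eval raises AssertionError if the pass rate drops below the
    # threshold, so a regression fails this test and CI blocks the merge.
    run_eval(pipeline, EVAL_CASES, threshold=0.66)
```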
Real eval suites have hundreds of cases. Each is one LLM call. Run the full suite on every prompt change → expensive. Common pattern: a small smoke subset on every change, the full suite nightly or before a release.
For this lesson we run a 3-case smoke (cost-aware authoring) — same shape, smaller numbers.
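A sketch of that split, assuming all cases live in one list and an environment variable picks the tier; the variable name and the 10-case smoke size are illustrative, not prescribed by the lesson.

```python
import os

# full_cases would hold the hundreds of real cases; three stand in for them here.
full_cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
    # ... hundreds more in a real suite
]

# Smoke tier: a small, fast subset on every prompt change.
# Full tier: everything, run nightly or before a release.
tier = os.environ.get("EVAL_TIER", "smoke")
cases = full_cases[:10] if tier == "smoke" else full_cases

run_eval(pipeline, cases)
```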