L13 was one output checked against multiple criteria. Today: multiple outputs checked against expected answers. Five inputs, five expected labels — score how many your prompt got right.
```python
from pydantic_ai import Agent

# `model` is provided by the lesson runtime (a model identifier string).
agent = Agent(model)

cases = [
    ("This is amazing!", "positive"),
    ("Worst experience ever.", "negative"),
    ("It was fine I guess.", "neutral"),
    ("Loved every minute.", "positive"),
    ("Total waste of time.", "negative"),
]

def classify(text):
    # Normalize the reply: strip whitespace, a trailing period, and case.
    return agent.run_sync(
        f'Classify the sentiment of this text as exactly one word: "positive", "negative", or "neutral". Reply with only the single word.\n\nText: {text}'
    ).output.strip().strip(".").lower()

pass_count = 0
for text, expected in cases:
    got = classify(text)
    ok = got == expected
    pass_count += ok  # True counts as 1
    print(f"  {'PASS' if ok else 'FAIL'} — got={got!r} expected={expected!r} — {text}")

print(f"\n{pass_count} / {len(cases)} passed")
assert pass_count >= 3, f"only {pass_count}/5 passed"
```

Each case has a known expected answer. The eval is `got == expected`, summed.
Right. The suite is your safety net when you change a prompt — re-run it, see how many cases still pass, decide whether the change is an improvement or a regression.
Why ≥ 3 instead of all 5?
LLMs are non-deterministic. A pass-rate threshold (3/5 = 60%) makes the verification stable across runs. Real eval suites set thresholds per cost/severity — for safety-critical classifiers you want 99%; for opinion mining 80% might be fine. Today's threshold is loose because we want the lesson to pass; production would tighten.
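To put numbers on that: with five independent cases, the binomial distribution tells you how often a suite clears each bar. A standard-library sketch (the 90% per-case accuracy is an assumed figure for illustration, not a measurement):

```python
from math import comb

def p_at_least(k, n, p):
    """P(at least k of n independent cases pass, each with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assume a prompt that gets each case right 90% of the time (illustrative).
for k in range(1, 6):
    print(f"P(>= {k}/5 pass) = {p_at_least(k, 5, 0.9):.3f}")
# Requiring 5/5 passes only ~59% of runs; requiring >= 3/5 passes ~99%,
# which is why the loose bar keeps verification stable.
```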
```python
cases = [(input_1, expected_1), ..., (input_n, expected_n)]

pass_count = 0
for inp, exp in cases:
    got = run(inp)
    if matches(got, exp):
        pass_count += 1

pass_rate = pass_count / len(cases)
assert pass_rate >= THRESHOLD
```

A list of (input, expected) pairs + a runner + a comparator. The score is the fraction passing.
| Piece | Role |
|---|---|
| `cases` | The dataset — fixed. Becomes your regression suite. |
| `run(input)` | Calls the LLM (or runs the agent) under your current prompt. |
| `matches(got, expected)` | The comparator. Equality, regex, fuzzy match, range. |
| `THRESHOLD` | Acceptable pass rate. Higher = stricter. |
LLMs are sampled. Even a perfect prompt fails occasionally. Aim for a threshold that absorbs routine sampling noise but still catches real regressions.
For lessons we set the bar low so the verification is stable. For production, calibrate over many runs to find your steady-state pass rate, then alarm when it drops.
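One way to do that calibration, as a sketch (`run_suite` is a stand-in for whatever executes your eval once and returns a pass rate; the 20 runs and 2σ margin are illustrative choices, not a prescription):

```python
import statistics

def calibrate_threshold(run_suite, n_runs=20, n_sigmas=2.0):
    """Run the suite repeatedly, then place the alarm threshold a couple of
    standard deviations below the observed steady-state pass rate."""
    rates = [run_suite() for _ in range(n_runs)]  # each run returns pass_count / len(cases)
    return max(0.0, statistics.mean(rates) - n_sigmas * statistics.stdev(rates))

# e.g. THRESHOLD = calibrate_threshold(my_eval)  # `my_eval` is hypothetical
```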
| Comparator | When to use |
|---|---|
| `got == expected` | Exact label, classifier with a closed set |
| `expected.lower() in got.lower()` | Loose substring, allows wrapping prose |
| Regex | Format checks (phone, date) |
| Numeric tolerance | Math problems where rounding matters |
| Set equality | Order-independent lists |
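A minimal sketch of the last three comparators in plain Python (the helper names are ours, not part of any framework):

```python
import math
import re

def matches_regex(got, pattern):
    # Format check, e.g. an ISO date anywhere in the reply.
    return re.search(pattern, got) is not None

def matches_numeric(got, expected, tol=1e-2):
    # Numeric tolerance: parse the first number in the reply, compare within tol.
    m = re.search(r"-?\d+(?:\.\d+)?", got)
    return m is not None and math.isclose(float(m.group()), expected, abs_tol=tol)

def matches_set(got, expected_items):
    # Order-independent list: split on commas, compare as sets.
    return {item.strip().lower() for item in got.split(",")} == set(expected_items)

print(matches_regex("Due on 2024-05-01.", r"\d{4}-\d{2}-\d{2}"))  # True
print(matches_numeric("The answer is 3.14159", 3.14))             # True
print(matches_set("red, green, blue", {"red", "green", "blue"}))  # True
```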
Why not pytest / unittest? The runtime doesn't ship those frameworks, so we write the loop in plain Python. The structure is identical to what `pytest.mark.parametrize` produces — a fixture list + assertion — without the framework overhead.
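For comparison, if your environment does ship pytest, the same suite is a few lines (a sketch reusing the `cases` list and `classify` from above):

```python
import pytest

@pytest.mark.parametrize("text,expected", cases)
def test_sentiment(text, expected):
    assert classify(text) == expected
```

Note the threshold logic doesn't carry over for free: each parametrized case becomes its own pass/fail test, so a 3-of-5 bar needs extra bookkeeping on top.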
Five sentiment classification cases. Verification asserts at least 3 pass. Realistically the model will pass 4 or 5; the loose bar absorbs sampling noise.