"Did the LLM do well?" is a vague question. Did the output meet these specific criteria? is a question your code can answer. Today: a tiny eval checklist for one output.
```python
from pydantic_ai import Agent

# Any pydantic_ai model string works here; swap in your own.
agent = Agent("openai:gpt-4o")

prompt = "Write a one-sentence elevator pitch for a library book club. Mention the word 'book'."
output = agent.run_sync(prompt).output.strip()

# Three criteria
criteria = {
    "length_ok": 5 <= len(output.split()) <= 30,  # 5-30 words
    "ends_with_period": output.endswith("."),     # format check
    "contains_book": "book" in output.lower(),    # keyword check
}

for name, passed in criteria.items():
    print(f"  {'PASS' if passed else 'FAIL'} — {name}")

assert all(criteria.values()), f"failed: {criteria}"
```

Each criterion is a Python check. Pass/fail is a bool. The eval is a dict.
Right. The structure scales. Three criteria today; thirty if you're shipping serious software. Each one is a small predicate — easy to write, easy to debug, free to run.
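One way to keep thirty criteria manageable is to register each predicate once and build the dict in a loop. A sketch, not part of the lesson's code; the `CHECKS` and `evaluate` names are ours:

```python
# Illustrative sketch: register predicates once, evaluate in one pass.
CHECKS = {
    "length_ok": lambda out: 5 <= len(out.split()) <= 30,
    "ends_with_period": lambda out: out.endswith("."),
    "contains_book": lambda out: "book" in out.lower(),
}

def evaluate(output: str) -> dict[str, bool]:
    """Run every registered predicate against one output."""
    return {name: check(output) for name, check in CHECKS.items()}
```

Criterion number thirty is one more entry in the dict; nothing else changes.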
What if a criterion fails — do I retry?
For single-output evaluation, you might retry (with feedback, like in L11). For suite evaluation (next week), you don't retry — you record the failure and use it to drive prompt improvements. Evaluation tells you what's broken; iteration fixes it.
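A minimal retry-with-feedback sketch, assuming the `agent`, `prompt`, and three-criterion checklist from the example above (the single-retry cap and the feedback wording are our choices, not prescribed):

```python
# Sketch: retry once, folding the failed criteria back into the prompt.
def check(output: str) -> dict[str, bool]:
    return {
        "length_ok": 5 <= len(output.split()) <= 30,
        "ends_with_period": output.endswith("."),
        "contains_book": "book" in output.lower(),
    }

output = agent.run_sync(prompt).output.strip()
for attempt in range(2):  # one retry, then give up
    failures = [name for name, passed in check(output).items() if not passed]
    if not failures:
        break
    feedback = f"{prompt}\n\nYour last draft failed these checks: {', '.join(failures)}. Fix them."
    output = agent.run_sync(feedback).output.strip()
```

Stripped of specifics, the pattern generalizes: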
```python
output = call_llm(prompt)

criteria = {
    "check_1": predicate_1(output),
    "check_2": predicate_2(output),
    "check_3": predicate_3(output),
}

score = sum(criteria.values()) / len(criteria)  # pass rate
```

A dict of name → bool. Pass rate is sum / len. Trivial to compute, trivial to log.
| Family | Examples |
|---|---|
| Length | `5 <= words <= 30`, `chars < 500` |
| Format | ends with period, JSON-parseable, regex match |
| Content | contains required keyword, mentions specific entity |
| Negative | does NOT contain blocked terms |
| Structure | required keys present, nested types correct |
| Numeric | sum of fields ≈ expected, count of items in range |
All deterministic. All cheap.
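To make the families concrete, here is one predicate per row. A sketch only; the blocklist, the `{"title", "pitch"}` schema, and the `line_items`/`total` fields are invented for illustration:

```python
import json

BLOCKED = {"guarantee", "free money"}  # illustrative blocklist

def length_ok(out: str) -> bool:  # Length
    return 5 <= len(out.split()) <= 30

def is_json_object(out: str) -> bool:  # Format
    try:
        return isinstance(json.loads(out), dict)
    except json.JSONDecodeError:
        return False

def mentions_book(out: str) -> bool:  # Content
    return "book" in out.lower()

def no_blocked_terms(out: str) -> bool:  # Negative
    return not any(term in out.lower() for term in BLOCKED)

def has_required_keys(out: str) -> bool:  # Structure
    try:
        data = json.loads(out)
    except json.JSONDecodeError:
        return False
    return {"title", "pitch"} <= data.keys()

def totals_consistent(out: str) -> bool:  # Numeric: sum of fields ≈ expected
    try:
        data = json.loads(out)
        return abs(sum(data["line_items"]) - data["total"]) < 0.01
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
```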
We explicitly do not use an LLM as the evaluator. (See feedback_no_llm_judge.md.) The reasons follow from the table above: deterministic checks are free to run, reproducible, and easy to debug; an LLM judge is none of those, and it reintroduces the very unreliability you are trying to measure.
If a criterion seems to need an LLM judge ("is this response polite?"), narrow it instead: write a deterministic check that approximates politeness (no all-caps shouting, no words from a blocklist, not hostile-short in length).
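A sketch of that politeness approximation; the blocklist and the length threshold are placeholders you would tune:

```python
RUDE_WORDS = {"stupid", "idiot"}  # placeholder blocklist; extend for real use

def approx_polite(out: str) -> bool:
    """Deterministic stand-in for 'is this response polite?'."""
    not_shouting = not out.isupper()                       # no all-caps
    clean = not any(w in out.lower() for w in RUDE_WORDS)  # no blocked words
    not_curt = len(out.split()) >= 4                       # not hostile-short
    return not_shouting and clean and not_curt
```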
One output. Three deterministic checks. All three must pass. Foundation for next week's eval suite (input/expected pairs + the same kind of checking).
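For a rough sense of where that goes, the same checking run over input/expected pairs might look like this. An illustrative preview only; the field names are ours, not next week's actual code, and `agent` is the one defined above:

```python
# A suite is just cases + the same deterministic checks.
cases = [
    {"prompt": "Pitch a library book club.", "must_contain": "book"},
    {"prompt": "Pitch a cooking class.", "must_contain": "cook"},
]

results = []
for case in cases:
    out = agent.run_sync(case["prompt"]).output.strip()
    results.append(case["must_contain"] in out.lower())

print(f"pass rate: {sum(results)}/{len(results)}")
```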
"Did the LLM do well?" is a vague question. Did the output meet these specific criteria? is a question your code can answer. Today: a tiny eval checklist for one output.
from pydantic_ai import Agent
prompt = "Write a one-sentence elevator pitch for a library book club. Mention the word 'book'."
output = Agent(model).run_sync(prompt).output.strip()
# Three criteria
criteria = {
"length_ok": 5 <= len(output.split()) <= 30, # 5-30 words
"ends_with_period": output.endswith("."), # format check
"contains_book": "book" in output.lower(), # keyword check
}
for name, passed in criteria.items():
print(f" {'PASS' if passed else 'FAIL'} — {name}")
assert all(criteria.values()), f"failed: {criteria}"Each criterion is a Python check. Pass/fail is bool. The eval is a dict.
Right. The structure scales. Three criteria today; thirty if you're shipping serious software. Each one is a small predicate — easy to write, easy to debug, free to run.
What if a criterion fails — do I retry?
For single-output evaluation, you might retry (with feedback, like in L11). For suite evaluation (next week), you don't retry — you record the failure and use it to drive prompt improvements. Evaluation tells you what's broken; iteration fixes it.
output = call_llm(prompt)
criteria = {
"check_1": predicate_1(output),
"check_2": predicate_2(output),
"check_3": predicate_3(output),
}
score = sum(criteria.values()) / len(criteria) # pass rateA dict of name → bool. Pass rate is sum / len. Trivial to compute, trivial to log.
| Family | Examples |
|---|---|
| Length | 5 <= words <= 30, chars < 500 |
| Format | ends with period, JSON-parseable, regex match |
| Content | contains required keyword, mentions specific entity |
| Negative | does NOT contain blocked terms |
| Structure | required keys present, nested types correct |
| Numeric | sum of fields ≈ expected, count of items in range |
All deterministic. All cheap.
We explicitly do not use an LLM as the evaluator. (See feedback_no_llm_judge.md.) Reasons:
If a criterion seems to need an LLM judge ("is this response polite?"), narrow the lesson — write a deterministic check that approximates politeness (no all-caps, no swearwords from a list, length not hostile-short).
One output. Three deterministic checks. All three must pass. Foundation for next week's eval suite (input/expected pairs + the same kind of checking).
Create a free account to get started. Paid plans unlock all tracks.