Pass/fail is binary. A rubric assigns weights — some criteria matter more than others — and produces a numeric score.
```python
from pydantic_ai import Agent

# `model` comes from earlier setup; any pydantic_ai model identifier works here.
prompt = 'Write a one-sentence definition of "recursion" for a programming beginner. Mention the word "itself".'
output = Agent(model).run_sync(prompt).output.strip()

rubric = [
    # (name, weight, predicate)
    ("length_ok", 0.2, 8 <= len(output.split()) <= 30),
    ("ends_period", 0.1, output.endswith(".")),
    ("mentions_itself", 0.4, "itself" in output.lower()),
    ("single_sentence", 0.3, output.count(".") <= 1 and output.count("!") == 0 and output.count("?") == 0),
]

total_weight = sum(w for _, w, _ in rubric)
score = sum(w * int(p) for _, w, p in rubric) / total_weight

for name, weight, passed in rubric:
    mark = "PASS" if passed else "FAIL"
    print(f"  {mark} (weight {weight:.1f}) — {name}")
print(f"\nWeighted score: {score:.2f}")
```

Each criterion has a weight. The score is a weighted average of pass/fail. Higher-weight criteria contribute more.
Right. "Mentions itself" is the most important property here (weight 0.4); length and punctuation matter less. The total weight sums to 1.0, so the score is in [0, 1]. Easier to reason about than "3 out of 5" when criteria differ in importance.
When does this beat the simple checklist from L13?
When the criteria aren't equal. For "is the output safe?" you might have one critical criterion (no PII) at weight 0.9 and three nice-to-haves at weight 0.033 each. A flat checklist doesn't capture that priority. A weighted rubric does — and the score signals "close to acceptable" vs "way off" instead of just binary fail.
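A sketch of that safety rubric, assuming a hypothetical `output` string and a crude regex stand-in for PII detection (a real detector would be far more thorough):

```python
import re

output = "Contact our support team and we will help with your account."  # hypothetical

def looks_free_of_pii(text: str) -> bool:
    # Crude stand-in: flag email addresses or long digit runs (phone/ID-like).
    return not re.search(r"[\w.+-]+@[\w-]+\.\w+|\d{7,}", text)

safety_rubric = [
    ("no_pii",         0.9,   looks_free_of_pii(output)),
    ("polite_tone",    0.033, any(w in output.lower() for w in ("please", "thanks", "happy to"))),
    ("under_50_words", 0.033, len(output.split()) <= 50),
    ("ends_period",    0.033, output.endswith(".")),
]

score = sum(w * int(p) for _, w, p in safety_rubric) / sum(w for _, w, _ in safety_rubric)
print(f"{score:.2f}")  # ~0.97 here; a PII leak alone would cap it near 0.10
```

The general shape is the same whatever the domain: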
```python
rubric = [
    (name_1, weight_1, predicate_1),
    (name_2, weight_2, predicate_2),
    ...
]
score = sum(w * passed for _, w, passed in rubric) / sum(w for _, w, _ in rubric)
```

A list of (name, weight, bool) tuples. The weighted average is your score in [0, 1].
| Use case | Why weights help |
|---|---|
| Safety-critical — must-haves vs nice-to-haves | Heavy weight on safety criteria; minor weight on style |
| Iteration — track which criteria are improving | Weighted score captures whether high-priority items are getting better |
| Comparison across prompt versions | Numeric score is comparable; pass/fail is too coarse |
| Threshold setting — "acceptable" depends on which criteria pass | A 0.9 score with safety criteria passing is very different from 0.9 with safety failing |
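That last row is worth making concrete: when some criteria are must-pass, gate on them explicitly instead of trusting the aggregate score alone. A sketch, assuming the same (name, weight, passed) tuples and a hypothetical set of critical criterion names:

```python
CRITICAL = {"no_pii"}  # hypothetical must-pass criteria

def acceptable(rubric, threshold=0.8):
    score = sum(w * int(p) for _, w, p in rubric) / sum(w for _, w, _ in rubric)
    critical_ok = all(p for name, _, p in rubric if name in CRITICAL)
    # A high score only counts if every critical criterion also passed.
    return critical_ok and score >= threshold
```

With that gate, a 0.9 that rides on style criteria while PII leaks through doesn't sneak past the threshold.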
Rubrics are not LLM judges. We're not asking the LLM "score this output 1-10" — that's drift-prone and expensive. We're using deterministic Python predicates with weights to shape the aggregate score. Reproducible. Free.
One LLM-generated definition of "recursion". Four criteria with weights summing to 1.0. Compute the weighted score. Bind it to score. Verification asserts score >= 0.5 (the prompt asks for the keyword, so it should pass mentions_itself reliably).