Yesterday: a suite that scores. Today: use the suite to drive prompt improvement. The discipline: change one thing, measure, keep what helped.
```python
from pydantic_ai import Agent

# Model name is an assumption for illustration; use whichever model your project runs on.
model = "openai:gpt-4o-mini"
agent = Agent(model)

# Labeled suite: (input sentence, expected one-word answer).
cases = [
    ("the cat sleeps", "animal"),
    ("the engine roared", "machine"),
    ("the puppy wagged its tail", "animal"),
]

def score_prompt(prompt_template: str) -> int:
    """Run every case through the model and count exact-match passes."""
    passed = 0
    for sentence, expected in cases:
        result = agent.run_sync(prompt_template.format(sentence=sentence))
        out = result.output.strip().strip(".").lower()
        if out == expected:
            passed += 1
    return passed

before_template = "What is this about? {sentence}"
after_template = (
    "Classify the subject of this sentence as exactly one word: "
    '"animal" or "machine". Reply with only the single word.\n\n'
    "Sentence: {sentence}"
)

before = score_prompt(before_template)
after = score_prompt(after_template)
print(f"BEFORE: {before} / {len(cases)} passed")
print(f"AFTER:  {after} / {len(cases)} passed")
```

The vague "What is this about?" prompt invites free-form answers, so they rarely match "animal" or "machine" exactly. The tightened prompt forces the closed-set output, and the exact-match check passes.
Right. The eval-driven loop: 1) write prompt, 2) run suite, 3) inspect failures, 4) tweak one thing, 5) re-run. Repeat until pass rate is acceptable. The suite is the ground truth — it tells you whether your tweak helped.
What if I tweak two things at once and the score improves?
You can't attribute the win. One change at a time is the discipline. If the score got better, was it the constraint to a closed set, or the explicit format instruction, or both? The suite can't tell you. So you only change one variable per iteration.
1. Write prompt v1.
2. Run the eval suite; score it (e.g., 2/5).
3. Inspect failures: what's going wrong?
4. Change ONE thing (sharper instruction, an example, the output format).
5. Run the eval suite again; score it (e.g., 4/5).
6. Keep the change if the score improved, revert if it got worse.
7. Loop back to step 2.
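A minimal sketch of that keep-or-revert loop, assuming the `score_prompt`, `cases`, `before_template`, and `after_template` definitions from the example above. The candidate list and its change label are illustrative, not part of the lesson's suite; the point is that each candidate differs from the current best by exactly one change.

```python
# Hypothetical harness for the keep-or-revert loop (assumes the definitions above).
best_template = before_template
best_score = score_prompt(best_template)
print(f"v1 baseline: {best_score}/{len(cases)}")

# Each candidate applies ONE change to the previous best (labels are illustrative).
candidates = [
    ("constrain to the closed set animal/machine", after_template),
]

for change, template in candidates:
    score = score_prompt(template)
    if score > best_score:
        best_template, best_score = template, score      # keep the change
        print(f"KEEP   '{change}': {score}/{len(cases)}")
    else:
        print(f"REVERT '{change}': {score}/{len(cases)}")  # stick with the previous best
```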
The eval suite is the judge. Your prompt iteration is the experiment. Without the judge, prompt iteration is vibes; with it, the iteration is empirical.
| Anti-pattern | Better |
|---|---|
| Rewrite the whole prompt | Add ONE constraint, re-test |
| Add three examples and a system prompt | Add one example, re-test |
| Tweak temperature AND prompt | Tweak one, then the other |
Isolate the variable. Otherwise you can't attribute the improvement, and you'll later re-add harmful changes thinking they helped.
After a run, look at the FAIL cases; they show what's going wrong and point to the one change worth trying next.
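One way to do that inspection, assuming the `agent`, `cases`, and `before_template` names from the example above; `report_failures` is a hypothetical helper, not part of the lesson's code.

```python
def report_failures(prompt_template: str) -> None:
    """Print every case the prompt fails, with expected vs. actual output."""
    for sentence, expected in cases:
        raw = agent.run_sync(prompt_template.format(sentence=sentence)).output
        got = raw.strip().strip(".").lower()
        if got != expected:
            print(f"FAIL: {sentence!r}\n  expected: {expected}\n  got:      {raw!r}")

report_failures(before_template)  # e.g., shows the free-form answers the vague prompt produces
```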
When do you stop iterating? When the marginal improvement isn't worth the time. If iteration 5 took you from 8/10 to 9/10 in two hours, that's good. If iteration 6 takes another two hours to maybe reach 9.1/10, ship it.
Three cases. Two prompt versions. The vague version probably scores 0-1; the tight version probably scores 3. Verification asserts after >= before and after >= 2.
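A sketch of those checks, assuming `before`, `after`, and `cases` hold the values from the run above:

```python
# Verification: the tightened prompt should never score worse than the vague one,
# and should pass at least 2 of the 3 cases (thresholds from the text above).
assert after >= before, f"regression: after={after} < before={before}"
assert after >= 2, f"tight prompt passed only {after}/{len(cases)} cases"
print("verification passed")
```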