Once an LLM is feeding into other code or systems, "mostly correct" stops being acceptable. Validate the output with deterministic Python before you act on it.
```python
import re
from pydantic_ai import Agent

PHONE_RE = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")  # "(123) 456-7890"

# "model" is assumed to be defined elsewhere, e.g. a pydantic-ai model name like "openai:gpt-4o"
base = 'Generate a fake US phone number in the format "(XXX) XXX-XXXX". Reply with just the number, nothing else.'
prompt = base
for attempt in range(3):
    out = Agent(model).run_sync(prompt).output.strip()
    if PHONE_RE.match(out):
        break
    prompt = base + f'\nThe previous answer "{out}" did not match the format. Reply with EXACTLY the format (XXX) XXX-XXXX, no other text.'
else:
    raise ValueError(f"could not produce valid phone after 3 attempts: {out!r}")

print(out)
```

The validator is `PHONE_RE.match`. The retry message tells the model what failed. Three tries, then bail.
Right. The validator is pure Python — fast, deterministic, free. The retries are LLM calls — they cost quota, so cap them. Three is usually enough; if you need five, the original prompt is the problem, not the model.
What kinds of validators are useful?
Anything you can express as a Python check. Regex match. JSON parse + key check. Length within bounds. Value in allowed set. Sum of fields equals total. Custom rules. The validator can be one line or twenty — the pattern is the same: validate, retry on failure with feedback, cap the loop.
```python
for attempt in range(MAX):
    out = call(prompt)
    if valid(out):
        break
    prompt = clarify(prompt, out)
else:
    raise ValueError("all attempts produced invalid output")
```

Three pieces:
- `valid(out)`: returns True if the output is acceptable.
- `clarify(prompt, out)`: rebuilds the prompt with feedback about what was wrong.
- A cap of MAX attempts. After that, fail loudly.

| Validator | Catches |
|---|---|
| Regex match | Format (phone, date, ID) |
| `json.loads` succeeds | JSON parse |
| Required keys present + types correct | Schema match |
| Value in allowed set | Closed-set classification |
| Length in range | Prevents runaway responses |
| Custom: sum of fields equals total | Internal consistency |
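A couple of the table rows as concrete checks. This is a sketch: the key names, the label set, and the items/total fields are made up for illustration, not taken from the example above.

```python
import json

def valid_summary(out: str) -> bool:
    """JSON parse + key check: must parse and carry the required keys with the right types."""
    try:
        data = json.loads(out)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("title"), str) and isinstance(data.get("score"), int)

ALLOWED = {"positive", "negative", "neutral"}  # hypothetical closed set of labels

def valid_label(out: str) -> bool:
    """Closed-set classification: the answer must be one of the allowed labels."""
    return out.strip().lower() in ALLOWED

def valid_totals(out: str) -> bool:
    """Internal consistency: the line items must sum to the stated total."""
    try:
        data = json.loads(out)
        return sum(data["items"]) == data["total"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
```

Each of these slots straight into the `valid(out)` position of the loop above.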
Generic "please try again" rarely helps. Telling the model what was wrong:
> The previous answer "123-456-7890" did not match the format (XXX) XXX-XXXX — note the parentheses around the area code.
...points the model at the specific fix. Concrete > vague.
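One way to package that feedback is a small helper that rebuilds the prompt from the base instruction and the rejected output. The function below is illustrative (it hard-codes the phone format from the earlier example); it is not part of the snippet above.

```python
EXPECTED = "(XXX) XXX-XXXX"  # the format string from the phone example

def clarify(base: str, bad_output: str) -> str:
    """Quote the bad answer and restate the exact expected format."""
    return (
        base
        + f'\nThe previous answer "{bad_output}" did not match the format. '
        + f"Reply with EXACTLY the format {EXPECTED}, no other text."
    )
```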
Unbounded retries can burn quota on calls that will never succeed and hang the pipeline indefinitely. Three retries is the sweet spot. If three didn't work, your prompt is the issue.
LLMs sample probabilistically. The same prompt produces different outputs each call. The single biggest reliability upgrade is wrapping every LLM call in a validator — cheap to write, free to run, catches the 5% drift that would otherwise propagate downstream.
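The whole pattern fits in one reusable wrapper. A sketch, assuming pydantic-ai's `Agent.run_sync(...).output` as used in the example above; the function name and parameters here are mine, not a library API.

```python
from typing import Callable
from pydantic_ai import Agent

def call_validated(
    agent: Agent,
    base: str,
    valid: Callable[[str], bool],
    clarify: Callable[[str, str], str],
    max_attempts: int = 3,
) -> str:
    """Call the LLM, validate with pure Python, retry with feedback, cap the loop."""
    prompt = base
    out = ""
    for attempt in range(max_attempts):
        out = agent.run_sync(prompt).output.strip()
        if valid(out):
            return out
        prompt = clarify(base, out)
    raise ValueError(f"all {max_attempts} attempts produced invalid output: {out!r}")
```

With the phone example: `call_validated(Agent(model), base, lambda s: bool(PHONE_RE.match(s)), clarify)`.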
Week 4's retry-on-bad-output extends this with custom validation functions and pydantic-ai's native `output_type=` support.
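For orientation, the native route looks roughly like this. A minimal sketch, assuming a pydantic-ai version that accepts `output_type=` and exposes `.output`; the `Invoice` schema is hypothetical.

```python
from pydantic import BaseModel
from pydantic_ai import Agent

class Invoice(BaseModel):  # hypothetical schema, for illustration only
    title: str
    total: float

# pydantic-ai parses the reply against the schema; a reply that can't be coerced into it
# is rejected rather than silently passed downstream.
agent = Agent("openai:gpt-4o", output_type=Invoice)
invoice = agent.run_sync("Extract the invoice fields from: ...").output  # an Invoice instance
```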