Day 12 (parse failures) covered the basic pattern: try to parse, catch the error, re-call with stricter wording. Today, generalise to any shape mismatch — not just JSON parse errors.
The pattern: validate the response, retry up to N times, each retry tightens the prompt:
```python
from pydantic_ai import Agent

ALLOWED = {"positive", "negative"}
base_prompt = 'Classify: "It was fine." Reply with one word: positive or negative.'

prompt = base_prompt
for attempt in range(3):
    out = Agent(model).run_sync(prompt).output.strip().strip(".").lower()
    if out in ALLOWED:
        break
    prompt = base_prompt + f'\nThe previous answer "{out}" was not in the allowed set. Reply with EXACTLY one of: positive, negative.'
else:
    raise ValueError(f"could not get valid label after 3 attempts: {out!r}")
print(out)
```

The retry message tells the model what was wrong — that's the trick?
Yes. Generic "please try again" rarely helps. Telling the model exactly what was wrong ("the previous answer 'maybe' was not in the allowed set") points it at the fix. Then the second call usually succeeds.
When is retry the right move vs a different prompt entirely?
If you need 2-3 retries to get a valid response, the prompt is the problem, not the model. Rewrite the original. Retry is for the occasional drift, not the systematic failure.
```python
for attempt in range(MAX):
    out = call(prompt)
    if valid(out):
        break
    prompt = clarify(prompt, out)  # tell the model what went wrong
else:
    raise ValueError("all attempts produced invalid output")
```

Three pieces:
- A validator: returns True if the output is acceptable.
- A clarifier: tightens the prompt by telling the model what went wrong.
- A cap: MAX attempts. After that, fail loudly so you can fix the prompt.

| Validator | What it catches |
|---|---|
| json.loads succeeds | Output is valid JSON |
| Required keys present | Schema match |
| Values in allowed set | Closed-set classification |
| Output length within range | Prevents runaway responses |
| Output matches regex | Date format, ID format, etc. |
| Custom rules | Sum of fields equals total, etc. |
All of these are pure Python. The validator runs in your code; the model just produces text.
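A sketch of what a few of those validators can look like in plain Python (the function names here are illustrative, not from any library):

```python
import json
import re

def valid_json_with_keys(text, required_keys):
    """True if text parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

def valid_label(text, allowed):
    """True if the cleaned-up text is one of the allowed labels."""
    return text.strip().strip(".").lower() in allowed

def valid_date(text):
    """True if text has ISO date shape (YYYY-MM-DD)."""
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text.strip()) is not None

print(valid_json_with_keys('{"label": "positive"}', ["label"]))  # True
print(valid_label("Positive.", {"positive", "negative"}))        # True
print(valid_date("Jan 31"))                                      # False
```

Each one is a cheap, deterministic check, so you can run them on every response without touching the model again.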
```python
def classify(text, labels, system=None, max_attempts=3):
    label_str = " / ".join(f'"{l}"' for l in labels)
    base = f'Classify the input as exactly one of: {label_str}. Reply with only that single word.\n\nInput: {text}'
    prompt = base
    for attempt in range(max_attempts):
        out = ask(prompt, system).strip().strip(".").lower()
        if out in [l.lower() for l in labels]:
            return out
        prompt = base + f'\nThe previous answer "{out}" was not in the allowed set. Reply with EXACTLY one of: {label_str}.'
    raise ValueError(f"could not get valid label: {out!r}")
```

Retry-on-bad-output now lives inside the classify helper. Every call site gets validation for free.
If a bad answer is predictably close to a good one, fix it in code: map maybe → neutral rather than retry. If you can rescue the bad output with code, do that.

Pydantic-AI's model_settings lets you tweak temperature on a retry. Higher temperature = more randomness = different output, which sometimes shakes the model out of a wrong-format groove. Beyond beginner scope, but worth knowing it exists.
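A minimal sketch of the rescue idea, assuming you maintain the synonym map yourself (the mappings below are examples, not a standard):

```python
SYNONYMS = {
    "maybe": "neutral",
    "mixed": "neutral",
    "good": "positive",
    "bad": "negative",
}
ALLOWED = {"positive", "negative", "neutral"}

def rescue(out):
    """Map a near-miss onto an allowed label; return None if it can't be rescued."""
    out = out.strip().strip(".").lower()
    if out in ALLOWED:
        return out
    return SYNONYMS.get(out)  # None → caller falls back to a retry

print(rescue("Maybe."))  # neutral
print(rescue("great"))   # None — not rescuable, retry instead
```

Rescue first, retry second: a dictionary lookup is free, while every retry costs another model call.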