Sampling means the same prompt can return different answers (week 1, day 2). For most tasks one call is fine. But for high-stakes classifications, you can call N times and vote.
The pattern: call N times, count occurrences, return the most common:
from collections import Counter

from pydantic_ai import Agent

agent = Agent("openai:gpt-4o-mini")  # any model your provider supports

def classify_majority(text, n=5):
    prompt = f'Classify: "{text}". Reply with one word: positive or negative.'
    answers = []
    for _ in range(n):
        out = agent.run_sync(prompt).output.strip().strip(".").lower()
        if out in {"positive", "negative"}:  # discard malformed replies
            answers.append(out)
    counter = Counter(answers)
    most_common = counter.most_common(1)[0][0]
    return most_common, dict(counter)

label, votes = classify_majority("It's fine, I guess.", n=5)
print(f"final: {label} votes: {votes}")

5 calls = 5 quota slots. Worth it?
Only when the cost of being wrong is high. If you're routing 1,000 customer support tickets and a wrong label costs you one misrouted ticket, a single call each is probably fine. If you're diagnosing a medical symptom from notes, even 5× the cost is cheap insurance against a bad label. Self-consistency is a judgment call about cost versus reliability.
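A practical refinement the lesson's code doesn't include (this is a sketch, and the 0.8 threshold is a made-up cutoff you'd tune per task): the vote split is a free confidence signal, so you can escalate low-agreement items to a human instead of trusting a narrow majority.

```python
from collections import Counter

def vote_with_confidence(answers, threshold=0.8):
    """Majority label plus agreement ratio; flag weak splits for review.

    `threshold` is a hypothetical cutoff -- tune it for your task.
    """
    counter = Counter(answers)
    label, count = counter.most_common(1)[0]
    agreement = count / len(answers)
    needs_review = agreement < threshold
    return label, agreement, needs_review

print(vote_with_confidence(["positive"] * 4 + ["negative"]))      # strong majority
print(vote_with_confidence(["positive"] * 3 + ["negative"] * 2))  # split vote, flagged
```

This keeps the 5× spend only for the items where the model actually disagrees with itself.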
Does temperature affect this?
Yes. A slightly higher temperature (e.g., 0.7) gives more sampling variation, so disagreement actually surfaces. With temperature 0 (near-deterministic), all 5 calls typically return the same answer and self-consistency is wasted. The pattern shines when there's genuine uncertainty.
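To see why temperature changes the vote spread, here's a model-free sketch: temperature divides the logits before sampling, so higher values flatten the output distribution and let the minority label actually get sampled. The logits below are made-up numbers, not from any real model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature -> flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # made-up scores for "positive" vs "negative"
print(softmax_with_temperature(logits, 0.1))  # near-certain: voting adds nothing
print(softmax_with_temperature(logits, 0.7))  # real spread: majority vote helps
```

At temperature 0.1 the top label gets essentially all the probability mass, so 5 samples agree and the vote is wasted; at 0.7 the runner-up keeps real mass, which is exactly the uncertainty self-consistency averages out.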
The idea: when one LLM call has uncertainty, N calls and a majority vote often beat one call.
from collections import Counter

answers = [call(prompt) for _ in range(N)]
final = Counter(answers).most_common(1)[0][0]

Counter(...) counts occurrences. .most_common(1) returns the single most-frequent element with its count. Ties break by insertion order.
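The tie-break behavior is easy to verify directly: Counter tracks keys in first-seen order and most_common sorts stably, so an exact tie goes to whichever answer appeared first.

```python
from collections import Counter

votes = ["negative", "positive", "positive", "negative"]  # 2-2 tie
print(Counter(votes).most_common(1))  # [('negative', 2)] -- "negative" was seen first
```

If a coin-flip tie-break bothers you, use an odd N so a two-way tie can't happen.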
Budget for self-consistency:
total cost ≈ N × (one-call cost)
With N=5, you're spending 5× per item. For a 1,000-item batch that's 5,000 calls. Compare that against the cost of a wrong answer: reach for self-consistency only when the cost of being wrong is much higher than the cost of 5×.
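The budget arithmetic is simple enough to put in a helper. The per-call price below is a placeholder, not a real rate; plug in your provider's pricing.

```python
def self_consistency_cost(items, n=5, per_call=0.002):
    """Total calls and dollar cost for an N-vote batch.

    `per_call` is a placeholder rate -- substitute your provider's real price.
    """
    calls = items * n
    return calls, calls * per_call

calls, dollars = self_consistency_cost(1000, n=5)
print(f"{calls} calls, ${dollars:.2f}")  # the 1000-item, N=5 batch from above
```

Running the numbers before the batch, rather than after the invoice, is the whole point.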
classify_majority is a helper that wraps the vote. Every call site benefits.