You wrote v2 of a prompt; your gut says it's better than v1. A/B testing confirms (or disconfirms) by running both against the same eval cases and comparing pass rates.
```python
results = {}
for version in ["v1", "v2"]:
    passes = 0
    for query, expected in cases:
        answer = call_with_version(version, query)
        # Substring check: pass if the expected string appears anywhere in the answer
        if expected.lower() in answer.lower():
            passes += 1
    results[version] = passes / len(cases)

print(f"v1: {results['v1']:.0%}")
print(f"v2: {results['v2']:.0%}")

winner = max(results, key=results.get)
print(f"winner: {winner}")
```

That's just running the eval suite twice with different prompts.
Exactly. The infrastructure is the eval suite (week 2); A/B is the protocol — run the same fixed cases through both versions and compare. The decision threshold ("v2 wins by ≥ 5 percentage points") is yours.
And we ship the winner?
Update the prompt registry's ACTIVE_VERSION to v2. The whole point of versioning + A/B testing: ship-or-not decisions become data-driven, not gut-driven.
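As a minimal sketch of that ship step, assuming a registry shaped like the one from the versioning lesson (a `PROMPTS` dict keyed by version plus an `ACTIVE_VERSION` constant; both names are illustrative, not a fixed API):

```python
# prompt_registry.py -- illustrative shape, assuming the versioning-lesson registry
PROMPTS = {
    "v1": "You are a support assistant. Answer briefly.",
    "v2": "You are a support assistant. Answer briefly and cite the docs.",
}
ACTIVE_VERSION = "v2"  # flipped from "v1" after v2 won the A/B test

def get_active_prompt() -> str:
    return PROMPTS[ACTIVE_VERSION]
```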
```python
def ab_test(versions, cases):
    """Run the same eval cases against each prompt version and compare pass rates."""
    results = {}
    for v in versions:
        passes = 0
        for query, expected in cases:
            answer = call_with_version(v, query).lower()
            if expected.lower() in answer:
                passes += 1
        results[v] = {
            "passes": passes,
            "total": len(cases),
            "rate": passes / len(cases),
        }
    return results
```

Go beyond pass rate when you can:
| Metric | Why |
|---|---|
| Pass rate | The headline number |
| Per-case wins | Which cases improved? Which regressed? |
| Latency | v2 may be more accurate but slower |
| Token cost | v2 may be longer prompt + longer answer |
| Refusal rate | Did v2 start refusing things v1 answered? |
A prompt change can win on accuracy and lose on cost. Track both.
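Here is a sketch of how latency and refusals could be folded into the same loop. The refusal keyword list is a crude placeholder heuristic, and token cost is omitted because it comes from your API client's usage data, which depends on which client you call:

```python
import time

# Crude heuristic; adjust the markers for your model's refusal phrasing
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def ab_test_extended(versions, cases):
    results = {}
    for v in versions:
        passes, refusals, latencies = 0, 0, []
        for query, expected in cases:
            start = time.perf_counter()
            answer = call_with_version(v, query)
            latencies.append(time.perf_counter() - start)
            lowered = answer.lower()
            if expected.lower() in lowered:
                passes += 1
            if any(marker in lowered for marker in REFUSAL_MARKERS):
                refusals += 1
        results[v] = {
            "rate": passes / len(cases),
            "refusal_rate": refusals / len(cases),
            "avg_latency_s": sum(latencies) / len(latencies),
        }
    return results
```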
With 3 cases, "v2 passed 3/3, v1 passed 2/3" is suggestive but not conclusive. Production A/B testing needs ~30+ cases to distinguish a real difference from sampling noise. For lesson scope we keep N small and accept that the decision protocol is what matters, not the statistical rigor.
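If you do have enough cases and want a rough significance check, a Fisher's exact test on the pass/fail counts is one common option. This sketch assumes SciPy is available and is illustrative, not part of the lesson's protocol:

```python
from scipy.stats import fisher_exact

def looks_real(passes_a, passes_b, total):
    """Rough check: is the pass-rate gap bigger than sampling noise would explain?"""
    table = [
        [passes_a, total - passes_a],  # v1: passes, fails
        [passes_b, total - passes_b],  # v2: passes, fails
    ]
    _, p_value = fisher_exact(table)
    return p_value < 0.05

# With 3 cases, 2/3 vs 3/3 is nowhere near significant.
print(looks_real(passes_a=2, passes_b=3, total=3))  # False
```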
Common thresholds depend on how risky the change is: a low-stakes wording tweak might ship on any measurable gain, while a riskier change should clear a wider margin. Either way, ship-or-not should be a clear function of the numbers, not a vibe (see the sketch below).
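A minimal sketch of that decision function, using the ≥ 5-percentage-point margin from earlier as the default; the threshold value is an example, not a recommendation:

```python
def should_ship(results, baseline="v1", candidate="v2", min_gain=0.05):
    """Ship the candidate only if its pass rate beats the baseline by at least min_gain."""
    gain = results[candidate]["rate"] - results[baseline]["rate"]
    return gain >= min_gain

results = ab_test(["v1", "v2"], cases)
if should_ship(results, min_gain=0.05):
    print("ship v2: update ACTIVE_VERSION in the registry")
else:
    print("keep v1")
```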
The pattern generalizes to N-way comparisons:

```python
results = ab_test(["v1", "v2", "v3", "v4"], cases)
```

Same shape, more rows in the result table. Production tools (PromptLayer, Helicone) automate this, and the open-source library promptfoo does it from a YAML config.
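To eyeball an N-way run, a few lines of formatting go a long way; this is just a print loop over the `ab_test` output:

```python
results = ab_test(["v1", "v2", "v3", "v4"], cases)
for version, r in sorted(results.items(), key=lambda kv: kv[1]["rate"], reverse=True):
    print(f"{version}: {r['passes']}/{r['total']} ({r['rate']:.0%})")
```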