You wrote v2 of a prompt; your gut says it's better than v1. A/B testing confirms (or disconfirms) by running both against the same eval cases and comparing pass rates.
```python
results = {}
for version in ["v1", "v2"]:
    passes = 0
    for query, expected in cases:
        answer = call_with_version(version, query)
        # Substring check: pass if the expected string appears anywhere in the answer
        if expected.lower() in answer.lower():
            passes += 1
    results[version] = passes / len(cases)

print(f"v1: {results['v1']:.0%}")
print(f"v2: {results['v2']:.0%}")

winner = max(results, key=results.get)
print(f"winner: {winner}")
```

That's just running the eval suite twice with different prompts.
Exactly. The infrastructure is the eval suite (week 2); A/B is the protocol — run the same fixed cases through both versions and compare. The decision threshold ("v2 wins by ≥ 5 percentage points") is yours.
And we ship the winner?
Update the prompt registry's ACTIVE_VERSION to v2. The whole point of versioning + A/B testing: ship-or-not decisions become data-driven, not gut-driven.
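As a minimal sketch of that ship step, assuming a registry shaped like the one from the versioning lesson (a `PROMPTS` dict keyed by version plus an `ACTIVE_VERSION` constant; both names are illustrative, not a fixed API):

```python
# prompt_registry.py -- illustrative shape, assuming the versioning-lesson registry
PROMPTS = {
    "v1": "You are a support assistant. Answer briefly.",
    "v2": "You are a support assistant. Answer briefly and cite the docs.",
}
ACTIVE_VERSION = "v2"  # flipped from "v1" after v2 won the A/B test

def get_active_prompt() -> str:
    return PROMPTS[ACTIVE_VERSION]
```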
```python
def ab_test(versions, cases):
    """Run the same eval cases against each prompt version and compare pass rates."""
    results = {}
    for v in versions:
        passes = 0
        for query, expected in cases:
            answer = call_with_version(v, query).lower()
            if expected.lower() in answer:
                passes += 1
        results[v] = {
            "passes": passes,
            "total": len(cases),
            "rate": passes / len(cases),
        }
    return results
```

Go beyond pass rate when you can:
| Metric | Why |
|---|---|
| Pass rate | The headline number |
| Per-case wins | Which cases improved? Which regressed? |
| Latency | v2 may be more accurate but slower |
| Token cost | v2 may be longer prompt + longer answer |
| Refusal rate | Did v2 start refusing things v1 answered? |
A prompt change can win on accuracy and lose on cost. Track both.
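Here is a sketch of how latency and refusals could be folded into the same loop. The refusal keyword list is a crude placeholder heuristic, and token cost is omitted because it comes from your API client's usage data, which depends on which client you call:

```python
import time

# Crude heuristic; adjust the markers for your model's refusal phrasing
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def ab_test_extended(versions, cases):
    results = {}
    for v in versions:
        passes, refusals, latencies = 0, 0, []
        for query, expected in cases:
            start = time.perf_counter()
            answer = call_with_version(v, query)
            latencies.append(time.perf_counter() - start)
            lowered = answer.lower()
            if expected.lower() in lowered:
                passes += 1
            if any(marker in lowered for marker in REFUSAL_MARKERS):
                refusals += 1
        results[v] = {
            "rate": passes / len(cases),
            "refusal_rate": refusals / len(cases),
            "avg_latency_s": sum(latencies) / len(latencies),
        }
    return results
```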
With 3 cases, "v2 passed 3/3, v1 passed 2/3" is suggestive but not conclusive. Production A/B testing needs ~30+ cases to distinguish a real difference from sampling noise. For lesson scope we keep N small and accept that the decision protocol is what matters, not the statistical rigor.
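If you do have enough cases and want a rough significance check, a Fisher's exact test on the pass/fail counts is one common option. This sketch assumes SciPy is available and is illustrative, not part of the lesson's protocol:

```python
from scipy.stats import fisher_exact

def looks_real(passes_a, passes_b, total):
    """Rough check: is the pass-rate gap bigger than sampling noise would explain?"""
    table = [
        [passes_a, total - passes_a],  # v1: passes, fails
        [passes_b, total - passes_b],  # v2: passes, fails
    ]
    _, p_value = fisher_exact(table)
    return p_value < 0.05

# With 3 cases, 2/3 vs 3/3 is nowhere near significant.
print(looks_real(passes_a=2, passes_b=3, total=3))  # False
```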
Common thresholds depend on how risky the change is: a low-stakes wording tweak might ship on any measurable gain, while a riskier change should clear a wider margin. Either way, ship-or-not should be a clear function of the numbers, not a vibe (see the sketch below).
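A minimal sketch of that decision function, using the ≥ 5-percentage-point margin from earlier as the default; the threshold value is an example, not a recommendation:

```python
def should_ship(results, baseline="v1", candidate="v2", min_gain=0.05):
    """Ship the candidate only if its pass rate beats the baseline by at least min_gain."""
    gain = results[candidate]["rate"] - results[baseline]["rate"]
    return gain >= min_gain

results = ab_test(["v1", "v2"], cases)
if should_ship(results, min_gain=0.05):
    print("ship v2: update ACTIVE_VERSION in the registry")
else:
    print("keep v1")
```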
The pattern generalizes to N-way comparisons:

```python
results = ab_test(["v1", "v2", "v3", "v4"], cases)
```

Same shape, more rows in the result table. Production tools (PromptLayer, Helicone) automate this, and the open-source library promptfoo does it from a YAML config.
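To eyeball an N-way run, a few lines of formatting go a long way; this is just a print loop over the `ab_test` output:

```python
results = ab_test(["v1", "v2", "v3", "v4"], cases)
for version, r in sorted(results.items(), key=lambda kv: kv[1]["rate"], reverse=True):
    print(f"{version}: {r['passes']}/{r['total']} ({r['rate']:.0%})")
```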