AI Patterns introduced eval — a few hand-checked examples. Eval at scale is the same idea, more cases, automated pass rate. The shape:
```python
eval_cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

passes = 0
for query, expected in eval_cases:
    answer = pipeline(query)
    if expected.lower() in answer.lower():
        passes += 1

rate = passes / len(eval_cases)
print(f"pass rate: {rate:.0%}")
assert rate >= 0.66, f"regression — {passes}/{len(eval_cases)} passed"
```

Substring matching is good enough?
For factual recall — yes, surprisingly often. "Did the answer mention Paris?" is a useful signal. For style / tone / multi-fact answers, you'd need richer checks (regex, structured-output validation, or a second LLM as judge — though we don't use llm_judge in this curriculum).
And we run this on every prompt change?
Yes. The eval suite is the seatbelt — it catches regressions before they ship. CI runs the eval and blocks the merge if the pass rate drops.
```python
def run_eval(pipeline, cases, threshold=0.66):
    passes = 0
    failures = []
    for query, expected in cases:
        answer = pipeline(query).lower()
        if expected.lower() in answer:
            passes += 1
        else:
            failures.append((query, expected, answer))
    rate = passes / len(cases)
    print(f"{passes}/{len(cases)} passed ({rate:.0%})")
    for q, exp, ans in failures:
        print(f"  FAIL: {q!r} expected {exp!r} in answer")
    if rate < threshold:
        raise AssertionError(f"eval below threshold: {rate:.0%} < {threshold:.0%}")
    return rate
```

Each case is a `(query, expected_substring)` pair. The simplest matcher: is the expected substring in the answer? Cheap to write, cheap to check. It captures "the model answered correctly" for fact-recall queries.
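To see the gate in action, here is a minimal usage sketch. `fake_pipeline` is a stand-in for whatever function wraps your LLM call; it is not part of the lesson's code.

```python
# A stand-in pipeline for demonstration — real code would call your LLM here.
def fake_pipeline(query: str) -> str:
    canned = {
        "What's the capital of France?": "The capital of France is Paris.",
        "What's the largest ocean?": "That would be the Pacific Ocean.",
        "Where is Mount Everest?": "Mount Everest sits in the Himalayas.",
    }
    return canned.get(query, "I don't know.")

cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

# Prints "3/3 passed (100%)" and returns 1.0; a drop below the
# threshold would raise AssertionError instead.
run_eval(fake_pipeline, cases)
```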
When substring isn't enough (a few of these are sketched after the table):
| Matcher | When |
|---|---|
| Regex | Date format, ID format, structured output |
| `json.loads` + key check | The answer is JSON; check a field is present |
| Set membership | Answer is one of an allowed set |
| Multi-substring (all must appear) | Multi-fact answers |
| Pydantic validation | Typed structured output |
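To make the table concrete, here are minimal matcher sketches. The function names and arguments are illustrative, not part of the lesson's code; each returns True/False so it can slot into `run_eval`'s pass/fail loop with a small change to the case format.

```python
import json
import re

# Regex matcher: the answer must contain an ISO-style date (illustrative pattern).
def matches_regex(answer: str, pattern: str = r"\d{4}-\d{2}-\d{2}") -> bool:
    return re.search(pattern, answer) is not None

# JSON matcher: the answer must parse as a JSON object with a required key.
def has_json_key(answer: str, key: str) -> bool:
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and key in parsed

# Set membership: the normalized answer must be one of an allowed set.
def in_allowed_set(answer: str, allowed: set[str]) -> bool:
    return answer.strip().lower() in allowed

# Multi-substring: every expected fact must appear somewhere in the answer.
def all_present(answer: str, expected_parts: list[str]) -> bool:
    lowered = answer.lower()
    return all(part.lower() in lowered for part in expected_parts)
```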
The assertion `rate >= threshold` is what makes the suite block on regression. Without it, the eval is just a printout — a human has to notice. With it, CI or a build pipeline catches the regression for you.
A 100% threshold is brittle (a single LLM flake fails the suite); a 50% threshold lets real bugs through. 66–80% is typical: strict enough to catch regressions, lenient enough to absorb LLM stochasticity.
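One way to wire this into CI is a plain pytest test that calls `run_eval`; because `run_eval` raises `AssertionError` below the threshold, a regression fails the test and the merge is blocked. The module paths below are assumptions, not the curriculum's layout.

```python
# test_eval.py — a sketch of a CI gate; the import paths are hypothetical
# and stand in for wherever your run_eval helper and pipeline live.
from myproject.evals import run_eval
from myproject.rag import pipeline

EVAL_CASES = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
]

def test_eval_suite():
    # run_eval raises AssertionError if the pass rate drops below the
    # threshold, so a regression fails this test and CI blocks the merge.
    run_eval(pipeline, EVAL_CASES, threshold=0.66)
```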
Real eval suites have hundreds of cases. Each is one LLM call. Run the full suite on every prompt change → expensive. Common pattern: a small smoke subset on every change, the full suite nightly or before a release.
For this lesson we run a 3-case smoke (cost-aware authoring) — same shape, smaller numbers.
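A sketch of that split, assuming all cases live in one list and an environment variable picks the tier; the variable name and the 10-case smoke size are illustrative, not prescribed by the lesson.

```python
import os

# full_cases would hold the hundreds of real cases; three stand in for them here.
full_cases = [
    ("What's the capital of France?", "paris"),
    ("What's the largest ocean?", "pacific"),
    ("Where is Mount Everest?", "himalaya"),
    # ... hundreds more in a real suite
]

# Smoke tier: a small, fast subset on every prompt change.
# Full tier: everything, run nightly or before a release.
tier = os.environ.get("EVAL_TIER", "smoke")
cases = full_cases[:10] if tier == "smoke" else full_cases

run_eval(pipeline, cases)
```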