"Did the LLM do well?" is a vague question. Did the output meet these specific criteria? is a question your code can answer. Today: a tiny eval checklist for one output.
```python
from pydantic_ai import Agent

# Any pydantic_ai model string works here; swap in your own.
agent = Agent("openai:gpt-4o")

prompt = "Write a one-sentence elevator pitch for a library book club. Mention the word 'book'."
output = agent.run_sync(prompt).output.strip()

# Three criteria
criteria = {
    "length_ok": 5 <= len(output.split()) <= 30,  # 5-30 words
    "ends_with_period": output.endswith("."),     # format check
    "contains_book": "book" in output.lower(),    # keyword check
}

for name, passed in criteria.items():
    print(f"  {'PASS' if passed else 'FAIL'} — {name}")

assert all(criteria.values()), f"failed: {criteria}"
```

Each criterion is a Python check. Pass/fail is a bool. The eval is a dict.
Right. The structure scales. Three criteria today; thirty if you're shipping serious software. Each one is a small predicate — easy to write, easy to debug, free to run.
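One way to keep thirty criteria manageable is to register each predicate once and build the dict in a loop. A sketch, not part of the lesson's code; the `CHECKS` and `evaluate` names are ours:

```python
# Illustrative sketch: register predicates once, evaluate in one pass.
CHECKS = {
    "length_ok": lambda out: 5 <= len(out.split()) <= 30,
    "ends_with_period": lambda out: out.endswith("."),
    "contains_book": lambda out: "book" in out.lower(),
}

def evaluate(output: str) -> dict[str, bool]:
    """Run every registered predicate against one output."""
    return {name: check(output) for name, check in CHECKS.items()}
```

Criterion number thirty is one more entry in the dict; nothing else changes.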
What if a criterion fails — do I retry?
For single-output evaluation, you might retry (with feedback, like in L11). For suite evaluation (next week), you don't retry — you record the failure and use it to drive prompt improvements. Evaluation tells you what's broken; iteration fixes it.
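A minimal retry-with-feedback sketch, assuming the `agent`, `prompt`, and three-criterion checklist from the example above (the single-retry cap and the feedback wording are our choices, not prescribed):

```python
# Sketch: retry once, folding the failed criteria back into the prompt.
def check(output: str) -> dict[str, bool]:
    return {
        "length_ok": 5 <= len(output.split()) <= 30,
        "ends_with_period": output.endswith("."),
        "contains_book": "book" in output.lower(),
    }

output = agent.run_sync(prompt).output.strip()
for attempt in range(2):  # one retry, then give up
    failures = [name for name, passed in check(output).items() if not passed]
    if not failures:
        break
    feedback = f"{prompt}\n\nYour last draft failed these checks: {', '.join(failures)}. Fix them."
    output = agent.run_sync(feedback).output.strip()
```

Stripped of specifics, the pattern generalizes: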
```python
output = call_llm(prompt)

criteria = {
    "check_1": predicate_1(output),
    "check_2": predicate_2(output),
    "check_3": predicate_3(output),
}

score = sum(criteria.values()) / len(criteria)  # pass rate
```

A dict of name → bool. Pass rate is sum / len. Trivial to compute, trivial to log.
| Family | Examples |
|---|---|
| Length | `5 <= words <= 30`, `chars < 500` |
| Format | ends with period, JSON-parseable, regex match |
| Content | contains required keyword, mentions specific entity |
| Negative | does NOT contain blocked terms |
| Structure | required keys present, nested types correct |
| Numeric | sum of fields ≈ expected, count of items in range |
All deterministic. All cheap.
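To make the families concrete, here is one predicate per row. A sketch only; the blocklist, the `{"title", "pitch"}` schema, and the `line_items`/`total` fields are invented for illustration:

```python
import json

BLOCKED = {"guarantee", "free money"}  # illustrative blocklist

def length_ok(out: str) -> bool:  # Length
    return 5 <= len(out.split()) <= 30

def is_json_object(out: str) -> bool:  # Format
    try:
        return isinstance(json.loads(out), dict)
    except json.JSONDecodeError:
        return False

def mentions_book(out: str) -> bool:  # Content
    return "book" in out.lower()

def no_blocked_terms(out: str) -> bool:  # Negative
    return not any(term in out.lower() for term in BLOCKED)

def has_required_keys(out: str) -> bool:  # Structure
    try:
        data = json.loads(out)
    except json.JSONDecodeError:
        return False
    return {"title", "pitch"} <= data.keys()

def totals_consistent(out: str) -> bool:  # Numeric: sum of fields ≈ expected
    try:
        data = json.loads(out)
        return abs(sum(data["line_items"]) - data["total"]) < 0.01
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
```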
We explicitly do not use an LLM as the evaluator. (See feedback_no_llm_judge.md.) The reasons follow from the table above: deterministic checks are free to run, reproducible, and easy to debug; an LLM judge is none of those, and it reintroduces the very unreliability you are trying to measure.
If a criterion seems to need an LLM judge ("is this response polite?"), narrow it instead: write a deterministic check that approximates politeness (no all-caps shouting, no words from a blocklist, not hostile-short in length).
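A sketch of that politeness approximation; the blocklist and the length threshold are placeholders you would tune:

```python
RUDE_WORDS = {"stupid", "idiot"}  # placeholder blocklist; extend for real use

def approx_polite(out: str) -> bool:
    """Deterministic stand-in for 'is this response polite?'."""
    not_shouting = not out.isupper()                       # no all-caps
    clean = not any(w in out.lower() for w in RUDE_WORDS)  # no blocked words
    not_curt = len(out.split()) >= 4                       # not hostile-short
    return not_shouting and clean and not_curt
```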
One output. Three deterministic checks. All three must pass. Foundation for next week's eval suite (input/expected pairs + the same kind of checking).
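For a rough sense of where that goes, the same checking run over input/expected pairs might look like this. An illustrative preview only; the field names are ours, not next week's actual code, and `agent` is the one defined above:

```python
# A suite is just cases + the same deterministic checks.
cases = [
    {"prompt": "Pitch a library book club.", "must_contain": "book"},
    {"prompt": "Pitch a cooking class.", "must_contain": "cook"},
]

results = []
for case in cases:
    out = agent.run_sync(case["prompt"]).output.strip()
    results.append(case["must_contain"] in out.lower())

print(f"pass rate: {sum(results)}/{len(results)}")
```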
"Did the LLM do well?" is a vague question. Did the output meet these specific criteria? is a question your code can answer. Today: a tiny eval checklist for one output.
from pydantic_ai import Agent
prompt = "Write a one-sentence elevator pitch for a library book club. Mention the word 'book'."
output = Agent(model).run_sync(prompt).output.strip()
# Three criteria
criteria = {
"length_ok": 5 <= len(output.split()) <= 30, # 5-30 words
"ends_with_period": output.endswith("."), # format check
"contains_book": "book" in output.lower(), # keyword check
}
for name, passed in criteria.items():
print(f" {'PASS' if passed else 'FAIL'} — {name}")
assert all(criteria.values()), f"failed: {criteria}"Each criterion is a Python check. Pass/fail is bool. The eval is a dict.
Right. The structure scales. Three criteria today; thirty if you're shipping serious software. Each one is a small predicate — easy to write, easy to debug, free to run.
What if a criterion fails — do I retry?
For single-output evaluation, you might retry (with feedback, like in L11). For suite evaluation (next week), you don't retry — you record the failure and use it to drive prompt improvements. Evaluation tells you what's broken; iteration fixes it.
output = call_llm(prompt)
criteria = {
"check_1": predicate_1(output),
"check_2": predicate_2(output),
"check_3": predicate_3(output),
}
score = sum(criteria.values()) / len(criteria) # pass rateA dict of name → bool. Pass rate is sum / len. Trivial to compute, trivial to log.
| Family | Examples |
|---|---|
| Length | 5 <= words <= 30, chars < 500 |
| Format | ends with period, JSON-parseable, regex match |
| Content | contains required keyword, mentions specific entity |
| Negative | does NOT contain blocked terms |
| Structure | required keys present, nested types correct |
| Numeric | sum of fields ≈ expected, count of items in range |
All deterministic. All cheap.
We explicitly do not use an LLM as the evaluator. (See feedback_no_llm_judge.md.) Reasons:
If a criterion seems to need an LLM judge ("is this response polite?"), narrow the lesson — write a deterministic check that approximates politeness (no all-caps, no swearwords from a list, length not hostile-short).
One output. Three deterministic checks. All three must pass. Foundation for next week's eval suite (input/expected pairs + the same kind of checking).
Create a free account to get started. Paid plans unlock all tracks.