RAG works when the answer is in the corpus. When it isn't, two things go wrong:
```python
results = top_k(query, store, k=3)
top_score = results[0][1]
second_score = results[1][1]

if top_score < 0.3:
    return "out-of-corpus"   # nothing relevant
if top_score - second_score < 0.05:
    return "ambiguous"       # multiple chunks tied
return "answer"              # confident retrieval
```

The threshold values, 0.3 and 0.05, are tunable?
Yes. You set them once against your eval suite. Too strict → many "I don't know" responses. Too loose → hallucinated answers slip through. Tune in week 2 (eval lesson, tomorrow).
And the model still has to refuse?
No — your code refuses. Skip the LLM call entirely if retrieval failed. Saves quota and removes the chance the model invents an answer.
The gating pattern:
```python
def rag_with_gating(query, store, k=3, abs_threshold=0.3, gap_threshold=0.05):
    top = top_k(query, store, k)
    top_score = top[0][1]
    second_score = top[1][1] if len(top) > 1 else 0.0

    if top_score < abs_threshold:
        return {"status": "out-of-corpus", "answer": "I don't have a relevant answer."}
    if top_score - second_score < gap_threshold:
        return {"status": "ambiguous", "answer": "Could you clarify what you mean?"}

    # confident retrieval — generate
    contexts = [store[cid]["text"] for cid, _ in top]
    answer = generate(query, contexts)
    return {"status": "answered", "answer": answer}
```

The `status` field tells your downstream code why no answer was produced, so you can log and aggregate failures.
| Threshold | Catches | Tuning |
|---|---|---|
| Absolute (`top_score < 0.3`) | Out-of-corpus: nothing in the store is relevant | Tighten if the hallucination rate is high |
| Gap (`top_score - second_score < 0.05`) | Ambiguous: multiple chunks tied; the user query is unclear | Tighten if the model picks the wrong one of two close chunks |
With a 5-chunk corpus, both thresholds are sensitive to small changes. In production you tune them against thousands of real queries labelled "this should refuse" / "this should answer".
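A sketch of what that tuning loop can look like, assuming a small labelled eval set where each case records whether the system should have refused; the example cases and grid values are illustrative, not from the lesson:

```python
# Labelled eval cases: (query, should_refuse). Illustrative data only.
eval_cases = [
    ("What is our refund policy?", False),
    ("Who won the 2024 election?", True),
]

def refusal_accuracy(abs_t, gap_t):
    # Fraction of eval cases where the gate's refuse/answer decision matches the label.
    correct = 0
    for query, should_refuse in eval_cases:
        result = rag_with_gating(query, store, abs_threshold=abs_t, gap_threshold=gap_t)
        refused = result["status"] != "answered"
        correct += (refused == should_refuse)
    return correct / len(eval_cases)

# Grid-search the two thresholds and keep the best pair.
best = max(
    ((a, g) for a in (0.2, 0.3, 0.4) for g in (0.02, 0.05, 0.1)),
    key=lambda pair: refusal_accuracy(*pair),
)
print("best (abs_threshold, gap_threshold):", best)
```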
Stale corpus: your data is six months old, the world has changed, and the model confidently echoes outdated facts. The fix is operational (refresh embeddings on a schedule), not algorithmic.
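As a rough sketch of that operational fix: a scheduled job re-fetches and re-embeds chunks older than some cutoff. The `fetch_latest_text` and `embed` helpers and the `updated_at` field are assumptions for illustration, not part of the lesson's store schema:

```python
import time

MAX_AGE_SECONDS = 30 * 24 * 3600  # refresh anything older than ~30 days

def refresh_corpus(store, fetch_latest_text, embed, now=None):
    """Re-fetch and re-embed chunks whose entry is older than the cutoff.

    Assumes each store entry has "text", "embedding", and "updated_at" fields;
    fetch_latest_text(cid) returns the current source text for a chunk and
    embed(text) returns its embedding vector. Both are hypothetical helpers.
    """
    now = now or time.time()
    refreshed = 0
    for cid, chunk in store.items():
        if now - chunk.get("updated_at", 0) > MAX_AGE_SECONDS:
            chunk["text"] = fetch_latest_text(cid)
            chunk["embedding"] = embed(chunk["text"])
            chunk["updated_at"] = now
            refreshed += 1
    return refreshed
```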