RAG works when the answer is in the corpus. When it isn't, two things go wrong:
```python
results = top_k(query, store, k=3)
top_score = results[0][1]
second_score = results[1][1]

if top_score < 0.3:
    return "out-of-corpus"   # nothing relevant
if top_score - second_score < 0.05:
    return "ambiguous"       # multiple chunks tied
return "answer"              # confident retrieval
```

The threshold values, 0.3 and 0.05, are tunable?
Yes. You set them once against your eval suite. Too strict → many "I don't know" responses. Too loose → hallucinated answers slip through. Tune in week 2 (eval lesson, tomorrow).
And the model still has to refuse?
No — your code refuses. Skip the LLM call entirely if retrieval failed. Saves quota and removes the chance the model invents an answer.
The gating pattern:
```python
def rag_with_gating(query, store, k=3, abs_threshold=0.3, gap_threshold=0.05):
    top = top_k(query, store, k)
    top_score = top[0][1]
    second_score = top[1][1] if len(top) > 1 else 0.0

    if top_score < abs_threshold:
        return {"status": "out-of-corpus", "answer": "I don't have a relevant answer."}
    if top_score - second_score < gap_threshold:
        return {"status": "ambiguous", "answer": "Could you clarify what you mean?"}

    # confident retrieval — generate
    contexts = [store[cid]["text"] for cid, _ in top]
    answer = generate(query, contexts)
    return {"status": "answered", "answer": answer}
```

The `status` field tells your downstream code why no answer was produced, so you can log and aggregate failures.
| Threshold | Catches | Tuning |
|---|---|---|
| Absolute (`top_score < 0.3`) | Out-of-corpus: nothing in the store is relevant | Tighten if the hallucination rate is high |
| Gap (`top_score - second_score < 0.05`) | Ambiguous: multiple chunks tied; the user query is unclear | Tighten if the model picks the wrong one of two close chunks |
With a 5-chunk corpus, both thresholds are sensitive to small changes. In production you tune them against thousands of real queries labelled "this should refuse" / "this should answer".
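A sketch of what that tuning loop can look like, assuming a small labelled eval set where each case records whether the system should have refused; the example cases and grid values are illustrative, not from the lesson:

```python
# Labelled eval cases: (query, should_refuse). Illustrative data only.
eval_cases = [
    ("What is our refund policy?", False),
    ("Who won the 2024 election?", True),
]

def refusal_accuracy(abs_t, gap_t):
    # Fraction of eval cases where the gate's refuse/answer decision matches the label.
    correct = 0
    for query, should_refuse in eval_cases:
        result = rag_with_gating(query, store, abs_threshold=abs_t, gap_threshold=gap_t)
        refused = result["status"] != "answered"
        correct += (refused == should_refuse)
    return correct / len(eval_cases)

# Grid-search the two thresholds and keep the best pair.
best = max(
    ((a, g) for a in (0.2, 0.3, 0.4) for g in (0.02, 0.05, 0.1)),
    key=lambda pair: refusal_accuracy(*pair),
)
print("best (abs_threshold, gap_threshold):", best)
```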
Stale corpus: your data is six months old, the world has changed, and the model confidently echoes outdated facts. The fix is operational (refresh embeddings on a schedule), not algorithmic.
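As a rough sketch of that operational fix: a scheduled job re-fetches and re-embeds chunks older than some cutoff. The `fetch_latest_text` and `embed` helpers and the `updated_at` field are assumptions for illustration, not part of the lesson's store schema:

```python
import time

MAX_AGE_SECONDS = 30 * 24 * 3600  # refresh anything older than ~30 days

def refresh_corpus(store, fetch_latest_text, embed, now=None):
    """Re-fetch and re-embed chunks whose entry is older than the cutoff.

    Assumes each store entry has "text", "embedding", and "updated_at" fields;
    fetch_latest_text(cid) returns the current source text for a chunk and
    embed(text) returns its embedding vector. Both are hypothetical helpers.
    """
    now = now or time.time()
    refreshed = 0
    for cid, chunk in store.items():
        if now - chunk.get("updated_at", 0) > MAX_AGE_SECONDS:
            chunk["text"] = fetch_latest_text(cid)
            chunk["embedding"] = embed(chunk["text"])
            chunk["updated_at"] = now
            refreshed += 1
    return refreshed
```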