You have stored vectors. You have a query. Semantic search is one loop: embed the query, compute cosine against every stored vector, pick the top-k.
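The loop below calls a `cosine` helper. If it isn't already in scope from the earlier embedding lesson, a minimal numpy version (an assumption here, not part of this lesson's code) looks like this:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```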
```python
def top_k(query, store, embed_fn, k=2):
    qv = embed_fn(query)
    scored = []
    for cid, entry in store.items():
        sim = cosine(qv, entry["vector"])
        scored.append((cid, sim))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

That's a linear scan over the whole store?
Yes. For 5 chunks, fast. For 5000 chunks, still milliseconds. For 5 million, you graduate to a vector database with an ANN index. The retrieval interface is identical — give a query, get top-k. Only the implementation changes.
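To make "only the implementation changes" concrete, here is a sketch of the same give-a-query-get-top-k contract on an approximate index. FAISS's HNSW index and the parameters shown are my assumptions, not part of this lesson:

```python
import numpy as np
import faiss  # assumption: faiss-cpu is installed

def build_index(store, dim):
    """Pack stored vectors into an HNSW approximate-nearest-neighbor index."""
    ids = list(store.keys())
    mat = np.array([store[cid]["vector"] for cid in ids], dtype="float32")
    faiss.normalize_L2(mat)  # on unit vectors, inner product == cosine
    index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)
    index.add(mat)
    return index, ids

def top_k_ann(query, index, ids, embed_fn, k=3):
    """Same contract as top_k: query in, (chunk_id, score) pairs out."""
    qv = np.array([embed_fn(query)], dtype="float32")
    faiss.normalize_L2(qv)
    sims, rows = index.search(qv, k)
    return [(ids[r], float(s)) for r, s in zip(rows[0], sims[0])]
```

The caller never needs to know which implementation is behind the function.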
And k? Why 2?
Tradeoff. Higher k = more context for the LLM, more tokens, more chance the real answer is in there. Lower k = tighter, cheaper, may miss the relevant chunk. Default 3–5 in production; tune against your eval suite.
```python
def top_k(query, store, embed_fn, k=3):
    qv = embed_fn(query)
    scored = [
        (cid, cosine(qv, entry["vector"]))
        for cid, entry in store.items()
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:k]
```

Line by line: embed the query once, score every stored chunk against it in one comprehension, sort by similarity descending, and return the k best (chunk_id, score) pairs.
Most of the time the caller wants the chunk text, not just the id, to feed into a prompt:
```python
top_chunks = top_k(query, store, embed, k=3)
contexts = [store[cid]["text"] for cid, _ in top_chunks]
```

When nothing matches well, top-k still returns the least-bad chunks. Better to detect that and refuse:
```python
top = top_k(query, store, embed, k=3)
if top[0][1] < 0.3:  # tune this against your data
    return "I don't have a relevant answer"
```

This is the foundation of RAG failure-mode handling (lesson L9 this week).
Before semantic search, you matched strings (substring, keyword). After: you match meaning. "How do I reset my password?" finds a chunk about "forgotten credentials" even though no word overlaps.
This is the unlock that makes RAG work.
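To see it concretely, a sketch using sentence-transformers with `cosine` and `top_k` from above. The model choice and store contents are my assumptions:

```python
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embed = model.encode

store = {
    "c1": {"text": "Forgotten credentials can be restored via the account recovery page."},
    "c2": {"text": "Our refund policy covers purchases within 30 days."},
}
for entry in store.values():
    entry["vector"] = embed(entry["text"])

query = "How do I reset my password?"
print("password" in store["c1"]["text"].lower())  # False: zero keyword overlap
print(top_k(query, store, embed, k=1))            # c1 ranks first on meaning anyway
```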