Embedded chunks need to live somewhere. The simplest store is a Python dict — {chunk_id: vector} — backed by a JSON file for persistence. Production switches to a vector database (pgvector, Pinecone, Weaviate); the interface is the same.
Round-trip pattern:
import json

store = {}
for i, chunk in enumerate(chunks):  # chunks and embed() come from earlier in the lesson
    store[f"chunk-{i}"] = embed(chunk)

with open("vectors.json", "w") as f:
    json.dump(store, f)

# Later, in another process
with open("vectors.json") as f:
    reloaded = json.load(f)

assert reloaded == store  # exact equality round-trips through JSON

That's it? No special database?
For a 5-chunk corpus, a dict is faster than a database call. The threshold to switch is roughly 10,000 vectors — past that, linear scan over the dict gets slow. Below that, JSON-on-disk is fine.
And the chunk text itself?
Store it alongside. Common pattern: {chunk_id: {"text": ..., "vector": [...]}}. When you retrieve top-k by cosine, you also need the text to feed back into the LLM.
import json

def build_store(chunks, embed_fn):
    # One entry per chunk: the text plus its embedding, keyed by a stable id.
    return {f"chunk-{i}": {"text": c, "vector": embed_fn(c)} for i, c in enumerate(chunks)}

def save_store(store, path):
    with open(path, "w") as f:
        json.dump(store, f)

def load_store(path):
    with open(path) as f:
        return json.load(f)

Three functions, three responsibilities. Build, persist, restore.
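There's a fourth responsibility the snippets above leave implicit: search. Here is a minimal sketch of the top-k-by-cosine lookup, assuming plain-list vectors; cosine and top_k are illustrative helper names, not part of the lesson's code:

import math

def cosine(a, b):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(store, query_vector, k=3):
    # Linear scan: score every chunk against the query, sort, keep the best k.
    scored = [
        (cosine(entry["vector"], query_vector), chunk_id, entry["text"])
        for chunk_id, entry in store.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]

This is the linear scan from earlier: every query scores every chunk, which is exactly what stops being fine past roughly 10,000 vectors.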
For querying-at-scale you'd graduate to SQLite-with-vectors or a vector DB. We stay with JSON because the concept is what matters — store the (id, text, vector) trio so semantic search has something to search over.
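If you're curious what that graduation might look like, here is a rough sketch of the naive SQLite variant, where the vector is simply JSON text in a column (a real deployment would use a vector extension; the table layout and save_store_sqlite are illustrative, not from the lesson):

import json
import sqlite3

def save_store_sqlite(store, path):
    # One row per chunk; the vector rides along as JSON-encoded text.
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, text TEXT, vector TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)",
        [(cid, e["text"], json.dumps(e["vector"])) for cid, e in store.items()],
    )
    con.commit()
    con.close()

Back to the JSON store: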
store = build_store(chunks, embed)
save_store(store, "vectors.json")

reloaded = load_store("vectors.json")
assert reloaded == store

If this assertion fails, your serialisation is silently corrupting data. Catch it once and you'll trust the rest of the pipeline. (Common cause: storing numpy arrays — json.dump chokes. Convert to plain lists first.)
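On that numpy point, a minimal sketch of the fix, assuming an embedding client that hands back numpy arrays (the values here are stand-ins):

import json
import numpy as np

vec = np.array([0.1, 0.2, 0.3])  # stand-in for an embedding returned as an ndarray

# json.dump() raises TypeError on an ndarray; .tolist() converts it
# to plain Python floats, which serialise (and round-trip) cleanly.
entry = {"text": "example chunk", "vector": vec.tolist()}
print(json.dumps(entry))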
Reminder from earlier weeks: the sandbox filesystem is wiped between runs. JSON-to-disk works within a single run; cross-run persistence needs a real database (or Sheets, or Notion). For lessons we keep persistence in-process.