Embedding a 10,000-word document into a single vector loses almost everything: all the meaning gets squashed into one direction. Chunking splits the text first, embeds each chunk separately, and lets you retrieve just the part that's relevant.
Two common strategies:
```python
# 1. Sentence-based — preserves natural boundaries
import re
sentences = re.split(r'(?<=[.!?])\s+', text)

# 2. Fixed-window — predictable size, may split mid-sentence
def window_chunk(text, size=200, overlap=50):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks
```

When do I want which?
Sentence-based for prose where sentence boundaries are meaningful. Fixed-window with overlap for unstructured logs / transcripts where you can't trust the punctuation. Most production RAG pipelines use fixed-window with ~10–20% overlap so a sentence cut in half still appears whole in one chunk.
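A quick sketch makes the contrast concrete: the same sentence regex that handles prose cleanly collapses to a single useless "sentence" on punctuation-free log text, while fixed windows stay predictable either way (the log line below is invented for illustration):

```python
import re

def window_chunk(text, size=200, overlap=50):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

prose = "First sentence. Second sentence! Third?"
log = "2024-01-01T00:00:01 GET /api/items 200 12ms " * 20  # no [.!?] anywhere

# Sentence split works on prose, degenerates on logs
print(len(re.split(r'(?<=[.!?])\s+', prose)))  # 3
print(len(re.split(r'(?<=[.!?])\s+', log)))    # 1 — the whole log is one "sentence"

# Fixed windows don't care about punctuation
print(len(window_chunk(log, size=200, overlap=50)))  # 6
```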
And the chunk size?
Tradeoff. Smaller chunks = more precise retrieval, but might lack surrounding context. Larger chunks = more context, but more tokens shipped to the model and noisier matches. 200–500 chars is a common starting point. Tune against your eval suite (week 2).
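The token-cost side of the tradeoff is easy to estimate. Since the window advances by `size - overlap` per chunk, the chunk count is roughly `ceil(len(text) / stride)`. A back-of-envelope helper (`num_chunks` is our own illustration, not a library function; it assumes ~6 chars per word):

```python
import math

def num_chunks(n_chars, size, overlap):
    """Chunk count for the fixed-window splitter: one chunk per stride."""
    stride = size - overlap
    return math.ceil(n_chars / stride)

doc = 60_000  # ~10,000 words at ~6 chars/word
for size in (200, 500, 800):
    print(size, num_chunks(doc, size, size // 5))  # ~20% overlap
```

This prints 375, 150, and 94 chunks respectively: quadrupling the chunk size cuts the number of vectors to store and search by ~4x, at the price of coarser matches.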
Production RAG ingests source text by chunking it into 200–800 char pieces, embedding each, and storing the (chunk, vector) pair.
```python
import re
sentences = re.split(r'(?<=[.!?])\s+', text)
sentences = [s.strip() for s in sentences if s.strip()]
```

Boundaries align with meaning. Each chunk = one sentence. Predictable for prose, fragile for messy text.
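"Fragile for messy text" shows up even in clean prose: the regex treats every period followed by whitespace as a boundary, so abbreviations split wrongly. A quick check (wrapping the two lines above in a helper):

```python
import re

def sentence_chunks(text):
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]

print(sentence_chunks("It works. Mostly!"))
# ['It works.', 'Mostly!']

# Fragile case: the period after "Dr" is treated as a sentence end
print(sentence_chunks("Dr. Smith arrived."))
# ['Dr.', 'Smith arrived.']
```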
```python
def window_chunk(text, size=200, overlap=50):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks
```

Every chunk is exactly `size` chars (the last one may be shorter). Overlap means a sentence cut at position 200 reappears whole in the next chunk, which starts at position 150. The default in production pipelines.
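That overlap guarantee can be verified directly. Here a sentence straddles the 200-char boundary of the first chunk, so it appears truncated there but whole in the next chunk (filler characters stand in for surrounding text):

```python
def window_chunk(text, size=200, overlap=50):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

text = "x" * 190 + "short sentence here. " + "y" * 200
chunks = window_chunk(text)

# The sentence is cut by the 200-char boundary of chunk 0...
assert "short sentence here." not in chunks[0]
# ...but appears whole in chunk 1, which starts at position 150
assert "short sentence here." in chunks[1]
# Every chunk is exactly `size` chars except possibly the last
assert len(chunks[0]) == 200 and len(chunks[-1]) < 200
```

Note the guarantee only holds for sentences shorter than the overlap; a 60-char sentence split by a 50-char overlap can still be truncated in both chunks.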
| Strategy | Precision | Context | Best for |
|---|---|---|---|
| Sentence | High | Low | Prose with clean punctuation |
| Fixed-window | Medium | Medium | Logs, transcripts, mixed text |
| Recursive (paragraph → sentence → token) | High | High | Structured docs, advanced |
We stay with sentence + fixed-window in this lesson. Recursive chunkers (LangChain's RecursiveCharacterTextSplitter) are the production-grade extension — same idea, more configurable.
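To make the recursive idea concrete without pulling in LangChain, here is a minimal sketch (our own toy, not LangChain's API): try the coarsest boundary first (paragraphs), fall back to sentences, and window-chunk anything that still won't fit.

```python
import re

def window_chunk(text, size=200, overlap=50):
    chunks = []
    i = 0
    while i < len(text):
        chunks.append(text[i:i+size])
        i += size - overlap
    return chunks

def recursive_chunk(text, max_len=200):
    """Sketch: split on the coarsest boundary that yields small-enough pieces."""
    if len(text) <= max_len:
        return [text]
    for pattern in ("\n\n", r"(?<=[.!?])\s+"):  # paragraphs first, then sentences
        parts = re.split(pattern, text)
        if len(parts) > 1:
            out = []
            for p in parts:
                out.extend(recursive_chunk(p, max_len))
            return out
    # No natural boundary left — fall back to fixed windows with ~20% overlap
    return window_chunk(text, size=max_len, overlap=max_len // 5)
```

The production versions add configurable separator lists and token-based length functions, but the control flow is the same descent from structure to brute force.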