Build AI Apps with Python: Text Splitting — Break Documents into Chunks | Episode 13
Video: Build AI Apps with Python: Text Splitting — Break Documents into Chunks | Episode 13 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode13
Twenty lines of Python, two parameters (chunk_size and overlap), and you have a working text splitter.
In Episode 12 we crammed an entire fictional handbook into the prompt and let Claude answer questions about it. That worked because the handbook was tiny. Real documents — a 200-page policy manual, a year of meeting notes, a wiki — don't fit. Claude has a large context window, but using it inefficiently is slow and expensive, and most of the document is irrelevant to any single question anyway.
The first preparatory step in any real RAG pipeline is chunking: breaking long documents into smaller pieces. Each piece becomes a candidate for retrieval. When a question comes in, the system finds the few chunks most likely to contain the answer and sends only those to Claude.
This episode is about how to split text. The technique is shockingly simple — a few lines of Python — but the parameters you choose materially affect the quality of the whole RAG system downstream.
What we're building
A split_text(text, chunk_size, overlap) function that:
- Walks through a long string in fixed-size windows.
- Captures each window as a chunk.
- Advances by chunk_size - overlap so consecutive chunks share some characters.
- Returns a list of non-empty trimmed chunks.
We'll feed it a longer Acme Corp handbook (five chapters this time) and inspect the chunks at three different sizes.
Why fixed-size windows are good enough
There are dozens of ways to split text. By sentences. By paragraphs. By markdown headings. By structural parsing. Some are better than fixed-size for specific document types, but fixed-size is the right place to start because:
- It always works. Any string can be cut at fixed offsets. No parser to fail on weird input.
- Chunks are predictable in size. Important when each chunk has to fit in an embedding model's input limit.
- The implementation is trivial. No dependencies, no edge cases that take an afternoon to debug.
You can layer fancier strategies on top later — split on paragraphs first, then size-cap each piece. But for any new RAG project, start with fixed-size and only add complexity when it's clearly needed.
The splitter
def split_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap
    return [c for c in chunks if c]
Eight lines. Walk the string with a sliding window. Each iteration grabs a window of chunk_size characters and appends it to the result list. The start pointer advances by chunk_size - overlap, so each new chunk overlaps with the previous one by overlap characters.
The strip() call and the final filter for empty strings handle the edge case where a window contains only whitespace and would otherwise produce an empty chunk. Defensive but cheap.
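One thing the loop above does not guard against: if overlap is equal to or larger than chunk_size, start never moves forward and the loop never ends. Below is a minimal sketch of a guarded variant. The name split_text_checked and the choice to raise ValueError are assumptions for illustration, not part of the episode's code.

```python
def split_text_checked(text, chunk_size=500, overlap=50):
    """Same sliding window as split_text, plus parameter validation.

    Hypothetical helper: the name and the ValueError behaviour are
    assumptions, not taken from the episode's repository.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # With overlap >= chunk_size the start index would never advance.
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size].strip()
        if chunk:  # drop windows that held only whitespace
            chunks.append(chunk)
    return chunks


print(split_text_checked("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```

For valid parameters this visits the same windows as the original while loop; it only adds the early failure for bad input.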
That's the splitter. Everything else in this episode is what numbers to pick and why.
What chunk_size controls
chunk_size is the maximum number of characters per chunk (Python string slicing counts characters, not bytes). Pick it based on three constraints:
The embedding model's input limit. Sentence-Transformers models (the kind we'll use in Episode 14) typically take 256 to 512 tokens — roughly 1,000 to 2,000 characters. Chunks larger than that get truncated. So chunk_size should fit comfortably under the limit.
Retrieval precision. Smaller chunks mean each chunk is about one specific thing. A 100-character chunk might cover one sentence; a 2,000-character chunk might cover a whole policy section. Smaller chunks retrieve more precisely (you find the sentence about leave policy) but may miss context (you don't see the surrounding paragraph).
Generator context. When you retrieve N chunks and concatenate them, the total has to fit in Claude's prompt budget. If chunks are small, you can send more of them. If chunks are large, you can send fewer.
For most use cases, 300 to 800 characters is a good range. Today's example uses 500.
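The first and third constraints come down to arithmetic, which can be sketched as below. Everything numeric here is an assumption: 4 characters per token is only the rough ratio implied by the figures above (real tokenizers vary), and the 256-token limit and 2,000-token budget are placeholder values.

```python
CHARS_PER_TOKEN = 4  # rough heuristic for English text; an assumption


def estimated_tokens(num_chars):
    """Estimate a token count from a character count (rounded up)."""
    return -(-num_chars // CHARS_PER_TOKEN)


def fits_embedding_limit(chunk_size, max_tokens):
    """True if a full-size chunk is estimated to fit the model's input limit."""
    return estimated_tokens(chunk_size) <= max_tokens


def chunks_within_budget(chunk_size, budget_tokens):
    """How many full-size chunks fit in a given prompt budget."""
    return budget_tokens // estimated_tokens(chunk_size)


print(estimated_tokens(500))            # 125
print(fits_embedding_limit(500, 256))   # True
print(fits_embedding_limit(2000, 256))  # False
print(chunks_within_budget(500, 2000))  # 16
```

For anything close to the limit, count tokens with the embedding model's own tokenizer instead of estimating.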
What overlap does
start = end - overlap
overlap is the number of characters each chunk shares with its predecessor. This matters because important sentences sometimes straddle chunk boundaries. Without overlap, a sentence beginning in chunk 3 and ending in chunk 4 is split, and neither chunk alone contains the complete sentence.
With overlap, the boundary content appears in both chunks. A sentence that straddles the boundary and is no longer than the overlap is intact in at least one of them; longer sentences are still cut, but each chunk keeps more of the surrounding context.
How much overlap? Usually 5–20% of chunk_size. Today's example uses 50 characters out of 500 — a 10% overlap. Enough to catch most sentence boundaries; small enough that we're not duplicating most of our document.
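To see the shared region concretely, here is a small check on a synthetic string (made up for this example, not the handbook). It compares raw windows without strip(), so the tail of one window can be matched exactly against the head of the next.

```python
# Synthetic 1,200-character string of digits; stands in for a real document.
text = "".join(str(i % 10) for i in range(1200))
chunk_size, overlap = 500, 50
step = chunk_size - overlap

# Raw windows, without strip(), so the shared region is easy to compare.
windows = [text[s:s + chunk_size] for s in range(0, len(text), step)]

for prev, curr in zip(windows, windows[1:]):
    # The last `overlap` characters of one window open the next one.
    assert prev[-overlap:] == curr[:overlap]

print(len(windows))                   # 3
print(f"{overlap / chunk_size:.0%}")  # 10%
```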
Running it
Run the script; from inside Vim, :!python % runs the current file. It splits the handbook and prints each chunk:
Document length: 2031 characters
Chunk size: 500, Overlap: 50
Number of chunks: 5
==================================================
Chunk 1 (497 chars):
----------------------------------------
Acme Corp Employee Handbook - 2026 Edition
Chapter 1: Remote Work Policy
All employees may work remotely up to 3 days per week. Remote work requires manager approval via the WorkFlex portal...
Chunk 2 (498 chars):
----------------------------------------
e in office on Tuesday and Thursday. International remote work requires HR approval...
Chunk 3 (497 chars):
...
Chunk 4 (498 chars):
...
Chunk 5 (291 chars):
...
The first chunk starts at index 0. The second starts at index 450 (500 - 50). The last chunk is shorter because we ran out of document.
Notice chunk 2 starts mid-word: "e in office on Tuesday...". That's the cost of fixed-size splitting: you cut wherever the boundary lands, including inside words. The overlap rescues the situation: the 50 characters that open chunk 2 are also the closing characters of chunk 1's window, so the complete "must be in office" phrase is intact in chunk 1 and retrieval can still find it.
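The start positions follow directly from the step size. A short sketch that lists them for the run above (the helper name window_starts is hypothetical):

```python
def window_starts(doc_length, chunk_size, overlap):
    """Start index of every window the sliding loop visits."""
    step = chunk_size - overlap
    return list(range(0, doc_length, step))


print(window_starts(2031, 500, 50))
# [0, 450, 900, 1350, 1800]
```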
Sweeping the parameter
for size in [200, 500, 1000]:
    chunks = split_text(handbook, chunk_size=size, overlap=50)
    print(f"Size {size}: {len(chunks)} chunks")
Output:
Size 200: 13 chunks
Size 500: 5 chunks
Size 1000: 3 chunks
This is the core trade-off in chunking. Smaller chunk_size → more chunks → more precise retrieval but more index entries to manage and search through. Larger chunk_size → fewer chunks → coarser retrieval but more context per chunk.
There's no universally right answer. For policy documents you're answering specific questions about — "what's the leave allowance?" — smaller chunks (200–400) often work better. For documents where you need broader context — "summarise the company's approach to remote work" — larger chunks (800–1,500) work better.
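The relationship between chunk_size and chunk count is simple arithmetic: the loop visits one window every chunk_size - overlap characters, so the number of windows is the document length divided by that step, rounded up. Here is a sketch using a hypothetical 10,000-character document. The count after filtering can be lower if a window holds only whitespace.

```python
import math


def expected_windows(doc_length, chunk_size, overlap):
    """Upper bound on the chunk count: one window per step, rounded up."""
    step = chunk_size - overlap
    return math.ceil(doc_length / step)


for size in [200, 500, 1000]:
    print(size, expected_windows(10_000, size, 50))
# 200 67
# 500 23
# 1000 11
```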
Build, evaluate, adjust. We'll come back to evaluation in Episode 22.
Smarter splitters (for later)
Once you have the basics, a few enhancements that real RAG systems use:
Split on paragraph or sentence boundaries. Walk the document, accumulate paragraphs into a buffer, flush the buffer when it exceeds chunk_size. Avoids cutting mid-word.
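A minimal sketch of the paragraph-accumulating idea, assuming paragraphs are separated by blank lines. It is one possible variation: it flushes the buffer just before adding a paragraph would push it past chunk_size, and it keeps an oversized paragraph whole rather than size-capping it. The function name is hypothetical.

```python
def split_by_paragraphs(text, chunk_size=500):
    """Pack whole paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole here; a fuller
    version would size-cap it with the fixed-size splitter.
    """
    chunks = []
    buffer = ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{buffer}\n\n{para}" if buffer else para
        if len(candidate) > chunk_size and buffer:
            chunks.append(buffer)  # flush before the buffer overflows
            buffer = para
        else:
            buffer = candidate
    if buffer:
        chunks.append(buffer)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
print(split_by_paragraphs(sample, chunk_size=40))
# ['First paragraph.\n\nSecond paragraph.', 'Third paragraph.']
```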
Recursive splitting. Try big delimiters first (\n\n for paragraphs), fall back to smaller ones (\n, then ., then space). Keeps semantic boundaries intact.
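Here is a rough sketch of the recursive idea, written from scratch rather than copied from any library. It splits on the largest separator, packs neighbouring pieces back together while they fit, and recurses with the next separator only for pieces that are still too long. The separator list and the function name are assumptions. One known simplification: the separator at a chunk boundary is dropped, so a chunk cut at ". " loses its trailing period.

```python
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split on big separators first, falling back to smaller ones."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: hard cut at fixed offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # piece still fits alongside its neighbours
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # Piece is too long on its own: retry with a smaller separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return [c.strip() for c in chunks if c.strip()]


sample = (
    "Para one is short.\n\n"
    "Para two has a first sentence. It also has a second sentence that runs on.\n\n"
    "End."
)
pieces = recursive_split(sample, chunk_size=40)
for piece in pieces:
    print(len(piece), repr(piece))
```

Every piece comes out at or under chunk_size, and short paragraphs such as the first and last survive untouched.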
Document-structure-aware splitting. For markdown, split on headings. For code, split on functions. For PDFs, split on pages or sections. Specialised, but worth it for important document types.
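For the markdown case, a heading-based split can be sketched with a regular expression. This assumes ATX-style headings (one to six # characters followed by a space) and Python 3.7 or later, where re.split accepts patterns that match an empty string. It does not try to skip # lines inside code fences, and the sample text is made up.

```python
import re


def split_on_headings(markdown_text):
    """Split a markdown string so that each chunk starts at a heading line."""
    # Zero-width split just before any line that begins with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]


doc = "# Handbook\nIntro text.\n\n## Section A\nDetails for A.\n\n## Section B\nDetails for B."
for section in split_on_headings(doc):
    print(repr(section))
# '# Handbook\nIntro text.'
# '## Section A\nDetails for A.'
# '## Section B\nDetails for B.'
```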
Token-based instead of character-based. Instead of text[start:end], count tokens (words or subwords). More accurate for embedding-model limits, but requires a tokenizer.
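As a dependency-free approximation, the same sliding window can run over whitespace-separated words instead of characters. Words are only a stand-in for real tokens here; matching an embedding model's limit exactly needs that model's tokenizer. The function and parameter names are hypothetical.

```python
def split_by_words(text, chunk_words=100, overlap_words=10):
    """Sliding window over words rather than characters."""
    step = chunk_words - overlap_words
    if step <= 0:
        raise ValueError("overlap_words must be smaller than chunk_words")

    words = text.split()  # note: collapses runs of whitespace
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(25))
result = split_by_words(sample, chunk_words=10, overlap_words=2)
print(len(result))  # 3
print(result[-1])   # w16 w17 w18 w19 w20 w21 w22 w23 w24
```

Unlike the character version, this sketch stops as soon as a window reaches the end, so it never emits a final chunk made only of overlap.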
Libraries like LangChain and LlamaIndex have all of these built in. For learning, our 8-line version is fine.
Common mistakes
Zero overlap. Sentences split at boundaries become incoherent in both chunks. Always keep some overlap.
Too large. Chunks bigger than the embedding model can encode get truncated silently. Stay under the model's limit.
Too small. A 50-character chunk barely contains a sentence. Retrieval becomes noisy and the model gets fragments rather than context.
Splitting binary or pre-formatted content. A code block or a table chopped at byte 487 is unparseable. For documents with structure, use a structure-aware splitter.
Ignoring metadata. Each chunk should know which document it came from, what page, what section. We'll add that in Episode 17 — but plan for it now.
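One way to plan for metadata now is to return small records instead of bare strings. The sketch below is an assumption about what such a record could hold; the Chunk fields and the split_with_metadata name are hypothetical, and the design used in the later episode may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # where the chunk came from, e.g. a filename
    index: int   # position of the chunk within its document
    start: int   # character offset of the window in the original text


def split_with_metadata(text, source, chunk_size=500, overlap=50):
    """Fixed-size splitting that remembers where each chunk came from."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    records = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if piece:
            records.append(Chunk(piece, source, len(records), start))
    return records


records = split_with_metadata("x" * 1200, source="handbook.txt")
print([(r.index, r.start, len(r.text)) for r in records])
# [(0, 0, 500), (1, 450, 500), (2, 900, 300)]
```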
What's next
Next episode: embeddings. Splitting gave us chunks. Now we need a way to compare them — to ask "which chunk is most relevant to this question?". Embeddings turn each chunk (and each question) into a list of numbers, and similarity becomes math: vectors close together are related; vectors far apart are not.
Recap
What we did today. Wrote an 8-line text splitter that walks a string in fixed-size windows with overlap. Applied it to a 5-chapter Acme Corp handbook and inspected the resulting chunks. Studied the trade-off between chunk count and chunk size. Acknowledged that fancier splitters exist but that the fixed-size version handles 80% of cases.
You haven't built a search system. You've prepared the input that the next two episodes will turn into a search system.
Next episode: embeddings. See you in the next one.