Build AI Apps with Python: Text Splitting — Break Documents into Chunks | Episode 13
Video taught by Celeste AI - AI Coding Coach
When working with large documents in AI applications, you need to split the text into manageable chunks so that each one fits within prompt size limits and retrieval can return focused passages. This example builds a text splitter in Python that divides a company handbook into overlapping, fixed-size character chunks of configurable size. The splitter cuts at character positions rather than at sentence boundaries; the overlap repeats the end of each chunk at the start of the next, so context near a boundary is carried into both chunks.
Code
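def split_text(text, chunk_size=500, overlap=50):
    """
    Split text into fixed-size character chunks that overlap.

    Args:
        text (str): The input document string.
        chunk_size (int): Maximum number of characters per chunk.
        overlap (int): Number of characters shared by consecutive chunks.

    Returns:
        List[str]: List of text chunks.

    Raises:
        ValueError: If overlap is negative or not smaller than chunk_size.
    """
    # A step of zero or less would never advance, so the loop would never end
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    start = 0
    text_length = len(text)
    while start < text_length:
        end = start + chunk_size
        chunks.append(text[start:end])
        # Stop once a chunk reaches the end, so the tail is not repeated as an extra chunk
        if end >= text_length:
            break
        # Move start forward by chunk_size minus overlap to create sliding window effect
        start += chunk_size - overlap
    return chunks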
# Example usage:
if __name__ == "__main__":
    with open("company_handbook.txt", "r", encoding="utf-8") as f:
        handbook = f.read()

    # Split into chunks of 500 characters with 50 characters overlap
    chunks = split_text(handbook, chunk_size=500, overlap=50)

    # Display chunks with their number and length
    for i, chunk in enumerate(chunks, 1):
        print(f"Chunk {i} (length {len(chunk)}):")
        print(chunk)
        print("-" * 40)
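To see the overlap without a handbook file, it can help to run the same sliding window on a short string. The sketch below is a compact variant of the splitter, not the episode's exact code: the name sliding_chunks and the alphabet input are assumptions chosen for illustration, and this variant stops as soon as a chunk reaches the end of the text.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Compact sliding-window splitter (hypothetical name, for illustration)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Illustrative input: 26 characters, windows of 10 with 3 shared characters.
demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']

# Each chunk starts with the last 3 characters of the previous one.
for previous, current in zip(demo_chunks, demo_chunks[1:]):
    assert previous[-3:] == current[:3]
```

With a window of 10 and an overlap of 3, each start position is 7 characters after the previous one, which is why "hij", "opq", and "vwx" each appear in two chunks.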
Key Points
- Large documents must be split into smaller chunks to fit within AI model input limits.
- Overlap does not stop a sentence from being cut mid-way; it repeats the last characters of each chunk at the start of the next, so context around a boundary appears in both chunks.
- The sliding window pattern uses a while loop advancing by chunk_size minus overlap each iteration, so overlap must be smaller than chunk_size for the loop to make progress.
- Configurable chunk sizes allow balancing between precision (smaller chunks) and context (larger chunks).
- Displaying chunk numbers and lengths helps verify the splitting behavior and adjust parameters.
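When adjusting parameters, it can be useful to estimate how many chunks a given setting produces before splitting a real document. The following is a rough sketch under stated assumptions: estimate_chunk_count is a hypothetical helper, the 10,000-character length is an arbitrary illustrative figure, and the formula assumes the window stops as soon as a chunk reaches the end of the text. A loop that keeps going until the start position passes the end can emit one extra short chunk.

```python
def estimate_chunk_count(text_length, chunk_size=500, overlap=50):
    """Estimate how many chunks a sliding window yields (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    remaining = text_length - chunk_size
    # One full first chunk, plus one chunk per step needed to cover the rest
    # (integer ceiling division avoids floating-point rounding).
    return 1 + (remaining + step - 1) // step


# Compare a few chunk sizes for an illustrative 10,000-character document.
for size in (250, 500, 1000):
    print(size, estimate_chunk_count(10_000, chunk_size=size, overlap=50))
# 250 50
# 500 23
# 1000 11
```

Smaller chunks give more, narrower pieces to search over, while larger chunks give fewer pieces that each carry more surrounding context.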