Build AI Apps with Python: Multi-Document RAG — Metadata Filtering & Citations | Episode 17
Taught by Celeste AI - AI Coding Coach
This tutorial demonstrates how to build a Retrieval-Augmented Generation (RAG) system in Python that handles multiple documents by tagging each text chunk with metadata indicating its source. Using ChromaDB for vector storage and Anthropic's Claude for generation, you can filter queries by document source and produce answers that cite exactly where each fact originated.
Code
import chromadb
from anthropic import Anthropic

# Initialize a persistent ChromaDB client and collection
client = chromadb.PersistentClient(path="./chromadb")
collection = client.get_or_create_collection(name="company_knowledge_base")

# Example documents split into chunks, keyed by source document name
documents = {
    "employee_handbook": [
        "Our company values integrity and teamwork.",
        "Employees must adhere to the code of conduct.",
        "Work hours are from 9am to 5pm, Monday to Friday."
    ],
    "faq": [
        "How to reset your password? Use the self-service portal.",
        "Who to contact for IT support? Email it-support@company.com.",
        "What is the vacation policy? 15 days paid leave per year."
    ],
    "company_policy": [
        "All employees must complete annual compliance training.",
        "Remote work is allowed with manager approval.",
        "Confidential information must not be shared externally."
    ]
}

# Prepare data for insertion: texts, metadata, and a unique id per chunk
texts, metadatas, ids = [], [], []
for source, chunks in documents.items():
    for i, chunk in enumerate(chunks):
        texts.append(chunk)
        metadatas.append({"source": source})
        ids.append(f"{source}-{i}")  # ChromaDB requires a unique id for every chunk

# Add chunks with metadata to the collection
collection.add(documents=texts, metadatas=metadatas, ids=ids)

# Initialize the Anthropic client once (Messages API)
anthropic_client = Anthropic(api_key="your-anthropic-api-key")

# Function to query with optional metadata filtering
def query_knowledge_base(question, source_filter=None):
    where_clause = {"source": source_filter} if source_filter else None
    results = collection.query(
        query_texts=[question],
        n_results=3,
        where=where_clause
    )
    # Build context with source citations
    context = ""
    for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
        context += f"Source ({meta['source']}): {doc}\n"
    # Prompt Claude with the source-labeled context and the question
    prompt = (
        "Use the following company documents to answer the question. "
        f"Cite sources for each fact.\n\n{context}\nQuestion: {question}"
    )
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip()
# Example usage:
answer_all = query_knowledge_base("What is the vacation policy?")
answer_faq = query_knowledge_base("How do I reset my password?", source_filter="faq")
print("Answer (all documents):", answer_all)
print("Answer (FAQ only):", answer_faq)
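The `source_filter` parameter above matches a single source exactly. ChromaDB `where` filters also support operators such as `$in`, so a small helper (`build_where` here is a hypothetical name, not part of the ChromaDB API) can cover the three common cases — no filter, one source, or several sources:

```python
def build_where(sources=None):
    """Build a ChromaDB `where` filter from None, one source, or a list of sources."""
    if not sources:
        return None                                 # no filter: search every document
    if isinstance(sources, str):
        return {"source": sources}                  # single source: exact match
    return {"source": {"$in": list(sources)}}       # several sources: $in operator

# Usage: pass the result straight to collection.query(..., where=build_where(...))
build_where()                           # -> None
build_where("faq")                      # -> {"source": "faq"}
build_where(["faq", "company_policy"])  # -> {"source": {"$in": ["faq", "company_policy"]}}
```

Keeping the filter construction in one place makes it easy to extend later, for example combining source and date-range conditions with `$and`.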
Key Points
- Tag each text chunk with metadata indicating its source document to enable precise filtering.
- Use ChromaDB's `where` clause to filter vector search results by metadata, such as document name.
- Include source-labeled context in prompts to generate answers with explicit citations for auditability.
- Multi-document RAG allows querying across all sources or restricting to a single document for focused answers.
- This pattern scales to production systems by organizing knowledge bases with metadata and combining retrieval with generation.
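To back up the auditability point above, a lightweight post-check can confirm that every source the model cites was actually among the retrieved documents. This sketch assumes the answer cites sources in the same `(source_name)` format used in the prompt context; `extract_citations` is a hypothetical helper, and the regex is deliberately simple:

```python
import re

def extract_citations(answer, known_sources):
    """Split the source names cited in an answer into (valid, unknown) sets."""
    cited = set(re.findall(r"\(([a-z_]+)\)", answer))   # matches labels like (faq)
    return cited & known_sources, cited - known_sources

answer = "Vacation is 15 days per year (faq). Remote work needs approval (company_policy)."
valid, unknown = extract_citations(answer, {"faq", "company_policy", "employee_handbook"})
# valid -> {"faq", "company_policy"}; unknown -> set()
```

If `unknown` is non-empty, the model cited a source it was never given, which is a useful signal to reject or regenerate the answer.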