Part of Python AI Tutorial Series

Build AI Apps with Python: Multi-Document RAG — Metadata Filtering & Citations | Episode 17

Celest Kim

Video: Build AI Apps with Python: Multi-Document RAG — Metadata Filtering & Citations | Episode 17, taught by Celeste AI - AI Coding Coach


Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode17

Three sources. Metadata-tagged chunks. Filtered retrieval. Cited answers.

In Episode 16 our knowledge base was a single flat list of strings. Real knowledge bases never look like that. You have a handbook, an FAQ, a policy doc, an internal wiki, last quarter's all-hands transcript, the engineering style guide. When a user asks "how do I reset my password", you want the answer to come from the FAQ. When they ask "what's the security policy?", you want the policy document. When they ask "how many days off do I get?", you might pull from both the handbook and the policy document and combine them.

This episode adds three capabilities to the basic RAG pipeline:

  1. Metadata on every chunk: which document it came from.
  2. Optional source filtering: retrieve only from a given source.
  3. Citations: Claude tells you where each fact came from.

Once you have these, you have most of what production RAG systems offer.

What we're building

Three small "documents" — handbook.txt, faq.txt, policy.txt — each holding three short chunks. We'll add all of them to one ChromaDB collection with source metadata, then ask three questions:

  • "How many days off do employees get?" — cross-document; should pull from handbook and policy.
  • "How do I reset my password?" — should pull from FAQ; we'll explicitly filter to FAQ to demonstrate.
  • "What are the security requirements?" — should pull from policy.

Claude's answer cites the source for each fact.

Adding metadata

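# Assumes handbook, faq, and policy are lists of chunk strings, and
# collection is an existing ChromaDB collection.
all_docs = []
all_ids = []
all_metadata = []

# Tag every chunk with the file it came from; IDs carry the same prefix.
for i, text in enumerate(handbook):
    all_docs.append(text)
    all_ids.append(f"handbook_{i}")
    all_metadata.append({"source": "handbook.txt"})

for i, text in enumerate(faq):
    all_docs.append(text)
    all_ids.append(f"faq_{i}")
    all_metadata.append({"source": "faq.txt"})

for i, text in enumerate(policy):
    all_docs.append(text)
    all_ids.append(f"policy_{i}")
    all_metadata.append({"source": "policy.txt"})

# The three lists are parallel: item i of each describes the same chunk.
collection.add(
    documents=all_docs,
    ids=all_ids,
    metadatas=all_metadata,
)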

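Three parallel lists. ChromaDB's add() takes metadatas as a list of dicts, one dict per item, in the same order as documents and ids. The keys are up to you, though the values should be simple scalars such as strings, numbers, or booleans. We keep it simple with just source.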

The IDs are prefixed by document name to make them traceable. faq_2 is the third chunk of the FAQ. Useful when debugging — you can look at which exact item was retrieved.

In a larger system, you'd add more metadata: page_number, section, last_updated, chunk_index_in_doc. Each one gives you something to filter on or surface in citations.
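As a rough sketch of how richer metadata could be produced, the helper below builds the three parallel lists from a dict of sources and records each chunk's position as well. The name build_records, the chunk_index key, and the placeholder chunk text are illustrative assumptions, not part of the episode's code.

```python
def build_records(sources):
    """Return (documents, ids, metadatas) for a dict of {filename: [chunk, ...]}.

    Hypothetical helper: the metadata keys are our own choice.
    """
    documents, ids, metadatas = [], [], []
    for filename, chunks in sources.items():
        prefix = filename.rsplit(".", 1)[0]  # "faq.txt" -> "faq"
        for i, text in enumerate(chunks):
            documents.append(text)
            ids.append(f"{prefix}_{i}")
            metadatas.append({"source": filename, "chunk_index": i})
    return documents, ids, metadatas


# Placeholder chunks, invented for this example only.
sample = {
    "handbook.txt": ["placeholder chunk A", "placeholder chunk B"],
    "faq.txt": ["placeholder chunk C"],
}
docs, ids, metas = build_records(sample)
print(ids)
print(metas[1])
```

The three returned lists can be passed to collection.add() as documents, ids, and metadatas, in the same way as the hand-built lists.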

Filtered retrieval

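def ask(question, source_filter=None):
    # Base query: embed the question and fetch the 3 nearest chunks.
    query_args = {
        "query_texts": [question],
        "n_results": 3,
    }

    # Optionally restrict retrieval to a single source file.
    if source_filter:
        query_args["where"] = {"source": source_filter}
        print(f"  [Filter: {source_filter}]")

    results = collection.query(**query_args)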

The where parameter restricts retrieval to chunks whose metadata matches. {"source": "faq.txt"} means only return chunks tagged with this source. ChromaDB also supports more complex filters with operators ($eq, $ne, $in, $and, $or), but a single equality filter handles most cases.
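For the operator-based filters just mentioned, here is a minimal sketch of how the filter dicts might be assembled. The helper names source_in and all_of, and the status metadata key, are hypothetical; only the $in, $ne, and $and operators come from ChromaDB's filter syntax.

```python
def source_in(*filenames):
    """Match chunks whose source is any of the given files."""
    return {"source": {"$in": list(filenames)}}


def all_of(*conditions):
    """Require every condition to hold.

    We insist on two or more conditions here; a single condition
    needs no $and wrapper.
    """
    if len(conditions) < 2:
        raise ValueError("$and needs at least two conditions")
    return {"$and": list(conditions)}


# Handbook or policy only.
where_hr = source_in("handbook.txt", "policy.txt")

# Policy chunks that are not drafts ("status" is a hypothetical metadata key).
where_final_policy = all_of({"source": "policy.txt"}, {"status": {"$ne": "draft"}})

print(where_hr)
print(where_final_policy)
```

Either dict can be passed as the where argument of collection.query() in place of the single equality filter.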

Why filter? A few real reasons:

Routing. A question that's clearly an HR question shouldn't pull engineering docs. You can detect intent (rules, classifier, or another LLM call) and filter accordingly.

Permissions. If a user can only see public docs, filter to chunks tagged visibility: public so internal info doesn't leak.

Freshness. Filter to chunks updated in the last six months when stale answers would be misleading.

Topic. A user-selected source ("Search in Policy only") becomes a filter.

Without filtering, the retriever picks among everything. With it, you constrain the search space. Both have their place.
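The routing idea can be sketched with the simplest of the three options, a rule table. The ROUTES keywords and the route function are assumptions made up for illustration; a classifier or an LLM call could replace them without changing the interface.

```python
# Hypothetical keyword rules: source file -> words that suggest it.
ROUTES = {
    "faq.txt": ("password", "reset", "login"),
    "policy.txt": ("security", "encryption", "compliance"),
}


def route(question):
    """Return a source filename to filter on, or None to search everything."""
    q = question.lower()
    for source, keywords in ROUTES.items():
        if any(word in q for word in keywords):
            return source
    return None


questions = [
    "How do I reset my password?",
    "What are the security requirements?",
    "How many days off do employees get?",
]
for q in questions:
    print(q, "->", route(q))
```

The result plugs straight into the function above: ask(q, source_filter=route(q)). Returning None for unmatched questions keeps cross-document retrieval as the fallback.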

Citation-aware augmentation

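# results comes from collection.query() in ask(). Each field holds one
# list per query, so [0] selects the results for our single question.
docs = results["documents"][0]
sources = [meta["source"] for meta in results["metadatas"][0]]

# Prefix each chunk with its source so the model can attribute facts.
context_parts = []
for doc, src in zip(docs, sources):
    context_parts.append(f"[{src}] {doc}")
context = "\n\n".join(context_parts)

prompt = f"""Answer the question using ONLY the provided context.
Cite the source file for each fact in your answer.

Context:
{context}

Question: {question}"""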

We label each retrieved chunk with its source: [handbook.txt] All employees get 20 days.... Then we instruct Claude to cite the source file for each fact.

The labelling matters. Without [source] prefixes, Claude has no way to know which chunk came from which document — they're all just text in the prompt. With the prefix, the model can trace each fact back and write "(source: handbook.txt)" after a sentence that drew from it.

This is how production RAG produces citation-rich answers. Good citations — the kind users trust — emerge from a clean attribution pipeline: source lives on the chunk, source travels into the prompt, model is asked to cite, output names the source.

Watching it run

Q: How many days off do employees get?
  Retrieved 3 chunks from: handbook.txt, policy.txt
  A: Employees get 20 days of paid time off per year, with up to 5 days carrying over (source: handbook.txt). Expenses for time-off-related travel follow the standard expense policy (source: policy.txt).

Q: How do I reset my password?
  [Filter: faq.txt]
  Retrieved 3 chunks from: faq.txt
  A: To reset your password, go to the IT portal at helpdesk.acme.com and click Forgot Password. You will receive a reset link by email within 5 minutes (source: faq.txt).

Q: What are the security requirements?
  Retrieved 3 chunks from: policy.txt
  A: Company laptops must use full-disk encryption. Personal devices cannot access production systems. Security training is mandatory every 6 months (source: policy.txt).

Three different patterns of retrieval. Question 1 pulled across two sources because both had relevant content. Question 2 used the explicit filter to retrieve only from the FAQ. Question 3 implicitly pulled from policy because that's where the relevant chunks were.

The citations make the answers verifiable. A user who isn't sure whether to believe a response can check the cited document directly.

When to retrieve more vs. fewer chunks

n_results=3 is a defensible default. A few rules of thumb:

  • Single-document, well-scoped questions ("how do I reset my password?"): 1–2 chunks is enough.
  • Cross-document questions ("how do I get help with X?"): 3–5 chunks across sources.
  • Summarise / compare questions: 5–10 chunks, possibly with reranking.
  • Complex multi-step reasoning ("explain our overall remote work policy"): the whole document, retrieved as one block.

The right number is the smallest one that reliably contains the answer. More chunks aren't always better. Past a point, irrelevant chunks dilute the model's focus and make hallucination more likely.
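One way to approximate "the smallest number that contains the answer" is to retrieve a few extra chunks and drop those that are too far from the question. The sketch below assumes the query results include a distances list where lower means closer; the function name, the threshold, and the sample values are all made up, and a sensible cutoff depends on the embedding model and distance metric.

```python
def trim_by_distance(docs, distances, max_distance, min_keep=1):
    """Keep chunks at or under max_distance, but never fewer than min_keep.

    Assumes lower distance means more similar, and that both lists are
    already sorted from closest to farthest.
    """
    kept = [d for d, dist in zip(docs, distances) if dist <= max_distance]
    if len(kept) < min_keep:
        kept = list(docs[:min_keep])
    return kept


docs = ["chunk A", "chunk B", "chunk C", "chunk D"]
distances = [0.21, 0.34, 0.80, 0.95]  # made-up values

print(trim_by_distance(docs, distances, max_distance=0.5))
print(trim_by_distance(docs, distances, max_distance=0.1))
```

The min_keep floor means the model always receives at least the closest chunk, even when nothing clears the threshold.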

Reranking (a teaser)

In production, you often retrieve more candidates than you intend to send to the LLM, then rerank them with a more powerful (slower) model that scores each candidate against the question. You keep the top 3 by rerank score, even if they weren't the top 3 by vector similarity.

The two-pass design is far cheaper than running the rerank model over your whole corpus, and it usually produces better retrieval quality than vector similarity alone. Models like Cohere's Rerank, Voyage's rerank-lite, or open-source cross-encoders like bge-reranker-v2 are common choices.
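To make the two-pass shape concrete, here is a toy sketch. The word-overlap scorer is only a stand-in for a real rerank model and is far weaker than a cross-encoder; the function names and candidate passages are invented for the example.

```python
import re


def overlap_score(question, passage):
    """Toy relevance score: fraction of question words found in the passage.

    Stand-in only. A real reranker would be a cross-encoder or a hosted
    rerank model.
    """
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = set(re.findall(r"\w+", passage.lower()))
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(question, candidates, top_k=3, score=overlap_score):
    """Second pass: re-score first-pass candidates and keep the best top_k."""
    ranked = sorted(candidates, key=lambda c: score(question, c), reverse=True)
    return ranked[:top_k]


# Pretend these came back from a first-pass vector search with a larger n_results.
candidates = [
    "Expense reports are due monthly.",
    "Click Forgot Password to reset your password.",
    "Security training is mandatory.",
    "Passwords must be reset through the IT portal.",
]
print(rerank("How do I reset my password?", candidates, top_k=2))
```

Because score is a parameter, swapping in a real reranker later only means passing a different scoring function.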

We don't build it in this series, but if you find your RAG quality plateauing, reranking is often one of the highest-leverage improvements you can make to retrieval.

Common mistakes

Inconsistent metadata across chunks. If half your chunks have source and half have document_name, your filters break. Pick a schema and stick to it.

Filtering away the right answer. Aggressive filters can exclude relevant chunks. Test by running the same question filtered and unfiltered; if the unfiltered run surfaces relevant chunks that the filtered run misses, your filter is too tight.

Citations without grounding. Asking Claude to cite without giving it source labels in the context just makes it invent file names. The label has to be in the prompt.

Citing only the first chunk. Sometimes Claude cites one source even when synthesising across two. The fix is in the prompt: "Cite every source you draw from." — explicit and repeated.

Treating metadata as a free-form text dump. ChromaDB's filtering works on structured fields. Don't stuff a paragraph into metadata.notes and expect to filter on it.
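The invented-file-name problem can also be caught after generation. The sketch below assumes answers use the "(source: name)" format seen in the sample output; the check_citations helper and the example answer are hypothetical. It treats everything inside one pair of parentheses as a single name, so a combined citation listing two files would need extra parsing.

```python
import re

# Matches "(source: handbook.txt)" and captures the file name.
CITATION = re.compile(r"\(source:\s*([^)]+)\)")


def check_citations(answer, retrieved_sources):
    """Return (cited, unknown): sources named in the answer, and those not retrieved."""
    cited = {name.strip() for name in CITATION.findall(answer)}
    unknown = cited - set(retrieved_sources)
    return cited, unknown


# Made-up answer with one citation that was never in the context.
answer = (
    "Employees get 20 days of paid time off per year (source: handbook.txt). "
    "Remote work needs manager approval (source: remote_policy.txt)."
)
cited, unknown = check_citations(answer, ["handbook.txt", "policy.txt"])
print(sorted(cited))
print(sorted(unknown))
```

A non-empty unknown set is a signal to retry, warn the user, or log the answer for review.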

What's next

Phase 3 closes here. You have everything needed to build a real RAG system: chunking, embeddings, vector storage, retrieval, augmentation, generation, multi-source metadata, filters, citations.

From Episode 18 we move into Phase 4: agents. We'll combine what you've built — tool use from Phase 2 and RAG from Phase 3 — into agents that plan, act, check their own work, and recover from errors. The first one is the ReAct pattern: think, act, observe, repeat.

Recap

What we did today. Stored three small documents in one collection with source metadata on each chunk. Used the where parameter to filter retrieval to a specific source. Labelled each chunk in the augmentation step so Claude could attribute facts. Asked the model to cite, and got back grounded, sourced answers.

You have a multi-document RAG system. The architecture is the same as anything you'll see in a production "chat with your docs" product, just with smaller corpora.

Next episode: the ReAct agent pattern. See you in the next one.
