Build AI Apps with Python: Why RAG? — Give AI Your Own Data | Episode 12
Video: Build AI Apps with Python: Why RAG? — Give AI Your Own Data | Episode 12 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode12
A fictional company handbook and a one-line trick that makes Claude an expert on it.
There's a question every developer eventually asks Claude: "What does our company's leave policy say?" Claude doesn't know. Claude has never seen your handbook. The model's training data ended at a known date, and it never included your private documents in the first place.
This is the moment you need RAG — Retrieval-Augmented Generation. The pattern that turns Claude from "knows what was in its training data" into "knows what's in your documents." It's the technique behind every AI feature that answers questions about a company's data, summarises a PDF the user uploaded, or runs Q&A over a knowledge base.
This episode is the why. Episodes 13–17 build the full pipeline. Today we're going to demonstrate the problem and show the simplest possible solution: stuff the document into the prompt and ask the question.
The two failure modes Claude has
When a model lacks the information it needs, it fails in one of two ways. Either it says "I don't know", which is honest but useless, or it makes something up. The latter is the worse failure: a confident wrong answer about your company's leave policy is more harmful than no answer.
The model isn't lying on purpose. It's predicting the next token based on patterns it learned. Without specific information about your company, it falls back to general patterns about employee handbooks and produces something plausible-sounding. That's a hallucination.
The fix is structural, not behavioural. We're not going to ask Claude to be more careful. We're going to give it the actual document, in the prompt, every time.
What we're building
We'll define a fake company handbook for "Acme Corp", the kind of policy text every company has, and ask Claude the same question about it twice: once without showing it the handbook, and once with. The before-and-after lets you feel the difference.
The "RAG" piece in this episode is a one-liner: we put the handbook in the user message right before the question. That's the entire technique in its simplest form. Real RAG systems get more sophisticated about which documents to inject, but the core idea — put relevant text in the prompt — is exactly what we'll do today.
The script
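import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

handbook = """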
Acme Corp Employee Handbook - 2026 Edition
Remote Work Policy:
All employees may work remotely up to 3 days per week.
...
Leave Policy:
Annual leave: 18 days per year for all full-time employees.
...
Tech Allowance:
Every employee receives $2,500 per year for equipment.
...
"""
question = "How many days of annual leave do Acme Corp employees get?"
# Without RAG
print("=== Without RAG ===")
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="Respond in plain text. No markdown. Be honest if you do not know.",
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)

# With RAG
print("\n=== With RAG ===")
rag_message = f"""Use the following document to answer the question.
Document:
{handbook}
Question: {question}"""
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="Respond in plain text. No markdown. Answer based only on the provided document.",
    messages=[{"role": "user", "content": rag_message}],
)
print(response.content[0].text)
The two API calls are identical except for two things: the user message (one is just the question, the other is document + question), and the system prompt (the second one tells Claude to use only the document).
What you'll see
The without-RAG run produces something honest:
I don't have specific information about Acme Corp's leave policy. Most companies offer somewhere between 10 and 25 days of annual leave depending on country, role, and tenure, but I can't give you a definitive answer for Acme Corp without seeing your company's actual handbook.
That's a good failure. Claude correctly didn't invent a number. But it also didn't help.
The with-RAG run produces:
Acme Corp employees receive 18 days of annual leave per year for all full-time employees.
Specific. Correct. Grounded in the document. The exact information from the handbook.
Same model, same default sampling settings, same question. The difference that matters is whether the answer was in the prompt.
What "Retrieval-Augmented Generation" actually means
Three words, three steps:
- Retrieval — find the document(s) relevant to the user's question.
- Augmentation — add the retrieved text to the prompt.
- Generation — let Claude answer using that augmented context.
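The three steps above can be sketched as three small functions. This is only an illustration, not the pipeline the later episodes build: the word-overlap scoring in `retrieve` is a stand-in for real retrieval, the function names are hypothetical, and `generate` takes a placeholder callable instead of calling the API so the sketch runs without a key.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lowercase the text and return the set of alphabetic words in it."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank documents by the number of words shared with the question."""
    question_words = words(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & words(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(question: str, retrieved: list[str]) -> str:
    """Augmentation: put the retrieved text into the prompt, ahead of the question."""
    context = "\n\n".join(retrieved)
    return (
        "Use the following document to answer the question.\n"
        f"Document:\n{context}\n"
        f"Question: {question}"
    )


def generate(prompt: str, ask_model: Callable[[str], str]) -> str:
    """Generation: hand the augmented prompt to whatever calls the model."""
    return ask_model(prompt)


# Sections taken from the Acme Corp handbook used in this episode.
sections = [
    "Remote Work Policy: All employees may work remotely up to 3 days per week.",
    "Leave Policy: Annual leave: 18 days per year for all full-time employees.",
    "Tech Allowance: Every employee receives $2,500 per year for equipment.",
]
question = "How many days of annual leave do Acme Corp employees get?"

retrieved = retrieve(question, sections)
prompt = augment(question, retrieved)

# Stand-in for a real model call (assumption: swap in client.messages.create).
answer = generate(prompt, ask_model=lambda p: f"(model would see {len(p)} characters)")
print(retrieved[0])
print(answer)
```

Word overlap is enough to pick the leave section here, but it would miss a question phrased with synonyms, which is one reason real systems use embeddings for this step.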
In today's script we skip the retrieval step. There's only one document; we always send the whole thing. That's fine for a tutorial. It breaks the moment you have more documents than fit in the prompt — a real company handbook might be 200 pages, plus an FAQ, plus a code-of-conduct, plus internal wiki pages.
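To get a feel for when "send the whole thing" stops working, here is a rough size check. It assumes about four characters per token, which is only a rule of thumb for English text, and the budget is a made-up number rather than any model's real limit. The helper names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-characters-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_prompt(documents: list[str], budget_tokens: int) -> bool:
    """Return True if all documents together stay within the token budget."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= budget_tokens


# Hypothetical sizes: a short handbook versus a handbook plus many wiki pages.
short_handbook = "x" * 4_000          # about 1,000 estimated tokens
wiki_pages = ["y" * 40_000] * 50      # about 500,000 estimated tokens in total

print(fits_in_prompt([short_handbook], budget_tokens=100_000))
print(fits_in_prompt([short_handbook, *wiki_pages], budget_tokens=100_000))
```

For real decisions you would want an exact count from the model provider's own tokenizer rather than a character-based guess.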
That's where Episodes 13–17 come in:
- Episode 13: Text splitting. Break documents into chunks that fit in a prompt.
- Episode 14: Embeddings. Turn chunks into vectors so we can compare them mathematically.
- Episode 15: Vector store. Store and search those vectors with ChromaDB.
- Episode 16: The RAG pipeline. Wire retrieval + augmentation + generation into one function.
- Episode 17: Multi-document RAG. Multiple sources, with citations.
After Episode 17, you can ask one question of a knowledge base of arbitrary size. The retrieval step picks the few chunks most likely to contain the answer, the augmentation injects them, and the generation step writes the answer. That's the architecture of every "chat with your docs" product you've ever seen.
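"Compare them mathematically" usually means something like cosine similarity. The sketch below uses tiny hand-written vectors purely for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and the numbers here are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented 3-dimensional "embeddings" for three handbook chunks.
chunk_vectors = {
    "leave policy": [0.9, 0.1, 0.0],
    "remote work policy": [0.1, 0.9, 0.0],
    "tech allowance": [0.0, 0.1, 0.9],
}

# Invented vector for the question about annual leave.
question_vector = [0.8, 0.2, 0.1]

# Retrieval picks the chunk whose vector points most nearly the same way.
best = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(question_vector, chunk_vectors[name]),
)
print(best)
```

A vector store does the same comparison, only over many more chunks and with indexing so it doesn't have to score every one.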
The instruction in the system prompt
system="Respond in plain text. No markdown. Answer based only on the provided document."
That last clause is doing real work. "Answer based only on the provided document." Without it, Claude might mix what's in the handbook with general knowledge about typical leave policies. The instruction grounds the response.
In production RAG, you'll often see even stronger constraints:
- "If the answer is not in the provided context, say 'I don't have that information.'"
- "Cite the section of the document that supports your answer."
- "Quote the relevant text verbatim, then explain it."
These constraints reduce hallucination further. They also expose when retrieval failed — if you ask "what's the parking policy?" and the retriever pulled the wrong chunks, you want Claude to say "I don't have that information" rather than guess.
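One way to make that refusal detectable in code is to ask for an exact sentence and then look for it in the reply. This is a sketch under assumptions: the constant and function names are hypothetical, and a model will not always reproduce the sentence word for word, so treat the check as a signal rather than a guarantee.

```python
NO_ANSWER = "I don't have that information."

# The episode's system prompt, extended with one of the stronger constraints.
STRICT_SYSTEM = (
    "Respond in plain text. No markdown. "
    "Answer based only on the provided document. "
    f"If the answer is not in the provided document, say exactly: {NO_ANSWER}"
)


def retrieval_probably_failed(answer: str) -> bool:
    """Return True if the reply contains the agreed refusal sentence."""
    return NO_ANSWER.lower() in answer.lower()


print(retrieval_probably_failed("I don't have that information."))
print(retrieval_probably_failed("Acme Corp employees receive 18 days of annual leave per year."))
```

Logging the questions that trigger the refusal is a cheap way to find out which topics the retriever keeps missing.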
When you don't need RAG
RAG isn't always the right tool. A few cases where you should think before reaching for it:
The document is small enough to always include. If your handbook fits in 2,000 tokens, just put it in the system prompt every time. No retrieval needed. Skip Episodes 13–17 entirely.
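For that small-document case, a minimal sketch might build the request with the handbook placed in the system prompt. The function name is hypothetical, the handbook text is abbreviated, and the model name and `max_tokens` value are copied from this episode's script.

```python
def build_small_doc_request(handbook: str, question: str) -> dict:
    """Build keyword arguments for a messages call with the document in the system prompt."""
    system = (
        "Respond in plain text. No markdown. "
        "Answer based only on the provided document.\n\n"
        f"Document:\n{handbook}"
    )
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 200,
        "system": system,
        "messages": [{"role": "user", "content": question}],
    }


# Abbreviated handbook text for illustration.
handbook = "Leave Policy:\nAnnual leave: 18 days per year for all full-time employees."
request = build_small_doc_request(
    handbook, "How many days of annual leave do Acme Corp employees get?"
)
print(request["system"])
```

With the `client` from the script above, the call would then be `client.messages.create(**request)`.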
The model already knows the answer. Public information that predates the model's cutoff doesn't need RAG. Don't build retrieval for "what's the speed of light" — Claude knows.
The information changes faster than your indexer. RAG works best on documents that are (relatively) stable. If your data changes every minute, you may want a tool that queries the live source instead of a vector database — i.e., function calling from Phase 2, not RAG from Phase 3.
The answer requires reasoning over the whole document. RAG retrieves relevant chunks. If the question is "summarise the entire policy in one paragraph", the relevant chunks are the whole document. Just send it whole.
Common mistakes
Putting the document after the question instead of before. Anthropic's long-context prompting guidance recommends placing long documents near the top of the prompt and the question at the end, which tends to produce better answers. Document first, question second.
Not telling Claude to ground in the document. Without the explicit instruction, the model may blend the document with general knowledge.
Sending too much context. Long context isn't free. It's slower, more expensive, and can make it harder for the model to pick out the passage that matters. Episodes 13–17 build the discipline to send only what's needed.
Confusing RAG with fine-tuning. RAG is runtime: you provide context per query. Fine-tuning is training-time: you adjust the model's weights. For teaching a model facts from your own documents, RAG is usually the right first choice. It's faster to build, easier to update, and typically cheaper.
What's next
Next episode: text splitting. Real documents are too large to send whole. We need to break them into chunks of a few hundred words each, with some overlap between chunks so that text straddling a chunk boundary still appears intact in at least one chunk. The simplest splitter is twenty lines of Python; we'll write it from scratch and inspect the chunks.
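As a preview of the idea, here is one possible word-based splitter with overlap. It is a sketch, not the splitter the next episode writes: the function name and default sizes are assumptions, and it splits on whitespace only, so it does not respect sentence boundaries.

```python
def split_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Tiny example: ten "words", chunks of four, one word of overlap.
sample = " ".join(str(i) for i in range(10))
for chunk in split_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, which is the overlap doing its job at a very small scale.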
Recap
What we did today. Defined a fictional Acme Corp handbook. Asked Claude the same question about it twice: first without showing it the handbook, then with. Watched the model fail honestly the first time and answer correctly the second time. Defined the three steps of RAG: retrieve, augment, generate. Acknowledged that today we did the simplest possible "retrieve" (always send the whole document) and that the rest of Phase 3 makes that step real.
You haven't built a knowledge base yet. You've shown yourself the problem RAG solves and the simplest version of the solution. The next five episodes turn this one-document trick into a real retrieval system.
Next episode: text splitting. See you in the next one.