Retrieve Augment Generate — Build Your First RAG Pipeline in Python | Episode 16
Video: Retrieve Augment Generate — Build Your First RAG Pipeline in Python | Episode 16 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode16
The full pipeline, end to end, in one Python function.
Four episodes' worth of pieces and we finally wire them together. Splitting (Ep13), embeddings (Ep14), and a vector store (Ep15) each solved a sub-problem. Today we put a question in one end and get a grounded answer out the other end.
The pipeline is three steps with three names that conveniently spell out the acronym:
- Retrieve — vector search for the chunks most likely to contain the answer.
- Augment — build a prompt that includes those chunks alongside the question.
- Generate — call Claude with the augmented prompt; get the answer.
Once you have this function written, RAG is something you call rather than something you build. The work shifts from "how do I make this work" to "how do I make this good" — picking the right chunks, the right top-K, the right prompt format, the right model. Today we get the basic shape right.
What we're building
A small knowledge base of programming-language facts (Python, JavaScript, Rust, Go, TypeScript) and an ask(question) function that takes a natural-language question and prints a grounded answer. Three test questions:
- "Who created Python and when?"
- "What is Rust used for?"
- "Which language was made by Google?"
For each, the pipeline retrieves the relevant document(s), builds a prompt, asks Claude, prints the answer.
The script
import chromadb
from anthropic import Anthropic

client = Anthropic()
db = chromadb.Client()
collection = db.create_collection(name="knowledge")

docs = [
    "Python was created by Guido van Rossum and first released in 1991...",
    "JavaScript was created by Brendan Eich in 1995 for Netscape Navigator...",
    "Rust was created by Mozilla and first released in 2015...",
    "Go was created by Google in 2009...",
    "TypeScript was created by Microsoft in 2012...",
]

collection.add(
    documents=docs,
    ids=[f"doc_{i}" for i in range(len(docs))],
)

def ask(question):
    # 1. RETRIEVE
    results = collection.query(query_texts=[question], n_results=2)
    docs_found = results["documents"][0]
    context = "\n\n".join(docs_found)

    # 2. AUGMENT
    prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say you do not know.
Context:
{context}
Question: {question}"""

    # 3. GENERATE
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="Answer based only on the provided context. Respond in plain text only. No markdown, no bullet points, no formatting.",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.content[0].text)
That's the whole pipeline. Three steps, each annotated, no surprises.
Step 1: Retrieve
results = collection.query(query_texts=[question], n_results=2)
docs_found = results["documents"][0]
context = "\n\n".join(docs_found)
Vector search — exactly what we built in Episode 15. We ask for n_results=2, the two chunks most semantically similar to the question. We extract the documents from the result and join them with double newlines as a separator.
Why two? It's a balance. One result is fastest and cheapest, but if the retriever guesses wrong, Claude has nothing to work with. Five results are more forgiving (even if the top guess is wrong, the answer might be among the runners-up), but more tokens cost more money and can dilute the model's focus. For a small knowledge base like ours, 2 is fine. For larger ones, 3–5 is common.
The \n\n separator helps Claude see the chunks as distinct passages. Without spacing, two chunks run together as one paragraph and the model may treat them as one source.
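To make the `results["documents"][0]` indexing concrete, here is a minimal sketch of the shape a query result takes: a dict whose values are lists of lists, one inner list per query text. The `sample_results` values (including the distances) and the `build_context` helper are made up for illustration and are not part of the episode's code.

```python
# Illustrative stand-in for what collection.query() returns for ONE query text.
# Each value is a list of lists: the outer list has one entry per query.
sample_results = {
    "ids": [["doc_0", "doc_4"]],
    "documents": [[
        "Python was created by Guido van Rossum and first released in 1991...",
        "TypeScript was created by Microsoft in 2012...",
    ]],
    "distances": [[0.41, 1.27]],  # made-up numbers; smaller means closer
}


def build_context(results, separator="\n\n"):
    """Join the chunks retrieved for the first (and only) query."""
    docs_found = results["documents"][0]  # [0] selects the first query's hits
    return separator.join(docs_found)


context = build_context(sample_results)
print(context)
```

Because we pass a single question in `query_texts`, there is only one inner list, which is why the script always reads index 0.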
Step 2: Augment
prompt = f"""Use the following context to answer the question.
If the answer is not in the context, say you do not know.
Context:
{context}
Question: {question}"""
Three pieces of prompt engineering live here.
The instruction. "Use the following context to answer the question." Claude needs to be told the relationship between the context and the question. Without that framing, the model might ignore the context, or use it as flavour rather than as the source of truth.
The escape hatch. "If the answer is not in the context, say you do not know." This is the difference between a grounded RAG system and a hallucinating one. With this clause, when retrieval misses, Claude falls back to honest "I don't know" instead of inventing. Without it, the model's instinct is to be helpful and answer anyway.
The order. Context comes before the question. We want the question to be the last thing in the prompt, because that's the part the model is answering and it is then the most recent text the model has read when it starts generating. Putting the question first, ahead of a long block of context, tends to work less well in practice.
This template is the cornerstone of RAG prompting. Variations exist for fancier needs (citation requirements, multi-document handling, conversational follow-ups), but every one of them is a tweak on this basic shape.
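The template can also be pulled out into a small helper so it can be inspected without calling the API. This is a sketch, not the episode's code: `build_prompt` is a hypothetical name, and the blank lines between sections are a formatting choice made here.

```python
def build_prompt(context, question):
    """Assemble the augmented prompt: instruction, escape hatch, context, question."""
    return (
        "Use the following context to answer the question.\n"
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"  # the question goes last, after the context
    )


prompt = build_prompt(
    context="Go was created by Google in 2009...",
    question="Which language was made by Google?",
)
print(prompt)
```

Printing the prompt before sending it is a cheap way to confirm that all three pieces (instruction, escape hatch, order) made it in.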
Step 3: Generate
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="Answer based only on the provided context. Respond in plain text only. No markdown, no bullet points, no formatting.",
    messages=[{"role": "user", "content": prompt}],
)
Standard messages.create() — same as the first dozen episodes. The novelty isn't the API call; it's what's in the user message.
The system prompt reinforces the grounding instruction once more: "Answer based only on the provided context." Saying it twice, in the user prompt and in the system prompt, is intentional. Repeating an important constraint in both places makes it harder for the model to overlook.
Watching it run
Knowledge base: 5 documents loaded
Q: Who created Python and when?
Retrieved 2 relevant documents
A: Python was created by Guido van Rossum and was first released in 1991.
Q: What is Rust used for?
Retrieved 2 relevant documents
A: Rust is used for systems programming, web assembly, and building fast command-line tools.
Q: Which language was made by Google?
Retrieved 2 relevant documents
A: Go was created by Google in 2009.
Three correct answers, all grounded in the documents. The model didn't blend in trivia from training data — "Python was created in 1991 and is named after Monty Python..." — because the context didn't say that and the prompt told it to stay grounded.
You won't see the retrieved chunks in this output (we didn't print them), but you can add print(docs_found) after the retrieve step to inspect what got pulled. If a question gets a wrong answer, that's the first place to look: was the right chunk retrieved? If yes, the problem is in the prompt. If no, the problem is in retrieval.
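For a slightly richer view than a bare print, a helper like the one below lists each hit with its id and distance. It is a sketch that assumes the query result carries `ids`, `documents`, and `distances`; the `describe_hits` name and the sample numbers are invented for illustration.

```python
def describe_hits(results, preview_chars=60):
    """Return one line per retrieved chunk: id, distance, and a short preview."""
    ids = results["ids"][0]
    docs = results["documents"][0]
    distances = results["distances"][0]
    lines = []
    for doc_id, dist, doc in zip(ids, distances, docs):
        lines.append(f"{doc_id}  distance={dist:.3f}  {doc[:preview_chars]}")
    return lines


# Hypothetical result for "Which language was made by Google?"
sample_results = {
    "ids": [["doc_3", "doc_0"]],
    "documents": [[
        "Go was created by Google in 2009...",
        "Python was created by Guido van Rossum and first released in 1991...",
    ]],
    "distances": [[0.35, 1.2]],  # made-up values
}

for line in describe_hits(sample_results):
    print(line)
```

If the chunk you expected is missing from this list, no amount of prompt tuning will fix the answer.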
Where to go from here
This is the minimum viable RAG. Real systems layer on:
Reranking. Retrieve 20 candidates with vector search, then use a cross-encoder model to re-score them and keep the top 3. The vector model is fast but imprecise; the cross-encoder is slower but more accurate. Two-pass retrieval gets the best of both.
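The two-pass shape can be sketched without any model at all. In the example below, a toy word-overlap score stands in for the cross-encoder, so this shows the structure of reranking only and says nothing about real reranker quality. All function names here are hypothetical.

```python
import re


def tokens(text):
    """Lowercase word set, used by the toy scorer below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(question, doc):
    """Toy relevance score: fraction of question words that appear in the doc.

    A real system would call a cross-encoder model here instead.
    """
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(doc)) / len(q)


def rerank(question, candidates, score_fn=overlap_score, keep=3):
    """Second pass: re-score the first-pass candidates and keep the best few."""
    ranked = sorted(candidates, key=lambda doc: score_fn(question, doc), reverse=True)
    return ranked[:keep]


# Pretend these came back from a wide first-pass vector search.
candidates = [
    "JavaScript was created by Brendan Eich in 1995 for Netscape Navigator...",
    "Go was created by Google in 2009...",
    "TypeScript was created by Microsoft in 2012...",
]
top = rerank("Which language was made by Google?", candidates, keep=1)
print(top)
```

Swapping in a real scorer only means passing a different `score_fn`; the retrieve-wide, keep-few structure stays the same.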
Hybrid search. Combine vector similarity with keyword (BM25) search. Vectors capture meaning; keywords catch exact matches. Together they're more robust than either alone.
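One common way to combine the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. RRF is a technique chosen here for illustration, not something the episode prescribes, and the two rankings below are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each id earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_ranking = ["doc_3", "doc_4", "doc_1"]   # hypothetical vector-search order
keyword_ranking = ["doc_1", "doc_3", "doc_0"]  # hypothetical BM25 order

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that both searches like (doc_3 and doc_1 here) rise to the top, while documents found by only one search still stay in the list.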
Query rewriting. Before retrieval, have a small LLM rewrite the user's question into a better search query. Useful when questions are conversational and the relevant text is more formal.
Conversational RAG. Multi-turn conversations need the previous context to inform the current retrieval. "Tell me more" on its own retrieves nothing useful — you need to combine it with the previous turn.
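The simplest version of that combination is string concatenation: prepend the most recent user turn to the follow-up before searching. The sketch below assumes `history` is a plain list of earlier user messages; `retrieval_query` is a hypothetical helper, and production systems often use an LLM to rewrite the query instead.

```python
def retrieval_query(history, follow_up, max_turns=1):
    """Build the text to search with: recent user turns plus the new message."""
    recent = history[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [follow_up])


history = ["What is Rust used for?"]
query = retrieval_query(history, "Tell me more")
print(query)  # What is Rust used for? Tell me more
```

The combined string is what gets passed to the vector search; the user still sees only their own follow-up.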
Source citations. Tag each chunk with its source and require Claude to cite. We'll do this in Episode 17.
Evaluation. Generate a test set of questions with known correct answers and measure how often the system gets them right. RAG quality is measurable. We'll touch on it in Episode 22.
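A first evaluation loop can be very small. The sketch below assumes an `answer_fn` that returns the answer as a string (the episode's ask() prints instead, so it would need to return the text to plug in here) and uses a crude substring check as the correctness test. The canned answers are invented so the example runs offline.

```python
def evaluate(answer_fn, test_set):
    """Return the fraction of questions whose answer contains the expected text."""
    if not test_set:
        return 0.0
    hits = 0
    for question, expected in test_set:
        answer = answer_fn(question)
        if expected.lower() in answer.lower():
            hits += 1
    return hits / len(test_set)


test_set = [
    ("Who created Python and when?", "Guido van Rossum"),
    ("Which language was made by Google?", "Go"),
]


def canned_answers(question):
    # Stand-in for the real pipeline so the sketch runs without an API key.
    answers = {
        "Who created Python and when?": "Python was created by Guido van Rossum and was first released in 1991.",
        "Which language was made by Google?": "I do not know.",
    }
    return answers[question]


accuracy = evaluate(canned_answers, test_set)
print(f"accuracy: {accuracy:.0%}")
```

Substring matching is a blunt instrument (a short expected string like "Go" would also match "Google"), so treat the number as a rough signal rather than a precise score.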
Common mistakes
Top-K too small. If the retriever's top hit is wrong, you've lost. Pull at least 3–5 candidates for non-trivial knowledge bases.
Top-K too large. Sending 20 chunks per query bloats latency and dilutes attention. Plus you pay for every token. Find the right balance.
Forgetting the escape hatch. Without the "say you do not know" clause, Claude tends to answer anyway when retrieval misses, either from training data or by guessing. The instruction is critical for trust.
Not splitting the corpus first. Indexing entire documents means retrieval is too coarse. Always chunk first.
Embedding the wrong field. If your data has separate fields — title, body, summary — decide explicitly what gets embedded. Often the body works best, but for some domains, embedding a synthesised "title + summary" field gives better retrieval.
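Re-creating the collection on every script run. ChromaDB's in-memory client (chromadb.Client()) throws away your data when the process exits, so every run re-embeds the whole corpus. For real apps, use PersistentClient, which keeps the collection on disk.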
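With a persistent store, the indexing step should run only when the collection is still empty. Here is a minimal sketch of that guard; `index_if_empty` is a hypothetical helper, and the path in the usage comment is only an example.

```python
def index_if_empty(collection, docs):
    """Add docs only when the collection has nothing in it yet.

    Returns the number of documents added (0 if indexing was skipped).
    """
    if collection.count() > 0:
        return 0
    collection.add(
        documents=docs,
        ids=[f"doc_{i}" for i in range(len(docs))],
    )
    return len(docs)


# Usage with ChromaDB (assumes chromadb is installed; the path is an example):
#   import chromadb
#   db = chromadb.PersistentClient(path="./chroma_db")
#   collection = db.get_or_create_collection(name="knowledge")
#   index_if_empty(collection, docs)
```

Using get_or_create_collection instead of create_collection matters here, because on the second run the collection already exists on disk.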
What's next
Next episode: multi-document RAG with citations. Real knowledge bases have many sources — handbook, FAQ, policy, wiki. Episode 17 stores metadata alongside each chunk, filters retrieval by source, and asks Claude to cite where each fact came from.
Recap
What we did today. Wrote a 20-line ask(question) function that retrieves relevant chunks from ChromaDB, builds a prompt that includes them as context, and asks Claude to answer using only that context. Tested it on three questions and got three grounded answers. Identified the prompt-engineering details that make grounding work — instruction, escape hatch, system prompt repetition.
You have a working RAG pipeline. From here on, "doing RAG" is calling ask().
Next episode: multi-document RAG. See you in the next one.