Part of Python AI Tutorial Series

Build AI Apps with Python: Prompt Engineering Patterns — Few-Shot Chain-of-Thought Role | Episode 24

Celest KimCelest Kim

Video: Build AI Apps with Python: Prompt Engineering Patterns — Few-Shot Chain-of-Thought Role | Episode 24 by Taught by Celeste AI - AI Coding Coach

Take the quiz on the full lesson page
Test what you've read · interactive walkthrough

Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode24 Same model. Same question. Different prompt. Dramatically different output.

The series finale. We've spent twenty-three episodes on architecture — APIs, tools, memory, RAG, agents, guardrails, evaluation. Today is about the content you put inside the prompt. Same Claude, same call, same parameters — but the answer's quality, format, and reliability change a lot depending on how the prompt is written.

Four patterns. Each is one of the highest-leverage moves in prompt engineering. None of them require a single line of code change in your agent loop; the change is purely in the strings you pass to system or messages.

Pattern 1: Zero-shot vs few-shot

Zero-shot means asking the model to do something with only an instruction:

"Classify the sentiment as positive, negative, or mixed."

Few-shot means asking the same thing but providing one or two examples first:

"Classify the sentiment as positive, negative, or mixed.

Examples: "Great product, love it!" -> positive "Terrible quality, broke in a day." -> negative "Good features but overpriced." -> mixed

Now classify: The battery life is amazing but the screen is too dim."

In both cases the model knows what to do. The difference is what it produces.

Zero-shot output is usually correct but verbose: "This review expresses mixed sentiment, since it praises the battery life while criticizing the screen brightness." That's three clauses for what should be one word.

Few-shot output mirrors the format of the examples: just mixed. The model copies the shape of the demonstrations.

When to use few-shot. Whenever output format consistency matters. Classification, extraction, structured outputs, code generation in a specific style. Two or three examples are usually enough.

When zero-shot is fine. Free-form prose. Open-ended questions. Anything where you don't have a strict format in mind.

The trade-off. Few-shot examples eat tokens. For a 10-token answer, a 200-token few-shot prompt is wasteful in cost-per-call. If you're calling the API at scale, look at whether the format consistency is worth the token tax. Often it is, but verify.

Pattern 2: Chain-of-thought

Take a math word problem:

A store has 24 apples. They sell 8 in the morning and receive a shipment of 15. How many do they have?

Direct prompt: "Answer the question." The model says 31. Often correct on small problems. Wrong more often than you'd guess on subtler ones.

Chain-of-thought prompt: "Think step by step. Show your reasoning before giving the answer." The model produces:

Start: 24 apples. Sell 8 → 24 - 8 = 16. Receive 15 → 16 + 15 = 31. Answer: 31.

The arithmetic is the same. Why does it matter?

Because the model's probability of being correct improves when it produces reasoning tokens. The act of writing the steps changes the next-token distribution to favour answers that follow from those steps. It's not magic; it's a structural property of how autoregressive models work.

For simple questions, chain-of-thought produces longer answers without much benefit. For multi-step reasoning — math, logic, planning, code generation, anything with a because — it materially improves correctness. We saw this in Episode 18 with ReAct agents: "think step by step" in the system prompt is doing exactly this.

A few specific phrases that work:

  • "Think step by step."
  • "Show your reasoning before giving the answer."
  • "Let's work through this carefully."
  • "First, identify what's being asked. Then, list the relevant facts. Finally, compute the answer."

The phrasings differ in detail but they all share the same goal: get the model to write reasoning before committing to an answer.

Pattern 3: Role prompting

Generic system prompt: "You are a helpful assistant." Ask: "Explain what an API is." The answer is competent and dense — talks about endpoints, requests, responses, perhaps some technical jargon.

Role-specific prompt: "You are a patient teacher explaining to a complete beginner. Use simple analogies. No jargon." Same question. The answer is now an analogy — "An API is like a waiter at a restaurant. You give the waiter your order, they take it to the kitchen, and they bring back what was made..."

Same model. Same question. The role transformed the answer from a definition into an explanation a beginner can follow.

Role prompting is a more refined version of the system-prompt persona pattern from Episode 2. The key insight is that role implies audience and constraints. "Patient teacher for beginners" says: assume zero prior knowledge, use analogies, avoid jargon. The model imports all of those constraints from one phrase.

Patterns that work:

  • "You are an X explaining to a Y." (audience-aware)
  • "You are an X who specialises in Z." (scope-restricted)
  • "You are an X. Be Z." (style-constrained)

Patterns that don't work:

  • "You are an expert." — too vague.
  • "You are a senior software engineer with 20 years of experience." — flattering noise; doesn't change behaviour.
  • "Your name is Bob." — meaningless.

The role should imply specific behaviours. If you can't name the behaviour the role enforces, drop the role.

Pattern 4: Output format control

Without instructions: "List 3 benefits of code reviews." The model often returns a wordy answer with an introduction ("Code reviews offer several benefits, including..."), a list, and a conclusion. Helpful, but if you're parsing the output you have to strip the prose.

With instructions: "Respond with a numbered list. One sentence per item. No introductions or conclusions." Now you get:

1. They catch bugs before they reach production.
2. They spread knowledge of the codebase across the team.
3. They improve consistency in style and architecture.

Three sentences. No fluff. Easy to parse, easy to display.

This is the structured-output pattern from Episode 5 in its lighter form. Episode 5 asked for full JSON; today's prompt asks for a numbered list. Same idea — tell the model exactly what shape you want — applied at less formal scale.

Common format constraints:

  • "Respond in plain text. No markdown."
  • "One sentence per bullet. No introductions."
  • "Return only the answer. No reasoning." (opposite of chain-of-thought; useful when the consumer is code)
  • "Use this format exactly: [TEMPLATE]."
  • "Reply only with yes or no."

The narrower the constraint, the more reliable the format. Telling the model "be concise" is vague; telling it "reply in 10 words or fewer" is precise.

Combining patterns

The patterns compose. A good system prompt for a serious task might use all four:

You are a senior code reviewer giving feedback to a junior engineer. (role)

For each suggestion, first explain the reasoning, then give the specific change. (chain-of-thought + format)

Examples: [example 1] [example 2] (few-shot)

Output format: 1. Reasoning: [...] Change: [...] 2. Reasoning: [...] Change: [...] (format)

Five elements: role, audience, reasoning instruction, examples, output format. None of them are expensive in tokens (a couple of hundred for a system prompt that gets reused across thousands of calls is great value). Together they shape the model toward exactly the response you want.

Where to invest

If you have one hour to spend on improving an AI feature, this is the order of impact:

  1. Format control. Lowest effort, immediate clarity benefit.
  2. Chain-of-thought. A single phrase, large quality jump on reasoning tasks.
  3. Role. Reframes the answer to an appropriate audience.
  4. Few-shot. When format consistency matters or the task is unusual.

Add patterns one at a time. After each addition, look at outputs. Run your eval (Episode 22). Don't bolt them all on at once or you can't tell what helped.

What this episode (and the series) was really about

The series was nominally about the Claude API. In practice it was about a style of building software. Once you've internalised it, you can apply it to any LLM:

  • API in. Text out.
  • Wrap with system prompts and structured output.
  • Add tools when you need actions.
  • Add memory when you need state.
  • Add RAG when you need facts.
  • Add guardrails when you need safety.
  • Evaluate continuously when you want to know whether you're getting better.

That's the whole template. Twenty-four episodes condensed into seven principles.

Common mistakes

Stacking unproven patterns. Adding role, few-shot, chain-of-thought, and format all at once on a task where one of them was already enough. Diminishing returns and bloated prompts.

Few-shot examples that contradict each other. "positive: amazing battery" and "positive: fast and slow" — the model learns whichever pattern is closest to the query. Examples should be unambiguous.

Role descriptions that don't constrain behaviour. "You are a senior assistant" doesn't mean anything. "You are a senior assistant. Refuse questions outside your domain." — now the role does work.

Format instructions that conflict with the model's training defaults. Asking for plain text but mentioning headers in the same prompt produces inconsistent output. Be coherent.

Forgetting that all four patterns live in system and user. Both are fair game. Big constraints in system, examples and per-query format hints in user.

What's next (after this series)

Topics not covered today that would be a natural follow-up:

  • Streaming with tools. Production agents stream their reasoning to the user as it's generated.
  • Prompt caching. Anthropic's caching feature lets you reuse a long system prompt across calls without paying for it each time. Big cost win for anyone with a stable system prompt.
  • Batched inference. When you have many independent calls, batching is often cheaper.
  • Fine-tuning vs prompting. When does it actually pay off to fine-tune? Almost never, for most teams. But when it does, the gains are real.
  • Production observability. Logging, tracing, replay tools that let you debug AI features the way you debug regular software.

Each of those is its own deep dive. With the foundations from these 24 episodes, you can pick them up by reading docs and trying them — the muscle for "build, test, evaluate, iterate" is in place.

Recap

What we did today. Showed four prompt-engineering patterns side by side: zero-shot vs few-shot, direct vs chain-of-thought, generic vs role, unformatted vs formatted. Each was the same question with two different prompts, demonstrating that prompt design materially shapes output. Identified when each pattern helps and when it's overkill. Discussed how the patterns compose.

You haven't built anything new today. You've finished the toolkit.

You can build AI apps now. The series ends here. What you do with it is up to you.

Thanks for following along. See you in the next series.

Ready? Take the quiz on the full lesson page →
Test what you've learned. Watch the lesson and try the interactive quiz on the same page.