Part of Python AI Tutorial Series

Build AI Apps with Python: Multi-Agent Pipeline — Delegate Specialize Orchestrate | Episode 20

Celest Kim

Video: Build AI Apps with Python: Multi-Agent Pipeline — Delegate Specialize Orchestrate | Episode 20, taught by Celeste AI - AI Coding Coach


Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode20

Three specialised agents. One supervisor. Each does one job well.

So far we've built single-agent systems: one model, one set of tools, one loop. That works well for many tasks but starts to strain when the work is genuinely multi-disciplinary. "Research a topic and write a polished summary" is two jobs. "Research, write, and edit" is three. A single agent doing all three with one system prompt usually does each job worse than a specialist would.

Today we split the work. Three agents — researcher, writer, reviewer — each with their own system prompt, called in sequence by a supervisor. The output of one becomes the input of the next. No agent has to be everything; each is an expert at one slice.

This is the pipeline pattern. It's the simplest multi-agent architecture and usually the most reliable. There are fancier patterns (hierarchical orchestration, agents talking to each other, dynamic team formation) but pipelines win on most production projects because they're predictable and debuggable.

What we're building

A research-and-summarise pipeline:

  1. Researcher — given a topic, returns three numbered facts.
  2. Writer — given the facts, writes a 2–3 sentence summary in flowing prose.
  3. Reviewer — given the summary, rates it Good or Needs Improvement.

A supervisor function calls the three agents in order and prints a final report.

We'll run it on the topic "The history of the Python programming language" and watch each specialist do their part.

The agents

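import anthropic

# Assumed setup, not shown in the original: the client reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

def researcher(topic):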
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="You are a research specialist. Given a topic, provide 3 key facts. Be concise — one sentence per fact. Number them 1, 2, 3.",
        messages=[{"role": "user", "content": f"Research this topic: {topic}"}],
    )
    return response.content[0].text

def writer(facts):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="You are a writing specialist. Given research facts, write a clear 2-3 sentence summary for a general audience. No bullet points — flowing prose only.",
        messages=[{"role": "user", "content": f"Write a summary from these facts:\n{facts}"}],
    )
    return response.content[0].text

def reviewer(summary):
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="You are an editorial reviewer. Given a summary, rate it Good or Needs Improvement. If good, say why in one sentence. If it needs work, suggest one specific fix.",
        messages=[{"role": "user", "content": f"Review this summary:\n{summary}"}],
    )
    return response.content[0].text

Three functions. Each makes one Claude API call with a different system prompt. The system prompt is the role. Same model, same temperature defaults — the model behaves differently because it's been told what kind of agent to be.
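
Since the three functions differ only in their system prompt and user message, the "system prompt is the role" idea can be sketched as a small factory. This is one possible refactor, not the lesson's code: `make_agent`, `FakeClient` and `FakeMessages` are hypothetical names, and the fake client is an offline stand-in that only mimics the shape of `client.messages.create` so the sketch runs without an API key.

```python
from types import SimpleNamespace

MODEL = "claude-sonnet-4-20250514"  # same model string the three functions use


def make_agent(client, system_prompt, user_template):
    """Return a one-call agent whose role is fixed by its system prompt."""
    def agent(text):
        response = client.messages.create(
            model=MODEL,
            max_tokens=200,
            system=system_prompt,
            messages=[{"role": "user", "content": user_template.format(text=text)}],
        )
        return response.content[0].text
    return agent


class FakeMessages:
    """Offline stand-in (an assumption for this sketch): echoes the prompts back."""
    def create(self, model, max_tokens, system, messages):
        reply = f"[{system}] {messages[0]['content']}"
        return SimpleNamespace(content=[SimpleNamespace(text=reply)])


class FakeClient:
    messages = FakeMessages()


client = FakeClient()  # swap in anthropic.Anthropic() for real calls
researcher = make_agent(client, "You are a research specialist.", "Research this topic: {text}")
writer = make_agent(client, "You are a writing specialist.", "Write a summary from these facts:\n{text}")

print(researcher("Python"))
```

With the fake client, the printed line simply shows which system prompt and user message each agent would send. The only thing that distinguishes the researcher from the writer is the pair of strings passed to the factory.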

Why specialised system prompts work

Same model. Three roles. Why does this produce better output than asking one agent to do everything?

The honest answer is that it just does, for a few reasons:

Output format compatibility. A "research" agent that produces numbered bullet points is great input for a writer that consumes them. A single agent might struggle to output bullets and prose and a critique — formatting drifts when responsibilities pile up.

Constraint focus. The researcher has one job: facts. The writer has one job: prose. The reviewer has one job: judgment. Each system prompt focuses on the constraints that matter for that role and ignores everything else.

Recoverable failure. If the researcher gives bad facts, the writer's job is harder. If the writer's summary is wordy, the reviewer can flag it. Specialisation creates checkpoints — and a place to retry without redoing the whole task.

Cognitive load. A single agent juggling three roles in one system prompt is more likely to drop a constraint than three agents each holding one. There's a real ceiling on how much guidance a model can perfectly follow at once.
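
The "recoverable failure" point can be made concrete with a checkpoint between stages. The sketch below is a minimal illustration, assuming a stage's output can be checked with a simple function. `run_stage`, `has_three_numbered_facts` and the flaky stub researcher are hypothetical names, not part of the lesson code.

```python
import re


def has_three_numbered_facts(text):
    """Checkpoint: accept the output only if its lines are numbered 1, 2 and 3."""
    numbers = re.findall(r"^\s*(\d+)\.", text, flags=re.MULTILINE)
    return numbers == ["1", "2", "3"]


def run_stage(agent, payload, is_valid, max_attempts=2):
    """Call one agent, re-running only that stage if its output fails the check."""
    for attempt in range(1, max_attempts + 1):
        output = agent(payload)
        if is_valid(output):
            return output, attempt
    raise ValueError(f"stage failed validation after {max_attempts} attempts")


# Stub researcher for illustration: malformed on the first call, correct on the second.
_replies = iter(["Python is old.", "1. Fact one.\n2. Fact two.\n3. Fact three."])


def flaky_researcher(topic):
    return next(_replies)


facts, attempts = run_stage(flaky_researcher, "Python", has_three_numbered_facts)
print(attempts)  # 2
```

Only the researcher is retried here. The writer and reviewer never see the malformed first attempt, which is the benefit of having a checkpoint at each stage boundary.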

The downside is cost: three API calls instead of one. But each call is small, because each agent sees only its own input and a short, focused system prompt, so the extra token overhead is usually modest.

The supervisor

def supervisor(task):
    facts = researcher(task)
    summary = writer(facts)
    review = reviewer(summary)
    # ... print final report

The supervisor is just a Python function that calls the three agents in sequence. There's no LLM involved. The orchestration is plain code.
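
To show that the orchestration really is plain code, here is a sketch of a supervisor that can run with stub agents and no API key. Passing the agents in as parameters is an assumption made for testability, and `print_report` uses a hypothetical layout, since the lesson's report-printing code is not shown.

```python
def supervisor(task, researcher, writer, reviewer):
    """Run the three stages in a fixed order and collect every intermediate result."""
    facts = researcher(task)
    summary = writer(facts)
    review = reviewer(summary)
    return {"topic": task, "facts": facts, "summary": summary, "review": review}


def print_report(report):
    """Hypothetical report layout; the original's exact formatting is not shown."""
    print("=" * 50)
    print("FINAL REPORT")
    print("=" * 50)
    for key in ("topic", "facts", "summary", "review"):
        print(f"\n{key.capitalize()}:\n{report[key]}")


# Stub agents so the sketch runs offline; real agents would call the API.
def stub_researcher(topic):
    return "1. Fact A.\n2. Fact B.\n3. Fact C."


def stub_writer(facts):
    return f"Summary of {len(facts.splitlines())} facts."


def stub_reviewer(summary):
    return "Good. Clear and concise."


report = supervisor(
    "The history of the Python programming language",
    stub_researcher,
    stub_writer,
    stub_reviewer,
)
print_report(report)
```

Returning a dictionary rather than printing inside the supervisor keeps every intermediate result available for logging or tests.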

This is sometimes called a static workflow. The graph of agent calls is fixed at design time. You always go researcher → writer → reviewer.

The alternative is a dynamic workflow, where another LLM decides which agent to call next, possibly looping or branching. That's more flexible but harder to debug. Stick with static unless you have a clear reason not to. "Could a Python if statement do this?" is the right test — if yes, write the if statement.

What it produces

Supervisor received task: The history of the Python programming language
==================================================

Step 1: Delegating to Researcher...
  [Researcher] Researching: The history of the Python programming language
  [Researcher] Done.

  Research results:
  1. Python was created by Guido van Rossum and first released in 1991, with development beginning in late 1989 at CWI in the Netherlands.
  2. The language was named after the British comedy group Monty Python, not the snake, reflecting van Rossum's preference for a short, unique, and slightly mysterious name.
  3. Python 2 was released in 2000 and Python 3 in 2008, with the latter being a significant overhaul that broke backward compatibility.

Step 2: Delegating to Writer...
  [Writer] Writing summary from research...
  [Writer] Done.

  Summary:
  Python is a programming language created by Guido van Rossum and first released in 1991, with development beginning in 1989 at CWI in the Netherlands. The language takes its name from the British comedy group Monty Python rather than the snake. Python 2 launched in 2000 and was followed by Python 3 in 2008, which introduced significant changes that broke backward compatibility.

Step 3: Delegating to Reviewer...
  [Reviewer] Reviewing summary...
  [Reviewer] Done.

  Review:
  Good. The summary is clear, accurate, and covers the key historical milestones in three well-structured sentences appropriate for a general audience.

==================================================
FINAL REPORT
==================================================

Topic: The history of the Python programming language
...

Three agents. Three different roles. One coherent workflow. No single agent had to be researcher and writer and reviewer.

When pipelines beat single agents

A few signals that you should split into a pipeline:

The system prompt is getting long. Past 200 lines of instructions, you're probably packing too many roles into one model. Each role becomes a candidate agent.

The output format has multiple distinct sections. "Return a research section, then a summary, then a critique." That's three outputs; pipeline them.

You want to swap one part out. If you might want to upgrade the writer to use a different model, or run the reviewer on a cheaper model, separation makes that trivial.
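
One way to make that swap trivial is to keep each stage's settings in a small config table. This is a sketch under assumptions: `StageConfig`, `STAGES` and `request_kwargs` are hypothetical names, the system prompts are shortened, and `"cheaper-model-id"` is a placeholder rather than a real model name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    model: str
    system: str
    max_tokens: int = 200


# "cheaper-model-id" is a placeholder, not a real model name.
STAGES = {
    "researcher": StageConfig("claude-sonnet-4-20250514", "You are a research specialist."),
    "writer": StageConfig("claude-sonnet-4-20250514", "You are a writing specialist."),
    "reviewer": StageConfig("cheaper-model-id", "You are an editorial reviewer."),
}


def request_kwargs(stage, user_content):
    """Build the keyword arguments for client.messages.create for one stage."""
    cfg = STAGES[stage]
    return {
        "model": cfg.model,
        "max_tokens": cfg.max_tokens,
        "system": cfg.system,
        "messages": [{"role": "user", "content": user_content}],
    }


kwargs = request_kwargs("reviewer", "Review this summary:\n...")
print(kwargs["model"])
```

Changing the reviewer's model then means editing one entry in `STAGES`; the researcher and writer are untouched.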

The work has a quality-control step. Reviewer/checker/critic agents catch mistakes the producer agent makes. The two-stage produce-then-review pattern is one of the highest-value pipelines in production AI.

When pipelines lose to single agents

Conversely, a pipeline is overkill when:

  • The task is genuinely one job. Don't split for the sake of splitting.
  • Latency matters. Three sequential API calls take roughly 3× as long as one.
  • The agents need to share rich context. Passing the full conversation between agents is wasteful; a single agent keeps it in one history.
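
If latency is the concern, it helps to measure each stage before deciding. The wrapper below is a minimal sketch with stub agents. `timed` is a hypothetical helper, and with real agents each recorded duration would be dominated by one API round trip.

```python
import time


def timed(name, agent, timings):
    """Wrap an agent so each call records its wall-clock duration under `name`."""
    def wrapper(payload):
        start = time.perf_counter()
        try:
            return agent(payload)
        finally:
            timings[name] = time.perf_counter() - start
    return wrapper


timings = {}
# Stub agents; real ones would each block on one API round trip.
researcher = timed("researcher", lambda topic: "1. A\n2. B\n3. C", timings)
writer = timed("writer", lambda facts: "A short summary.", timings)
reviewer = timed("reviewer", lambda summary: "Good.", timings)

review = reviewer(writer(researcher("Python")))
total = sum(timings.values())  # sequential stages: latencies add up
print(sorted(timings), f"{total:.6f}s")
```

Because the stages run one after another, the pipeline's latency is the sum of the per-stage numbers, so the slowest stage is the first place to look.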

Common mistakes

Inconsistent input/output formats between stages. If the researcher returns bullets and the writer expects paragraphs, things break. Either standardise the format or have the writer's prompt explicitly normalise.
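
Standardising the format can also happen in code, between the stages. The sketch below assumes the researcher's output is a list whose items may be numbered or bulleted in slightly different ways. `parse_facts` and `format_facts` are hypothetical helpers, and the sample text is made up for illustration.

```python
import re

# Matches "1. text", "2) text", "- text", "* text" or "• text" at the start of a line.
_ITEM = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def parse_facts(text):
    """Extract list items from loosely formatted agent output."""
    facts = []
    for line in text.splitlines():
        match = _ITEM.match(line)
        if match:
            facts.append(match.group(1))
    return facts


def format_facts(facts):
    """Re-emit facts in the single numbered format the writer prompt expects."""
    return "\n".join(f"{i}. {fact}" for i, fact in enumerate(facts, start=1))


raw = (
    "Here are the facts:\n"
    "- Created by Guido van Rossum.\n"
    "2) Named after Monty Python.\n"
    " 3. Python 3 arrived in 2008. "
)
print(format_facts(parse_facts(raw)))
```

The writer then always receives a clean `1.`, `2.`, `3.` list, whatever the researcher happened to emit.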

Letting an agent see too much. The reviewer doesn't need the original task, only the summary. Less context means better focus, so pass the minimum information each stage needs.

No retry on bad output. If the reviewer says "needs improvement," the simplest production fix is to feed the critique back to the writer and try again. Add this loop only when you've seen real value from a single pass first.
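
A bounded version of that feedback loop might look like the sketch below. It makes two assumptions: the writer accepts a full user message, so the revision prompt can carry the critique, and the reviewer's reply starts with its rating, as it does in the sample run. `write_with_review` and the stub agents are hypothetical.

```python
def write_with_review(facts, writer, reviewer, max_revisions=1):
    """Draft, review, and redraft until the reviewer says Good or revisions run out."""
    summary = writer(f"Write a summary from these facts:\n{facts}")
    review = reviewer(summary)
    revisions = 0
    # Assumes the review text begins with the rating ("Good" or "Needs Improvement").
    while not review.strip().lower().startswith("good") and revisions < max_revisions:
        summary = writer(
            f"Revise this summary.\nFacts:\n{facts}\n"
            f"Draft:\n{summary}\nReviewer feedback:\n{review}"
        )
        review = reviewer(summary)
        revisions += 1
    return summary, review, revisions


# Stub agents so the loop can be exercised offline.
def stub_writer(prompt):
    if "Reviewer feedback" in prompt:
        return "Tight summary."
    return "A very, very wordy summary."


def stub_reviewer(summary):
    if summary == "Tight summary.":
        return "Good. Concise."
    return "Needs Improvement. Cut the filler words."


summary, review, revisions = write_with_review("1. A\n2. B\n3. C", stub_writer, stub_reviewer)
print(revisions, review)  # 1 Good. Concise.
```

The `max_revisions` cap matters: without it, a reviewer that never says Good would keep the loop running and the costs climbing.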

Reinventing the supervisor as an LLM. Sometimes you really do need a smart orchestrator. Most of the time, a Python function is the right supervisor.

Giving the agent a persona name. "You are Bob, the researcher..." adds nothing; the persona name doesn't help. Just describe the role and constraints.

What's next

Next episode: safe AI agents. Production agents need guardrails — input filters, output filters, tool allowlists, rate limits. We'll add three layers of safety to a single agent and verify each one blocks a different category of misuse.

Recap

What we did today:

  • Built three specialised agents (researcher, writer, reviewer), each with its own system prompt and one job.
  • Wrote a supervisor function that calls them in sequence: researcher → writer → reviewer.
  • Watched the pipeline produce a coherent report none of the agents could have produced alone.
  • Discussed when pipelines beat single agents and when they lose.

You've now seen all three of the basic agent shapes: ReAct loop (Episode 18), persistent memory (Episode 19), pipelines (today). Almost every production agent is a combination of these.

Next episode: safe AI agents. See you in the next one.
