
Build AI Apps with Python: Streaming Responses — Real-Time AI Output | Episode 4

Celest Kim

Video lesson taught by Celeste AI - AI Coding Coach


Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode04

Same chatbot. One method swap. Completely different feel.

There's a moment you've probably felt without naming. You ask a question in ChatGPT or Claude.ai. The cursor blinks. Then text starts appearing — word by word, line by line — like someone typing the answer in real time. By the time the response finishes, you've already started reading it.

That's streaming. It's not just a UX flourish. It's a fundamentally different way of talking to a model: instead of waiting for the entire response and then showing it, you display each token as soon as the model produces it. The model takes the same total time. The user starts getting value much sooner.

In Episode 3 we built a chatbot that prints replies in one chunk after a multi-second pause. That feels old. Today's job is making it feel modern. The change is one method call.

What streaming actually is

When you call client.messages.create(), the SDK opens an HTTPS request, waits for the server to finish generating the entire response, then returns a single object. From your code's perspective, the call blocks until the model is done.

When you call client.messages.stream(), the SDK opens the same kind of request — but as server-sent events. The server pushes a small event each time the model emits a new token. The SDK turns those events into a Python iterator. You consume the iterator in a loop, and each loop iteration gives you a tiny piece of new text.
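To make the event-to-iterator step concrete, here is a minimal offline sketch of the kind of work an SDK does with a server-sent event stream. It is not the Anthropic SDK's actual implementation. The helper iter_text and the hard-coded SAMPLE_LINES are hypothetical, and the payload shape (content_block_delta events carrying a text_delta) is an assumption based on the Messages API streaming format.

```python
import json
from typing import Iterable, Iterator


def iter_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from raw server-sent event lines (illustrative helper)."""
    for line in lines:
        # SSE payload lines start with "data:"; skip "event:" lines and blank separators.
        if not line.startswith("data:"):
            continue
        event = json.loads(line[len("data:"):].strip())
        if event.get("type") != "content_block_delta":
            continue
        delta = event.get("delta", {})
        if delta.get("type") == "text_delta":
            yield delta["text"]


# Hard-coded sample of what a stream might look like on the wire (assumed shape).
SAMPLE_LINES = [
    "event: message_start",
    'data: {"type": "message_start"}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": ", world"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
]

pieces = list(iter_text(SAMPLE_LINES))
print(pieces)
```

Because iter_text is a generator, the caller gets each fragment as soon as its line has been read, which is the same shape as looping over a stream of text chunks.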

The model isn't faster. The total tokens generated and the total wall-clock time are the same. What changes is when you can show the user something. With stream(), the first character on screen appears in a few hundred milliseconds. The user starts reading while the model is still working.

That's the deal. Streaming is a UX choice with a small implementation cost.

What we're building

The same chatbot from Episode 3 — history list, while True loop, multi-turn memory — but the API call streams. Words flow onto the screen as Claude generates them. The conversation history still works, because we collect the streamed pieces back into one full string and append that string to history just like before.

Two changes from Episode 3:

  1. client.messages.create(...) becomes with client.messages.stream(...) as stream:
  2. We loop over stream.text_stream and print(chunk, end="", flush=True) each piece while accumulating the full response.

Everything else is identical.

The script

from dotenv import load_dotenv
from anthropic import Anthropic

# Load environment variables (such as ANTHROPIC_API_KEY) from a .env file.
load_dotenv()

client = Anthropic()

# Full conversation so far, sent with every request.
history = []

print("Streaming Chatbot! Type quit to exit.\n")

while True:
    user_input = input("You: ")
    if user_input.lower() == "quit":
        break

    history.append({"role": "user", "content": user_input})

    # Print the label now, on the same line the streamed text will follow.
    print("Claude: ", end="", flush=True)

    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="You are a helpful assistant. Keep responses brief and clear.",
        messages=history,
    ) as stream:
        full_response = ""
        for text in stream.text_stream:
            print(text, end="", flush=True)  # show each piece as it arrives
            full_response += text            # rebuild the complete reply

    print("\n")

    # Store the complete reply so the next turn has context.
    history.append({"role": "assistant", "content": full_response})

print("Goodbye!")

Same shape, different API call.

The with block

with client.messages.stream(...) as stream:
    ...

stream() returns a context manager, so we use with. The with block guarantees the underlying HTTPS connection is closed cleanly when we're done, even if an exception fires mid-stream. This matters more than it looks: a leaked stream holds an open socket, and enough of them can exhaust the HTTP client's connection pool or the process's file descriptors. Letting with manage the lifecycle is the right habit.
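The cleanup guarantee can be seen without touching the network. The sketch below uses a hypothetical fake_stream context manager as a stand-in for client.messages.stream(); it only records whether its cleanup code ran, and the body deliberately raises partway through.

```python
from contextlib import contextmanager

closed = []  # records cleanup so we can inspect it afterwards


@contextmanager
def fake_stream(chunks):
    """Hypothetical stand-in for client.messages.stream(); no network involved."""
    try:
        yield iter(chunks)
    finally:
        # Runs on normal exit and also when the body of the with block raises.
        closed.append(True)


try:
    with fake_stream(["Hel", "lo"]) as stream:
        for text in stream:
            raise RuntimeError("simulated failure mid-stream")
except RuntimeError:
    pass

print(closed)
```

The exception still propagates to the caller, but the cleanup step has already run by the time it gets there.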

Inside the block, stream is the stream object. The actual text comes from stream.text_stream, which is an iterator that yields strings.

The print loop

full_response = ""
for text in stream.text_stream:
    print(text, end="", flush=True)
    full_response += text

Two things happen on every iteration. We print the new piece to the terminal, and we accumulate it into full_response.

Why both? Because streaming is for the user; conversation history is for the next API call. The next time the loop runs, we'll send the full assistant message back as part of messages. We need the whole string in one piece. So as the chunks arrive we both display them and concatenate them.

The print() call has two non-default arguments worth understanding:

  • end="" overrides Python's default newline. Without it, every chunk would print on its own line and the output would look like a list of fragments. With end="", chunks flow together into one continuous paragraph.
  • flush=True forces Python to write to the terminal immediately rather than buffering. Standard output is usually line-buffered on a terminal, so without it Python may hold onto a chunk until a newline arrives or the buffer fills. With it, every chunk hits the screen the instant it arrives.

Both are required. end="" makes the layout right; flush=True makes the timing right.
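The layout half of that is easy to check offline. This sketch prints the same chunks into an in-memory buffer twice, once with print's defaults and once with end="" and flush=True. The render helper and the sample chunks are made up for illustration, and an in-memory buffer cannot show the timing effect of flush=True, only the layout.

```python
import io

chunks = ["A CPU", ", or central", " processing unit"]


def render(chunks, **print_kwargs):
    """Print chunks into an in-memory buffer and return what was written."""
    buffer = io.StringIO()
    for chunk in chunks:
        print(chunk, file=buffer, **print_kwargs)
    return buffer.getvalue()


default_output = render(chunks)                     # newline after every chunk
joined_output = render(chunks, end="", flush=True)  # chunks flow together

print(repr(default_output))
print(repr(joined_output))
```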

The "Claude: " prefix

print("Claude: ", end="", flush=True)

This line, before the stream block, prints the Claude: label. Same end="" and flush=True for the same reasons — the prefix needs to appear on screen before the first chunk arrives, and on the same line. Without flush=True the label might lag behind the first piece of streamed text and look mistimed.

After the stream completes, we print "\n" to drop a blank line so the next You: prompt sits on its own line.

Tiny details. They make the difference between a polished CLI and one that feels rough.

Running it

Run :sp | terminal python % from Neovim, the same way as in Episode 3, because the script still has an input() loop and needs an interactive terminal. The terminal split opens. You type:

You: Explain how a CPU works in one paragraph.

Claude: A CPU, or central processing unit, is the brain...

Watch what happens during that second line. It doesn't appear all at once. Words materialise. You start reading "A CPU, or central" before the rest exists. By the time the model is finished, you're already partway through the explanation.

That's the upgrade. Total time is the same; perceived time is dramatically shorter.

Type a follow-up and notice that memory still works:

You: What does the clock speed measure?

Claude: Clock speed measures how many cycles per second a CPU can perform...

Nothing about streaming changed the conversation history pattern. We still appended the user message before the call, accumulated the response into full_response, and appended that to history after. Memory + streaming compose cleanly.
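The pattern can be isolated in a small helper and exercised with fake chunks. In this sketch stream_reply is a hypothetical function, not part of the episode's script, and the lists of strings stand in for stream.text_stream so it runs without an API key.

```python
def stream_reply(chunks, history):
    """Print chunks as they arrive, then store the joined reply in history."""
    parts = []
    for text in chunks:
        print(text, end="", flush=True)
        parts.append(text)
    print()
    full_response = "".join(parts)
    history.append({"role": "assistant", "content": full_response})
    return full_response


history = []
history.append({"role": "user", "content": "Explain how a CPU works."})
stream_reply(["A CPU ", "is the ", "brain..."], history)
history.append({"role": "user", "content": "What does the clock speed measure?"})
stream_reply(["Clock speed ", "measures cycles per second."], history)

roles = [message["role"] for message in history]
print(roles)
```

Collecting pieces in a list and joining once is a small stylistic alternative to +=; for replies of a few hundred tokens either approach is fine.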

Why the production world streams

The reason most modern AI products stream is psychological, not technical. Time to first token is the metric users actually feel. A response that takes 4 seconds total but starts appearing in 300ms feels faster than a response that takes 2 seconds total but arrives all at once.

There's also a graceful-degradation benefit. If your network is slow, streaming reveals the problem incrementally — text just arrives more slowly. Without streaming, the user stares at a blinking cursor wondering whether anything is happening.

For long responses (essays, code, summaries) the difference is even sharper. A non-streaming chatbot writing a 500-word answer makes the user wait 10 seconds in silence. A streaming one starts after 300ms and the user reads at their own pace. Same model. Same answer. Different product.
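The gap between time to first output and total time can be measured with a simulated model. In this sketch fake_model is a hypothetical generator that sleeps briefly before each token; the delays are arbitrary and only the relative numbers matter.

```python
import time


def fake_model(n_tokens=20, delay=0.01):
    """Hypothetical stand-in for a model: yields one token every `delay` seconds."""
    for i in range(n_tokens):
        time.sleep(delay)
        yield f"tok{i} "


def measure_blocking():
    start = time.perf_counter()
    text = "".join(fake_model())  # nothing is visible until everything is done
    elapsed = time.perf_counter() - start
    return elapsed, elapsed, text


def measure_streaming():
    start = time.perf_counter()
    first_visible = None
    parts = []
    for token in fake_model():
        if first_visible is None:
            # The first token could be shown to the user at this moment.
            first_visible = time.perf_counter() - start
        parts.append(token)
    total = time.perf_counter() - start
    return first_visible, total, "".join(parts)


blocking_first, blocking_total, blocking_text = measure_blocking()
streaming_first, streaming_total, streaming_text = measure_streaming()
print(f"blocking:  first output after {blocking_first:.2f}s, done after {blocking_total:.2f}s")
print(f"streaming: first output after {streaming_first:.2f}s, done after {streaming_total:.2f}s")
```

Both runs produce the same text in roughly the same total time; only the moment the first output becomes available differs.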

When not to stream

A few cases where messages.create() is the right choice:

You need the structured output before you can act. If you're parsing the response as JSON or extracting a tool-call payload, you can't act on half a JSON object. Wait for the whole thing.
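A quick way to see the half-a-JSON-object problem: feed a JSON document to the parser piece by piece. The chunks below are invented for illustration, and try_parse is a hypothetical helper.

```python
import json

chunks = ['{"city": "Par', 'is", "temp_c"', ': 21}']


def try_parse(text):
    """Return the parsed object, or None if text is not yet valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


buffer = ""
results = []
for chunk in chunks:
    buffer += chunk
    results.append(try_parse(buffer))

print(results)
```

Only the final, complete buffer parses, so there is nothing to act on until the last chunk has arrived.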

You're processing the response programmatically, not displaying it. Background jobs, batch inference, evaluation harnesses — there's no user staring at a screen. Streaming is pointless overhead.

You're aggregating across many calls. If you're fanning out 50 parallel requests and combining the answers, code is simpler with .create(). Streaming each one buys you nothing because nobody's reading them live.
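A fan-out like that might be sketched as follows. The ask function here is a hypothetical stub that echoes its prompt so the example runs offline; in a real program it would wrap a client.messages.create() call and return the reply text.

```python
from concurrent.futures import ThreadPoolExecutor


def ask(prompt):
    """Hypothetical stand-in for a function that calls client.messages.create()
    and returns the reply text. Here it just echoes so the sketch runs offline."""
    return f"answer to: {prompt}"


prompts = [f"question {i}" for i in range(5)]

# executor.map preserves input order, so answers line up with prompts.
with ThreadPoolExecutor(max_workers=5) as executor:
    answers = list(executor.map(ask, prompts))

combined = "\n".join(answers)
print(combined)
```

Each worker returns one complete string, so combining results is a single join with no per-chunk bookkeeping.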

Episodes 5 onward use .create() for exactly these reasons. Streaming is a UX layer; not every call is for a user to read in real time.

Common mistakes

Forgetting flush=True. The output looks correct but lags. Add it.

Using print(text) instead of print(text, end="", flush=True). You get one chunk per line — a fragmented mess. Use both arguments.

Not accumulating into full_response. If you skip this, your history will end up with empty assistant messages. The next request may be rejected, or it reaches Claude without its own previous reply and the conversation loses its thread. Always collect the stream.

Iterating over stream.text_stream more than once. It's a one-shot iterator. After the for-loop finishes, it's exhausted. If you need the full text again, use the full_response you accumulated, not the stream.
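One-shot behaviour is ordinary Python iterator behaviour, so a plain generator demonstrates it. This sketch assumes text_stream acts like the generator below; text_chunks is a hypothetical stand-in.

```python
def text_chunks():
    """Hypothetical stand-in for stream.text_stream: a plain generator."""
    yield "Hello"
    yield ", "
    yield "world"


chunks = text_chunks()
first_pass = "".join(chunks)   # consumes the generator
second_pass = "".join(chunks)  # nothing left to yield

print(repr(first_pass))
print(repr(second_pass))
```

The second pass raises no error; it silently produces an empty string, which is why this mistake is easy to miss.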

Trying to mix streaming with tools= or other advanced features without checking the docs. Streaming with tools is supported, but the events are different: tool arguments arrive as input_json_delta fragments inside a tool_use content block rather than as text_delta pieces. We'll cover this when we get to function calling.
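As a preview only, here is a rough offline sketch of why tool streaming needs different handling. The event dictionaries are hard-coded and assumed to follow the Messages API streaming format; the collect helper, the get_weather tool name, and the toolu_example id are all hypothetical. Check the current docs before relying on any of these shapes.

```python
import json

# Hard-coded sample events, assumed to follow the Messages API streaming format.
SAMPLE_EVENTS = [
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "text_delta", "text": "Let me check."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_example",
                       "name": "get_weather", "input": {}}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "Pa'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'ris"}'}},
    {"type": "content_block_stop", "index": 1},
]


def collect(events):
    """Return (text, tool_calls) gathered from a sequence of event dicts."""
    text_parts, tool_calls, pending = [], [], {}
    for event in events:
        kind = event["type"]
        if kind == "content_block_start" and event["content_block"]["type"] == "tool_use":
            pending[event["index"]] = {"name": event["content_block"]["name"], "json": ""}
        elif kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                text_parts.append(delta["text"])
            elif delta["type"] == "input_json_delta":
                pending[event["index"]]["json"] += delta["partial_json"]
        elif kind == "content_block_stop" and event["index"] in pending:
            call = pending.pop(event["index"])
            # Tool input is only parseable once the block is complete.
            tool_calls.append({"name": call["name"], "input": json.loads(call["json"])})
    return "".join(text_parts), tool_calls


text, tool_calls = collect(SAMPLE_EVENTS)
print(text)
print(tool_calls)
```

Text can be shown as it arrives, while the tool arguments have to be buffered until their block closes.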

What's next

Phase 1 isn't finished yet: Episode 5 (structured output) and Episode 6 (vision) are still ahead. But after streaming you already have the tools to build any "AI front-end": a chatbot that types in real time, remembers, and behaves with personality.

Episode 5 makes Claude return structured data instead of prose. JSON in, JSON out. That's the bridge from "AI that talks" to "AI that drives software."

Recap

What we did today. Replaced client.messages.create() with client.messages.stream() and looped over stream.text_stream, printing each chunk and accumulating the whole response into one string. Used end="" and flush=True to make the output flow naturally to the terminal. Confirmed memory still works because we append full_response to history exactly the way we appended the assistant message in Episode 3.

You haven't built a new chatbot. You've built the modern version of the same chatbot. Words appear as they're generated. The model takes the same time. The product feels much faster.

Next episode: structured output. Where Claude stops writing paragraphs and starts returning JSON your Python code can parse and act on directly.

See you in the next one.
