Part of Python AI Tutorial Series

Build AI Apps with Python: Vision — AI That Can See Images | Episode 6

Celest Kim

•April 18, 2026

Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode06

Same API. New input type. Suddenly your script can describe images.

Until now, every message we've sent Claude has been text. The first five episodes built up a real toolbox — questions, system prompts, multi-turn memory, streaming, structured output — but the input has always been a string.

Today we drop that constraint. Claude is multimodal. The same messages.create() call accepts images. Send a picture and a question, get back a description, an analysis, a transcription, an answer about what's in the frame. The API hasn't changed. The shape of the content field expands from "a string" to "a list of blocks, some of which are images."

This is the last episode of Phase 1. After this, the front end of an AI app — text, conversation, structure, sight — is fully under your fingers. From Episode 7 we move into Phase 2 and start giving the model abilities of its own.

What we're building

A describe_image() function. Pass it a path to a PNG or JPEG file and it returns a paragraph describing what's in the image. We'll feed it two photos — sunset.png and city.png — and let Claude tell us what it sees.

The script is short, but it introduces three new mechanics: base64 encoding, multi-modal content blocks, and media types. Each one is a small idea. Together they're the whole vision pipeline.

The script

import os
import base64
from dotenv import load_dotenv
from anthropic import Anthropic

load_dotenv()

client = Anthropic()

def describe_image(image_path):
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    ext = image_path.split(".")[-1].lower()
    media_type = f"image/{ext}"
    if ext == "jpg":
        media_type = "image/jpeg"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Describe this image in detail.",
                    },
                ],
            }
        ],
    )

    return response.content[0].text

print("=== Image 1: sunset.png ===")
print(describe_image("sunset.png"))

print("\n=== Image 2: city.png ===")
print(describe_image("city.png"))

Three new ideas inside one function. Let's walk them.

Base64 encoding

with open(image_path, "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

Images are binary. They're sequences of bytes that include null characters, control characters, and arbitrary 8-bit values. JSON, on the other hand, is text. You can't just stuff raw image bytes into a JSON field — most of those bytes aren't valid in a JSON string.

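Base64 is the standard fix. It encodes any binary blob as a string drawn from an alphabet of 64 ASCII characters (A–Z, a–z, 0–9, plus + and /, with = for padding). The encoded form is about 33% larger than the original, because every 3 bytes become 4 characters, but it's safe inside JSON and most other places where text is expected. URLs are the exception: +, / and = have special meaning there, which is why a separate URL-safe variant of the alphabet exists.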

The two steps:

open(image_path, "rb"): read in binary mode. The "b" matters; without it, Python would try to decode the bytes as text, which typically fails with a UnicodeDecodeError or silently corrupts the data.
base64.b64encode(f.read()) returns bytes (not a string). The .decode("utf-8") converts those bytes into a Python string we can put in a dict.

The result is a long ASCII string (often tens or hundreds of thousands of characters, depending on the file size) that the API can carry inside a JSON request body.
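To make the size overhead concrete, here is a minimal sketch of the round trip. The helper name encode_for_json and the sample bytes are made up for illustration; only base64.b64encode and base64.b64decode come from the standard library.

```python
import base64


def encode_for_json(raw: bytes) -> str:
    """Return raw bytes as an ASCII base64 string that can sit inside JSON."""
    return base64.b64encode(raw).decode("utf-8")


# Stand-in data: the 8-byte PNG file signature followed by every byte value once.
# A real script would use the bytes read from the image file instead.
sample = b"\x89PNG\r\n\x1a\n" + bytes(range(256))

encoded = encode_for_json(sample)

# Decoding gives back exactly the original bytes, so nothing is lost.
assert base64.b64decode(encoded) == sample

# Every 3 input bytes become 4 output characters, hence the ~33% growth.
print(len(sample), "bytes ->", len(encoded), "characters")
```

The same ratio holds for a real photo, which is why a few hundred kilobytes on disk turns into a noticeably larger request body.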

Media type

ext = image_path.split(".")[-1].lower()
media_type = f"image/{ext}"
if ext == "jpg":
    media_type = "image/jpeg"

The API needs to know what the bytes are. PNG, JPEG, GIF, and WebP all decode differently, and the request format requires you to declare which one you're sending rather than leaving the server to guess.

media_type is a MIME type: "image/png", "image/jpeg", "image/gif", "image/webp". The supported types are the common ones; check the docs for the current list.

The little if ext == "jpg" guard exists because file extensions and MIME types don't quite line up. The extension can be jpg or jpeg, but the MIME type is always image/jpeg. Without that guard, a file named photo.jpg would produce image/jpg, which isn't a valid MIME type, and the API would reject it.

For production code you'd use Python's mimetypes module or a more thorough mapping, but for a tutorial this two-line guard handles the common case.
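One way the "more thorough mapping" could look is an explicit lookup table. This is a sketch, not the code from the episode: the names MEDIA_TYPES and media_type_for are hypothetical, and the table only covers the four types listed above.

```python
from pathlib import Path

# Extension -> MIME type for the four image formats mentioned above.
# This table is an assumption for illustration; check the docs for the current list.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def media_type_for(image_path):
    """Map a file path to a media type, failing loudly on anything unknown."""
    suffix = Path(image_path).suffix.lower()
    if suffix not in MEDIA_TYPES:
        raise ValueError(f"Unsupported image extension: {suffix or '(none)'}")
    return MEDIA_TYPES[suffix]


print(media_type_for("photo.JPG"))
print(media_type_for("holiday.2024.png"))
```

Compared with the two-line guard, this version handles upper-case extensions and stops with a clear error for files that have no extension or an unsupported one, instead of sending a made-up type to the API.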

The multi-modal content list

This is the conceptually new piece, and it's worth pausing on:

"content": [
    {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": image_data,
        },
    },
    {
        "type": "text",
        "text": "Describe this image in detail.",
    },
],

In every previous episode, content was a string:

{"role": "user", "content": "What is Python?"}

When you need to mix media types, content becomes a list of blocks. Each block has a type. Today we use two:

type: "image" with a source describing how the image is delivered (base64 here, or url if you want the API to fetch the image itself).
type: "text" with the actual text prompt.

You can stack multiple blocks in any order, up to the API's per-request limits on image count and size. Image, image, text. Text, image, text, image, text. The model reads them in order and the response is generated against the whole sequence. This is how you'd ask "compare these two images": two image blocks followed by a text block.

The string-content form you've been using is just shorthand for content: [{"type": "text", "text": "..."}]. Once you understand the list form, the string form is a special case.
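The block structure is easy to see when it is built from small helpers. The function names below (text_block, image_block, as_blocks, compare_message) are invented for this sketch; the dictionary shapes are the ones used in the script above.

```python
def text_block(text):
    """A text content block."""
    return {"type": "text", "text": text}


def image_block(image_data, media_type):
    """An image content block carrying base64 data."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": image_data},
    }


def as_blocks(content):
    """Expand the string shorthand into the equivalent list-of-blocks form."""
    if isinstance(content, str):
        return [text_block(content)]
    return list(content)


def compare_message(first_image, second_image, question):
    """A user message with two images followed by the question."""
    return {"role": "user", "content": [first_image, second_image, text_block(question)]}


# Placeholder strings stand in for real base64 data.
message = compare_message(
    image_block("<base64 of sunset.png>", "image/png"),
    image_block("<base64 of city.png>", "image/png"),
    "What are the main differences between these two images?",
)
print([block["type"] for block in message["content"]])
```

The resulting dict can go straight into the messages list of the same messages.create() call.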

Running it

Run the script; from inside Vim, :!python % runs the current file. The script reads sunset.png, encodes it, and sends it to Claude with the question "Describe this image in detail." The response comes back as text, exactly like every other call:

=== Image 1: sunset.png ===
The image shows a vibrant sunset over a calm body of water. The sky is painted in
warm shades of orange, pink, and deep red, with scattered clouds catching the last
light of day. The horizon line is low, with the sun partially submerged, casting a
golden reflection across the rippling water below...

=== Image 2: city.png ===
This is a panoramic view of a dense urban skyline at dusk. Tall glass-and-steel
buildings dominate the foreground, with their windows lit warmly against the
fading blue of the evening sky. A river or harbour is visible at the base of the
buildings, reflecting the city lights...

Two completely different scenes. Two specific, accurate descriptions. Same function, different inputs.

What vision lets you build

A vision-capable API turns whole categories of problems into "send the picture, parse the answer":

OCR-like text extraction. Send a screenshot of a receipt, get back the line items.
Document understanding. Send a scanned PDF page, get a summary or structured fields.
Accessibility. Send a UI screenshot, get an alt-text description.
Visual QA. Send a chart, ask "what's the trend in Q3?" — get an answer.
Moderation. Send user-uploaded images, classify against a policy.
Inventory and identification. Send a photo of a product, get the category, brand, condition.

You don't need a separate computer-vision pipeline for any of these. The same Claude API that wrote a sentence about Python in Episode 1 can analyse a chart, transcribe handwriting, and describe a city skyline. It costs more per call than a text-only call (images take a lot of tokens: by Anthropic's published rule of thumb of width × height / 750, a 1024×1024 image is roughly 1,400 tokens), but the engineering simplification is enormous.
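For budgeting, the per-image cost can be estimated before anything is sent. The sketch below assumes the approximation given in Anthropic's vision documentation at the time of writing (tokens ≈ width × height / 750); the constant may change and actual billing can differ, so treat the result as a ballpark. The function name is hypothetical.

```python
def estimate_image_tokens(width_px, height_px):
    """Rough token estimate for one image: width * height / 750.

    The divisor is an assumption taken from Anthropic's vision docs;
    it is an approximation, not a billing guarantee.
    """
    return round(width_px * height_px / 750)


for width, height in [(512, 512), (1024, 1024), (1092, 1092)]:
    print(f"{width}x{height}: ~{estimate_image_tokens(width, height)} tokens")
```

Multiplying the estimate by the number of images per request gives a quick sense of how much of the bill is vision rather than text.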

Combining vision with everything else

The image block is just another content block. That means it composes with every pattern you've already learned:

System prompts. "You are an OCR assistant. Return only the text visible in the image."
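Multi-turn. Send an image in turn one, then ask follow-up questions in turn two. The image block stays in the message history you resend with each request, so there is nothing to re-encode or re-attach.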
Streaming. Image in, streamed text description out.
Structured output. Image in, JSON out. "Return {detected_objects: [...], confidence: 0..1, primary_subject: ...}."

That last one is huge. Image → JSON is the whole field of "computer vision plumbing" reduced to a single API call with a good system prompt.
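The multi-turn case is worth sketching, because each API call is stateless and the history list is what carries the image into later turns. The helper names and the placeholder strings below are made up for illustration; in a real script the assistant reply would come from response.content[0].text.

```python
def start_conversation(image_data, media_type, question):
    """First user turn: one image block followed by one text block."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_type,
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }
    ]


def add_follow_up(history, assistant_reply, question):
    """Return a new history with the model's reply and the next question appended."""
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": question},
    ]


# Placeholder values; a real script would use the encoded file and the actual reply.
history = start_conversation("<base64 data>", "image/png", "Describe this image in detail.")
history = add_follow_up(history, "A sunset over calm water.", "Which colours dominate the sky?")

print([turn["role"] for turn in history])
```

Passing messages=history on the second call lets the follow-up question refer to the picture, because the first turn, image block included, is still part of the request.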

The URL alternative

We used source.type: "base64" because the image lives on disk. The other option is source.type: "url":

"source": {
    "type": "url",
    "url": "https://example.com/photo.jpg",
}

When you pass a URL, Anthropic's servers fetch the image themselves. Pros: smaller request body, no base64 overhead. Cons: the URL has to be publicly reachable; your private file paths don't work.

Use base64 for files on your machine or generated in memory. Use URLs for images already hosted online.
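That rule of thumb can be folded into one small helper. This is a sketch under the assumptions stated in the comments: the name image_source is hypothetical, and it treats anything starting with http:// or https:// as a hosted image and everything else as a local file path.

```python
import base64
from urllib.parse import urlparse


def image_source(location, media_type=None):
    """Build the "source" dict for an image block from a URL or a local path."""
    # Hosted image: let the API fetch it, no encoding needed.
    if urlparse(location).scheme in ("http", "https"):
        return {"type": "url", "url": location}

    # Local file: the caller has to say what format the bytes are in.
    if media_type is None:
        raise ValueError("media_type is required for local files")

    with open(location, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "base64", "media_type": media_type, "data": data}


print(image_source("https://example.com/photo.jpg"))
```

The returned dict drops into the image block as {"type": "image", "source": image_source(...)}, so the rest of describe_image() stays the same for both cases.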

Common mistakes

Forgetting "rb" mode. Reading a binary file as text corrupts the bytes. Always open(path, "rb") for images.

Forgetting .decode("utf-8"). base64.b64encode() returns bytes; the API expects a string. Without the decode you get a TypeError deep inside the SDK.

Wrong media type. image/jpg is not valid; it has to be image/jpeg. Same for any other format mismatch.

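Sending an oversized image. Vision tokens are not free, and bigger files mean slower uploads and a slower first token. According to Anthropic's docs at the time of writing, very large images are scaled down on the server before the model sees them, so a 4K photo costs bandwidth and latency without adding detail. For most use cases, downscale to ~1024px on the long edge before encoding. Quality is rarely the issue; cost and latency are.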

Putting the question before the image. It still works, but ordering matters: the model attends to context in order, and putting the question last (right before the answer it needs to generate) tends to give better results. Image first, question last.
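For the oversized-image advice above, the arithmetic of "1024px on the long edge" looks like this. The function name fit_long_edge is made up for the sketch, and it only computes the target size; the actual pixel resizing would be done by an imaging library such as Pillow, which is not shown here.

```python
def fit_long_edge(width_px, height_px, max_edge=1024):
    """Return (width, height) scaled so the longer side is at most max_edge.

    Aspect ratio is preserved, and images that already fit are left unchanged.
    """
    long_edge = max(width_px, height_px)
    if long_edge <= max_edge:
        return width_px, height_px

    scale = max_edge / long_edge
    # Round to whole pixels and never let a side collapse to zero.
    return max(1, round(width_px * scale)), max(1, round(height_px * scale))


print(fit_long_edge(3840, 2160))  # a 4K landscape frame
print(fit_long_edge(800, 600))    # already small enough
```

Resizing to the returned dimensions before base64 encoding shrinks both the request body and the time spent uploading it.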

What's next

This episode closes Phase 1. You can now make Claude talk, remember, stream, return data, and see. That's the entire surface area of the API for a single user-facing app.

From Episode 7 we change posture. So far Claude has only been an output — a thing that produces text given input. Phase 2 turns it into an agent of action. We'll define Python functions, describe them in JSON schemas, hand the schemas to Claude as tools, and watch the model decide which function to call and with what arguments. The model becomes a router across your code.

The first one we'll build is a currency converter. Six lines of Python and a JSON schema, and Claude will figure out from a sentence like "Convert 100 USD to SGD" exactly which function to call and what to pass.

Recap

What we did today. Read an image as binary. Encoded it with base64.b64encode(). Wrapped it in a content block with the right media type. Sent it as part of a user message alongside a text block. Parsed the response the same way we have for five episodes. Watched Claude describe two completely different scenes accurately.

You haven't built a vision system. You've added one new content-block type to the same API call you've been making since Episode 1. Everything you already know — system prompts, conversations, streaming, structured output — composes with it.

Next episode: function calling. Where Claude stops being something you prompt and starts being something that acts.

See you in the next one.
