Build AI Apps with Python: How AI Understands Meaning — Embeddings | Episode 14
Video: Build AI Apps with Python: How AI Understands Meaning — Embeddings | Episode 14 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode14
Text in. List of numbers out. Two similar sentences land near each other in number-space.
In Episode 13 we cut a long document into chunks. Now we need a way to ask "which of these chunks is most relevant to a user's question?" — without using Claude on every chunk for every query (which would be ruinously expensive). We need a fast, mathematical way to compare text.
That tool is the embedding. An embedding model is a function, a small specialised model, that takes a string of text and returns a fixed-length list of numbers, called a vector. That vector is the text's embedding. The magic property of these vectors: two pieces of text that mean similar things produce vectors that are close together in space.
Once you have embeddings, similarity becomes arithmetic. Compare two questions: their cosine similarity tells you how related they are. Compare a question to all the chunks in your knowledge base: pick the chunks with the highest similarity. That's retrieval.
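As a minimal sketch of that retrieval step, here is the "pick the highest similarity" idea in code. The three-number vectors are made up by hand purely for illustration (real embeddings come from a model and have hundreds of dimensions), the office-hours and expenses chunks are invented, and top_k is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, chunk_text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:k]

# Hand-made toy vectors, NOT real model output.
chunks = {
    "Annual leave: 18 days per year": [0.9, 0.1, 0.0],
    "Office opening hours": [0.1, 0.9, 0.1],
    "Expense claims policy": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stand-in for an embedded question about leave

results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With these toy numbers the leave chunk ranks first because its vector points in nearly the same direction as the query.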
What we're building
A tiny script that:
- Loads a sentence-transformer model (all-MiniLM-L6-v2: small, fast, free, runs locally).
- Embeds four sentences: two about a cat resting on something, one about programming, one about weather.
- Computes the cosine similarity between every pair.
- Prints the scores so you can see the pairs that mean similar things score higher.
By the end you'll have a clear, intuition-building picture: the cat-and-kitten sentences score high together; either of them paired with "Python is a programming language" scores low.
What an embedding looks like
An embedding is just a list of floats:
embed("The cat sat on the mat")
# → [0.0312, -0.0541, 0.1102, ..., -0.0073] # 384 numbers
The all-MiniLM-L6-v2 model produces 384-dimensional vectors. Other models produce 768, 1024, 1536, even higher. The dimension is a fixed property of the model.
What do those numbers mean individually? Nothing useful to a human. The model learned during training to position similar concepts near each other in 384-dimensional space, and the coordinates are the result. Don't try to interpret a single number; the meaning is in the direction the vector points.
Cosine similarity
How do you measure "closeness" between two vectors? The standard tool is cosine similarity:
import math

def cosine_similarity(a, b):
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(x * x for x in b))
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
    return dot_product / (magnitude_a * magnitude_b)
The formula gives the cosine of the angle between the two vectors: 1.0 if they point in the same direction (as related as two texts can be), 0.0 if perpendicular (unrelated), -1.0 if opposite. In practice, embeddings of normal text rarely go far below zero. In this episode's run, unrelated pairs score around 0.1 and the related pair scores 0.749.
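To see those three reference values, here is a small check on hand-picked two-dimensional vectors. The vectors are chosen only to make the geometry obvious and are not embeddings.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same direction: the second vector is the first one doubled.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular: one lies on the x-axis, the other on the y-axis.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite: the second vector is the first one negated.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same, 6), round(perpendicular, 6), round(opposite, 6))
```

Floating-point rounding can leave the first and last results a hair away from exactly 1 and -1, which is why the print rounds them.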
We could use Euclidean distance (straight-line distance between the vectors), but cosine is more robust to magnitude differences. Two sentences with similar content but different lengths produce vectors that point similarly but might have different magnitudes. Cosine ignores magnitude and looks at direction.
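A quick way to see the difference is to stretch a vector without turning it. In this sketch the vector values are arbitrary and euclidean_distance is a helper written for the illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

v = [0.3, 0.4, 0.5]
v_scaled = [10 * x for x in v]  # same direction, ten times the magnitude

cos = cosine_similarity(v, v_scaled)
dist = euclidean_distance(v, v_scaled)
print(f"cosine: {cos:.3f}  euclidean: {dist:.3f}")
```

Cosine reports the two vectors as pointing the same way, while the Euclidean distance is large purely because of the length difference.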
scikit-learn ships a cosine_similarity function and NumPy reduces it to a one-liner, but writing it from scratch in 7 lines makes it concrete. There's no magic.
Loading the model
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(text):
    # encode() returns a NumPy array; convert it to a plain list of floats.
    return model.encode(text).tolist()
The first time you run this, sentence-transformers downloads the all-MiniLM-L6-v2 model from Hugging Face — about 80MB. After that it's cached locally and the script starts in a couple of seconds.
Why this model? Three reasons:
- Small. 22 million parameters. Loads fast, runs fast.
- Local. No API key, no per-call cost, no network latency.
- Good enough. For most semantic-search use cases, it works well. Not the absolute best, but excellent value.
For higher quality you can swap in larger models: all-mpnet-base-v2 (110M params, slower but more accurate), bge-large-en-v1.5 (335M params, top-tier quality), or paid hosted services like OpenAI's or Voyage AI's embedding APIs. (Anthropic does not offer an embedding model of its own; its documentation points to Voyage AI.) The interface is the same: text in, list of floats out. We'll stay local for the tutorial.
Running it
Run it with :!python % from inside Vim (or python plus the file name in a terminal). The script embeds four sentences and prints all six pairwise similarity scores.
Vector size: 384 dimensions
=== Similarity scores ===
0.749 'The cat sat on the mat'
vs 'A kitten rested on the rug'
0.117 'The cat sat on the mat'
vs 'Python is a programming language'
0.083 'The cat sat on the mat'
vs 'The weather is sunny today'
0.142 'A kitten rested on the rug'
vs 'Python is a programming language'
0.069 'A kitten rested on the rug'
vs 'The weather is sunny today'
0.054 'Python is a programming language'
vs 'The weather is sunny today'
The two sentences about cats score 0.749 — the model recognises that cat sat on mat and kitten rested on rug are saying very similar things, even though they share almost no words. ("cat" and "kitten", "sat" and "rested", "mat" and "rug" — different words, related concepts.)
The cat sentences against the programming or weather sentences score around 0.1 — close to unrelated.
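The six scores come from comparing every unordered pair of the four sentences. A minimal sketch of that loop follows. To keep it runnable without downloading the model, it uses hand-made three-number stand-ins instead of the real 384-dimensional embeddings, so its scores will not match the output above. The episode's actual script may be structured differently.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vectors, NOT model output. In the real script each value
# would be embed(sentence).
vectors = {
    "The cat sat on the mat": [0.9, 0.1, 0.1],
    "A kitten rested on the rug": [0.8, 0.2, 0.1],
    "Python is a programming language": [0.1, 0.9, 0.1],
    "The weather is sunny today": [0.1, 0.1, 0.9],
}

pairs = list(combinations(vectors, 2))  # 4 sentences give 6 unordered pairs
scores = {}
for first, second in pairs:
    score = cosine_similarity(vectors[first], vectors[second])
    scores[(first, second)] = score
    print(f"{score:.3f} {first!r}\n      vs {second!r}")
```

itertools.combinations yields each pair once, so a sentence is never compared with itself and no pair is counted twice.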
This is the property RAG depends on. "How many days of leave do I get?" embeds close to "Annual leave: 18 days per year" even though the phrasing is completely different and only two words, "days" and "leave", overlap. Vector search finds the right chunk largely regardless of word choice.
The training trick (in one paragraph)
How does the model know that cat and kitten should be near each other? It was trained on hundreds of millions of sentence pairs labelled by humans (or harvested from web data) as related or unrelated. During training, the model adjusts itself so that related pairs produce vectors with high cosine similarity, and unrelated pairs produce low similarity. After enough examples, the model generalises to text it has never seen.
That's the whole training idea in a paragraph. Embedding models have a much simpler job than generative language models: one pass over the text, one vector out. They're also much smaller, faster, and cheaper to run, which is why we can comfortably embed millions of chunks locally for next to nothing.
What this enables
Once you can embed text and compare embeddings, you have:
- Semantic search: "find documents related to my query" without keyword overlap.
- Clustering: group similar items.
- Deduplication: detect near-duplicate text by similarity threshold.
- Recommendation: "users who liked X also liked Y" based on item embeddings.
- Classification: assign a class by similarity to labelled examples.
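As one example from the list above, deduplication by similarity threshold can be sketched in a few lines. The vectors are hand-made toy values, the item texts are invented, deduplicate is a hypothetical helper, and the 0.95 threshold is an assumption you would tune on your own data.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def deduplicate(items, threshold=0.95):
    """Keep an item only if it is below the threshold against every kept item."""
    kept = []
    for text, vec in items:
        if all(cosine_similarity(vec, kept_vec) < threshold
               for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

# Toy (text, vector) pairs, NOT real embeddings.
items = [
    ("Reset your password", [0.9, 0.1, 0.0]),
    ("How to reset a password", [0.88, 0.12, 0.01]),
    ("Shipping times", [0.0, 0.1, 0.9]),
]

unique = deduplicate(items)
print(unique)
```

This compares every item against every kept item, which is fine for small lists. For large collections a vector database does the same nearest-neighbour work far faster.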
Episode 15 turns this primitive into a database. Instead of computing similarities ourselves, we'll let ChromaDB store thousands of embeddings and answer "give me the top-K most similar items to this query" in milliseconds.
Common mistakes
Comparing embeddings from different models. Vectors from all-MiniLM-L6-v2 are not comparable to vectors from bge-large. They live in different spaces. Always use the same model for both your indexed documents and your query.
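One cheap guard, sketched below, is to refuse vectors of different lengths. Plain zip() stops at the shorter input, so the from-scratch function would otherwise quietly compare only the first 384 numbers of a longer vector. The name cosine_similarity_checked is hypothetical and the vectors are placeholders. Note that this only catches the obvious case: two different models can share a dimension and still be incomparable.

```python
import math

def cosine_similarity_checked(a, b):
    """Cosine similarity that rejects vectors of different lengths."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Placeholder values standing in for vectors from two different models.
small_model_vec = [0.1] * 384
large_model_vec = [0.1] * 1024

message = None
try:
    cosine_similarity_checked(small_model_vec, large_model_vec)
except ValueError as err:
    message = str(err)
    print(message)
```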
Not normalising when needed. Cosine similarity already normalises by magnitude. If you use Euclidean distance instead, normalise first or your magnitude differences will dominate.
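A minimal sketch of what normalising first looks like, using two arbitrary hand-picked vectors that point almost the same way but differ about tenfold in length. normalise and euclidean_distance are helpers written for the illustration.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return list(v)
    return [x / norm for x in v]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [3.0, 4.0]
b = [30.0, 40.5]  # nearly the same direction, roughly ten times longer

raw = euclidean_distance(a, b)
unit_a, unit_b = normalise(a), normalise(b)
unit = euclidean_distance(unit_a, unit_b)

# For unit vectors the dot product is the cosine similarity.
cos = sum(x * y for x, y in zip(unit_a, unit_b))
print(f"raw: {raw:.3f}  normalised: {unit:.5f}  cosine: {cos:.5f}")
```

After normalisation the squared Euclidean distance equals 2 - 2 * cosine, so ranking by distance and ranking by cosine similarity give the same order.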
Embedding too much text per chunk. Most embedding models have an input cap (often 256–512 tokens). Text past that gets truncated, and the embedding represents only the head. Keep chunks within the model's limit.
Treating embeddings like language-model knowledge. Embeddings encode similarity, not facts. They don't know the answer to "how many days of leave". They just know which chunk is most likely to contain it. The generation step (Claude) does the answering.
Re-embedding the same text repeatedly. Embeddings are stable for a given model. Compute them once, store them, reuse them. Especially important when you're paying per call for a hosted embedding API.
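A simple way to avoid recomputation is to cache by a hash of the text. The sketch below is an in-memory version. EmbeddingCache is a hypothetical class, and fake_embed is a stand-in that only counts calls; in the real script you would pass the embed function instead. A production cache would also persist to disk and include the model name in the key.

```python
import hashlib

class EmbeddingCache:
    """Remember embeddings so each distinct text is embedded only once."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
        return self._store[key]

calls = []

def fake_embed(text):
    # Stand-in for a real embedding call; records how often it runs.
    calls.append(text)
    return [float(len(text)), 0.0]

cache = EmbeddingCache(fake_embed)
first = cache.get("The cat sat on the mat")
second = cache.get("The cat sat on the mat")
print(len(calls))
```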
What's next
Next episode: vector store with ChromaDB. Today we computed similarity by hand, comparing four sentences. With thousands of chunks, you need a database. ChromaDB stores embeddings, indexes them for fast nearest-neighbour search, and lets you query by similarity in one line.
Recap
What we did today. Loaded a small local embedding model. Wrote cosine_similarity() from scratch in seven lines. Embedded four sentences and printed the six pairwise scores. Saw that two sentences about cats, sharing almost no words, landed near each other while a sentence about programming landed far away. Established the property that makes vector search possible.
You haven't built retrieval yet. You've built the comparison primitive that retrieval is made of.
Next episode: vector store. See you in the next one.