Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework | Episode 22
Video: Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework | Episode 22 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode22
Five questions. Five expected keywords. One percentage. Now you have something you can improve.
There's a moment in every AI project where you change a prompt, run it once, and think the output is better. Maybe it is. Maybe it isn't. Maybe it's better on this one input and worse on five others you didn't test. Without numbers, every change is a coin flip with extra steps.
The fix is evaluation. A small set of test cases with known correct answers (or at least known correct shape) and a script that scores the agent against them. Run it before you change anything. Run it again after. Compare. Now you know.
Today's eval harness is the simplest useful one: a list of {question, expected} dicts (where expected is the list of keywords), an agent function, and a loop that checks whether each expected keyword appears in the agent's answer. Five tests. Pass/fail. A final percentage.
It's crude. It's also enough to start. Real AI teams build elaborate eval pipelines — LLM-as-judge, human ratings, golden datasets, regression dashboards — but they all share the same skeleton, and the skeleton is what we build today.
What we're building
A single evaluate(agent_fn, cases) function and five test cases:
- "What is the capital of France?" → expects ["Paris"]
- "What language is Django written in?" → expects ["Python"]
- "Who created Linux?" → expects ["Linus", "Torvalds"]
- "What does HTML stand for?" → expects ["HyperText", "Markup", "Language"]
- "What year was Python first released?" → expects ["1991"]
For each, we ask the agent, check whether all expected keywords appear in the answer (case-insensitive), and tally pass/fail.
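In code, the five cases above might look like the list below. This is a sketch: the `question` and `expected` keys match what the harness later in this post reads, while the variable name `cases` is an assumption.

```python
# Test cases as plain data: each dict pairs a question with the
# keywords that must ALL appear in the agent's answer.
cases = [
    {"question": "What is the capital of France?", "expected": ["Paris"]},
    {"question": "What language is Django written in?", "expected": ["Python"]},
    {"question": "Who created Linux?", "expected": ["Linus", "Torvalds"]},
    {"question": "What does HTML stand for?",
     "expected": ["HyperText", "Markup", "Language"]},
    {"question": "What year was Python first released?", "expected": ["1991"]},
]

print(f"{len(cases)} test cases loaded")
```

Keeping the cases as data rather than code means adding a test is a one-line change.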
The agent
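import anthropic

# The client reads the API key from the ANTHROPIC_API_KEY environment variable.
client = anthropic.Anthropic()

def agent(question):
    # Send the question as a single user message and return the text reply.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        system="You are a helpful assistant. Answer concisely in one or two sentences.",
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text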
A plain agent — same shape as Episode 1. The eval works on any function with this signature: take a string, return a string. We can swap in fancier agents (RAG, tool-using, multi-agent) and the eval doesn't change.
This decoupling matters. Eval treats the agent as a black box. As long as the function takes a question and returns an answer, it can be scored.
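One practical consequence of the black-box design: you can try out the eval itself without spending API calls by passing in a stand-in agent. The sketch below is hypothetical (`fake_agent` and `CANNED_ANSWERS` are not part of the episode code); it only shows that anything with the string-in, string-out signature will do.

```python
# Hypothetical stand-in agent: same signature (str -> str), no API call.
CANNED_ANSWERS = {
    "What is the capital of France?": "The capital of France is Paris.",
    "Who created Linux?": "Linux was created by Linus Torvalds.",
}

def fake_agent(question):
    # Fall back to a fixed reply for questions with no canned answer.
    return CANNED_ANSWERS.get(question, "I don't know.")

print(fake_agent("What is the capital of France?"))
print(fake_agent("What is the airspeed of a swallow?"))
```

A stub like this is useful for checking that the harness reports passes and failures correctly before pointing it at a real model.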
The harness
def evaluate(agent_fn, cases):
    passed = 0
    failed = 0

    for i, case in enumerate(cases):
        question = case["question"]
        expected = case["expected"]
        answer = agent_fn(question)

        # Sort each expected keyword into found or missing
        # using a case-insensitive substring check.
        found = []
        missing = []
        for keyword in expected:
            if keyword.lower() in answer.lower():
                found.append(keyword)
            else:
                missing.append(keyword)

        # A case passes only when no keyword is missing.
        status = "PASS" if not missing else "FAIL"
        if status == "PASS":
            passed += 1
        else:
            failed += 1

        print(f"Test {i + 1}: {question}")
        print(f" Answer: {answer[:60]}...")
        print(f" {status} — found {found}" + (f" missing {missing}" if missing else ""))

    # Guard against an empty case list before computing the percentage.
    total = passed + failed
    score = (passed / total) * 100 if total > 0 else 0
    print(f"Results: {passed}/{total} passed ({score:.0f}%)")
Loop through the test cases. For each, call the agent. Check whether every expected keyword appears (case-insensitive substring match). Tally. Print a summary.
That's the whole evaluation. No magic. No framework. Twenty-five lines that turn a vibes-based opinion into a percentage.
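The core check is small enough to isolate and try on its own. The helper below is a sketch, not part of the episode code (`missing_keywords` is a name chosen for illustration); it applies the same case-insensitive substring test the harness uses.

```python
def missing_keywords(answer, expected):
    """Return the expected keywords that do not appear in the answer."""
    haystack = answer.lower()
    return [kw for kw in expected if kw.lower() not in haystack]

print(missing_keywords("Linux was created by Linus Torvalds.", ["Linus", "Torvalds"]))  # → []
print(missing_keywords("It was created by Linus.", ["Linus", "Torvalds"]))  # → ['Torvalds']
print(missing_keywords("JavaScript runs in the browser.", ["Java"]))  # → []
```

The last call shows a limit of substring matching: "Java" counts as found inside "JavaScript". Pick keywords that are unlikely to appear inside longer, unrelated words.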
Why keyword matching is enough to start
There are sophisticated ways to evaluate AI output:
- Have a powerful model (Claude itself) judge whether the answer is correct.
- Compute a similarity score between the answer and a reference answer.
- Have humans rate each answer on a 1–5 scale.
- Run BLEU / ROUGE / BERTScore for translation-like tasks.
All of these are worth using when you need them. But for a first eval, keyword matching wins on four counts:
It's transparent. You can read the test case and predict whether it'll pass. If a test fails, you know exactly why ("Linus not found in answer").
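It's deterministic. Score the same answer twice and you get the same result. The agent's own output can still vary between runs, but the scoring adds no noise of its own; LLM-as-judge does.
It's cheap. Scoring needs no API calls beyond the agent's own and no human rater pool, so the only cost of a bigger eval set is more agent calls.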
It catches the easy regressions. "What's the capital of France?" should always say "Paris". If your refactor accidentally broke that, keyword matching catches it. The fancy techniques catch more, but they're not faster than this for the basics.
For most AI projects, start with keyword tests. Add fancier techniques only when you've outgrown keyword matching.
Patterns for writing test cases
A few rules of thumb for designing the cases:
One test per behaviour, not one test per feature. "Capital of France" tests basic factual recall. "What does HTML stand for?" tests acronym expansion. "Who created Linux?" tests person + last-name retrieval. Each tests something different.
Multiple keywords for multi-part answers. Notice ["HyperText", "Markup", "Language"] — all three must appear. This catches answers that drift from the canonical form. Without all three, the answer might be vague.
Avoid ambiguous correctness. "What's the best programming language?" has no expected answer. Skip such questions; eval requires ground truth.
Mix easy and hard cases. Easy cases catch regressions; hard cases push the agent. A test set that always scores 100% can show you that something broke, but it can never show you that something improved.
Cover failure modes you care about. If your agent should refuse certain questions, test that it refuses. Add cases where the expected answer is "I don't know" or contains a "decline" keyword.
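Refusal cases need a slightly different check from the harness above, because a refusal can be phrased several ways and any one of them should count. The sketch below is an assumption about how that might look: the marker phrases and the `looks_like_refusal` name are illustrative, not taken from the episode code.

```python
# Assumed refusal phrasings; tune these to how your agent actually declines.
REFUSAL_MARKERS = ["can't help", "cannot help", "unable to assist", "i don't know"]

def looks_like_refusal(answer, markers=REFUSAL_MARKERS):
    """True if ANY marker appears (the main harness requires ALL keywords)."""
    # Normalise curly apostrophes so "can’t" and "can't" match alike.
    text = answer.lower().replace("\u2019", "'")
    return any(marker in text for marker in markers)

print(looks_like_refusal("Sorry, I can't help with that."))
print(looks_like_refusal("Sure, here is how you do it."))
```

The design choice is any-of rather than all-of matching: one recognised refusal phrase is enough to pass the case.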
Running it
Agent Evaluation
============================================================
Test 1: What is the capital of France?
Answer: The capital of France is Paris.
PASS — found ['Paris']
Test 2: What language is Django written in?
Answer: Django is a web framework written in Python.
PASS — found ['Python']
Test 3: Who created Linux?
Answer: Linux was created by Linus Torvalds in 1991.
PASS — found ['Linus', 'Torvalds']
Test 4: What does HTML stand for?
Answer: HTML stands for HyperText Markup Language.
PASS — found ['HyperText', 'Markup', 'Language']
Test 5: What year was Python first released?
Answer: Python was first released in 1991.
PASS — found ['1991']
============================================================
Results: 5/5 passed (100%)
All tests passed!
5 of 5. 100%. A baseline. Now if you change the agent (switch models, modify the system prompt, add a guardrail) and the score drops, you know the change introduced a regression.
What this enables
Once you can score the agent, several things become possible:
A/B testing. Run two variants on the same eval. Pick the one with the higher score.
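A minimal sketch of such a comparison, assuming a scorer that returns a number instead of printing. `pass_rate`, `variant_a`, and `variant_b` are hypothetical names; the variants are stubs standing in for two prompts or models.

```python
def pass_rate(agent_fn, cases):
    """Fraction of cases where every expected keyword appears in the answer."""
    passed = 0
    for case in cases:
        answer = agent_fn(case["question"]).lower()
        if all(kw.lower() in answer for kw in case["expected"]):
            passed += 1
    return passed / len(cases) if cases else 0.0

# Hypothetical variants; real ones would call a model with different prompts.
def variant_a(question):
    return "Linux was created by Linus Torvalds."

def variant_b(question):
    return "Linux was created by Linus."

cases = [{"question": "Who created Linux?", "expected": ["Linus", "Torvalds"]}]
scores = {"A": pass_rate(variant_a, cases), "B": pass_rate(variant_b, cases)}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

With real agents the answers vary between runs, so a small gap on a small eval set is weak evidence; run each variant more than once before picking a winner.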
Regression detection. Run the eval in CI on every change. Refuse to merge if the score drops.
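One way to wire that into CI is to turn the score into a process exit code, since most CI systems fail a job on a non-zero exit. The sketch below is an assumption, not the episode's code: `BASELINE` and `ci_exit_code` are illustrative names, and the threshold is a placeholder.

```python
BASELINE = 0.80  # assumed threshold; set it from your own last known-good run

def ci_exit_code(score, baseline=BASELINE):
    """Return a process exit code: 0 if the score holds, 1 if it regressed."""
    return 0 if score >= baseline else 1

print(ci_exit_code(1.00))  # → 0
print(ci_exit_code(0.60))  # → 1
```

In an actual CI script you would pass the result to `sys.exit(...)` after running the eval, so a regression blocks the merge.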
Iterative improvement. Identify the failing tests. Look at the answers. Tweak the prompt or tools. Re-run. The failing list shrinks (or it doesn't, and you know your change didn't help).
Cost vs. quality trade-offs. Run the eval on a cheaper model. If the score drops by 2 points but cost drops by 80%, you may take the trade. Without numbers, you can't make that call.
The percentage isn't the whole story — which tests pass and fail matters more than the average — but having the percentage at all is a step change in how you reason about the agent.
Where to go from here
Once keyword matching becomes your floor, layer on:
LLM-as-judge. Send each (question, expected_answer, agent_answer) triple to a more powerful model and ask "is this answer correct?" Useful when the right answer can be phrased many ways.
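A rough sketch of the judge pattern, under stated assumptions: the prompt wording, the one-word CORRECT/INCORRECT protocol, and the names `build_judge_prompt`, `judge`, and `stub_judge` are all illustrative. The judge model is passed in as a plain string-to-string function so the example runs offline with a stub.

```python
def build_judge_prompt(question, expected_answer, agent_answer):
    # Assumed prompt format; adjust the wording to suit your judge model.
    return (
        "You are grading an AI assistant's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {expected_answer}\n"
        f"Assistant's answer: {agent_answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )

def judge(question, expected_answer, agent_answer, judge_fn):
    """judge_fn is any str -> str function, e.g. a wrapper around a model call."""
    verdict = judge_fn(build_judge_prompt(question, expected_answer, agent_answer))
    return verdict.strip().upper().startswith("CORRECT")

# Stub judge so the sketch runs offline; a real one would call a model.
def stub_judge(prompt):
    return "CORRECT"

print(judge("Who created Linux?", "Linus Torvalds", "Torvalds wrote it.", stub_judge))
```

Forcing a one-word verdict keeps parsing trivial, at the cost of losing the judge's reasoning.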
Reference answer similarity. Compute embedding similarity between the agent's answer and a hand-written reference. Catches answers that are close but not exact.
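The scoring side of this is cosine similarity plus a threshold. The sketch below uses a toy bag-of-words vector in place of a real embedding model, which is a loud simplification: `toy_embed`, `cosine`, `similar`, and the 0.5 threshold are all assumptions for illustration.

```python
import math
from collections import Counter

def toy_embed(text):
    """Toy bag-of-words vector (an assumption; real systems use an embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    # Dot product over shared words, divided by the product of vector lengths.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar(answer, reference, threshold=0.5, embed=toy_embed):
    return cosine(embed(answer), embed(reference)) >= threshold

reference = "the capital of france is paris"
print(similar("paris is the capital of france", reference))
print(similar("bananas are yellow", reference))
```

Swapping `toy_embed` for a real embedding function leaves `cosine` and `similar` unchanged as long as it returns a comparable vector type; the threshold would need tuning on your own data.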
Multi-criterion scoring. Correctness, conciseness, tone, citation — each gets its own score. Aggregated into a profile, not a single number.
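As a sketch of what a profile might look like, the function below scores two criteria only: keyword correctness and a word-count check for conciseness. The name `score_profile` and the 40-word limit are assumptions; tone and citation would need their own checks.

```python
def score_profile(answer, expected, max_words=40):
    """Score one answer on several criteria and return them separately."""
    text = answer.lower()
    hits = sum(1 for kw in expected if kw.lower() in text)
    return {
        # Share of expected keywords found, from 0.0 to 1.0.
        "correctness": hits / len(expected) if expected else 0.0,
        # Assumed conciseness rule: at most max_words words.
        "concise": len(answer.split()) <= max_words,
    }

profile = score_profile(
    "HTML stands for HyperText Markup Language.",
    ["HyperText", "Markup", "Language"],
)
print(profile)  # → {'correctness': 1.0, 'concise': True}
```

Returning a dict rather than one number keeps the criteria visible, so a drop in one is not hidden by a gain in another.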
Test cases with expected refusals. "How do I hack into a system?" — expected behaviour is refusal. Keyword test for "can't help with that" or similar.
Real user data. Once you have a deployed product, harvest actual queries (with consent and anonymisation) and add them to the eval set. The best test cases are the ones your users actually ask.
Common mistakes
Eval set too small. Five tests catch only the most obvious failures. Production eval sets often run to hundreds of cases.
Eval set too biased. If all five tests are factual recall, you're not testing reasoning, refusal, formatting, or anything else.
Test cases that aren't fully grounded. "Best language for AI?" has no canonical answer. Skip ambiguous tests.
Updating expected keywords until tests pass. This is moving the goalposts. Define the expected behaviour first; if tests fail, fix the agent, not the test.
No test for refusal cases. Agents need to refuse certain inputs. Test those refusals.
Ignoring partial passes. If a test misses one keyword out of three, that's information, not just a failure. Look at which keyword was missed; it tells you what the agent struggles with.
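One way to make that information visible is to tally which keywords go missing across the whole run. The sketch below assumes you have collected (question, missing keywords) pairs from an eval run; `missing_tally` and the sample `results` are illustrative, not real output.

```python
from collections import Counter

def missing_tally(results):
    """results: list of (question, missing_keywords) pairs from an eval run."""
    tally = Counter()
    for _question, missing in results:
        tally.update(missing)
    return tally

# Made-up example data, for illustration only.
results = [
    ("Who created Linux?", ["Torvalds"]),
    ("What does HTML stand for?", ["HyperText"]),
    ("Who maintains the Linux kernel?", ["Torvalds"]),
    ("What is the capital of France?", []),
]
print(missing_tally(results).most_common())  # → [('Torvalds', 2), ('HyperText', 1)]
```

A keyword that keeps turning up in the tally points at either a consistent gap in the agent or an over-strict test case.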
What's next
Next episode: the full-featured CLI agent. A capstone for Phase 4 — an interactive command-line agent that combines tools (Phase 2), structured output (Phase 1), guardrails (Episode 21), and conversation memory (Episode 3). The whole AI app you've been building, end to end.
Recap
What we did today. Wrote a plain agent and a 25-line evaluation harness with five test cases. Used case-insensitive keyword matching to score answers. Ran the eval, got 5/5, and discussed why this is enough to start even though it's far from sophisticated. Outlined the more advanced eval techniques you graduate to when keyword matching isn't enough.
You haven't built a metrics dashboard. But you've taken your AI from "I think this is better" to "the score went from 78% to 84%." That's the difference between hoping and shipping.
Next episode: full-featured CLI agent. See you in the next one.