Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework | Episode 22
Video by Celeste AI - AI Coding Coach
Testing your AI agent ensures it performs as expected before deployment. This example shows how to create a simple evaluation framework that runs your agent on predefined test cases, checks for expected keywords in the output, and reports pass/fail results along with an overall score.
Code
def agent(query):
    # Simulated AI agent responses for demonstration
    responses = {
        "capital of France": "The capital of France is Paris.",
        "Django language": "Django is a web framework written in Python.",
        "Linux creator": "Linux was created by Linus Torvalds."
    }
    return responses.get(query, "I don't know the answer.")

def evaluate(agent, tests):
    passed = 0
    total = len(tests)
    for i, (query, keywords) in enumerate(tests, 1):
        response = agent(query)
        # Check whether every expected keyword appears case-insensitively
        if all(keyword.lower() in response.lower() for keyword in keywords):
            print(f"Test {i}: PASS - '{query}'")
            passed += 1
        else:
            print(f"Test {i}: FAIL - '{query}'")
            print(f"  Expected keywords: {keywords}")
            print(f"  Agent response: {response}")
    score = passed / total * 100
    print(f"\nFinal score: {passed}/{total} tests passed ({score:.1f}%)")

# Define test cases: each query maps to a list of expected keywords
test_cases = [
    ("capital of France", ["Paris"]),
    ("Django language", ["Python"]),
    ("Linux creator", ["Linus", "Torvalds"]),
    ("unknown question", ["Mars"]),    # Fails: agent answers "I don't know the answer."
    ("capital of France", ["paris"])   # Case-insensitive check
]

evaluate(agent, test_cases)
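If you want to use the results programmatically rather than reading printed output, a small variation is to have the evaluation return its counts. This is a sketch under assumptions, not part of the original example; the helper names `keyword_match` and `evaluate_scored` and the stub agent are hypothetical:

```python
def keyword_match(response, keywords):
    # A test passes when every expected keyword appears
    # case-insensitively in the agent's response.
    return all(k.lower() in response.lower() for k in keywords)

def evaluate_scored(agent, tests):
    # Same keyword check as above, but returns (passed, total)
    # so callers can act on the result instead of parsing stdout.
    results = [keyword_match(agent(q), kws) for q, kws in tests]
    return sum(results), len(results)

# Example with a stub agent that always answers about Paris
stub = lambda q: "The capital of France is Paris."
passed, total = evaluate_scored(stub, [
    ("capital of France", ["Paris"]),
    ("capital of France", ["paris"]),   # case-insensitive, so it passes
    ("Linux creator", ["Torvalds"]),    # fails against this stub
])
print(f"{passed}/{total}")  # → 2/3
```

Returning a tuple instead of printing keeps the scoring logic reusable, for example inside a larger test suite.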
Key Points
- Define test cases with expected keywords to verify agent responses.
- Run the agent on each query and check for keyword presence case-insensitively.
- Report pass or fail for each test along with detailed feedback on mismatches.
- Calculate an overall score to track agent performance over time.
- Use this framework to catch regressions before deploying your AI agent.
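The last point, catching regressions before deployment, can be wired into a script or CI step by failing the run when the score drops below a threshold. A minimal sketch, assuming a 90% quality bar and a stub agent (both are illustrative choices, not from the original):

```python
import sys

def score(agent, tests):
    # Fraction of tests whose expected keywords all appear
    # (case-insensitively) in the agent's response.
    hits = sum(
        all(k.lower() in agent(q).lower() for k in kws)
        for q, kws in tests
    )
    return hits / len(tests)

THRESHOLD = 0.9  # assumed quality bar; tune for your agent

def gate(agent, tests, threshold=THRESHOLD):
    s = score(agent, tests)
    print(f"score = {s:.0%} (threshold {threshold:.0%})")
    if s < threshold:
        # A non-zero exit code fails the CI job, blocking deployment
        sys.exit(1)

# Example: a stub agent whose answers contain every expected keyword
perfect = lambda q: "Paris Python Linus Torvalds"
gate(perfect, [
    ("capital of France", ["Paris"]),
    ("Django language", ["Python"]),
])
```

Tracking this score across runs gives a simple regression signal: a drop below the threshold means a recent change hurt the agent's answers.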