
Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework | Episode 22

Celest Kim

Video: Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework | Episode 22 by Taught by Celeste AI - AI Coding Coach


Build AI Apps with Python: Test Your AI Agent — Keyword Matching Eval Framework

Testing your AI agent ensures it performs as expected before deployment. This example shows how to create a simple evaluation framework that runs your agent on predefined test cases, checks for expected keywords in the output, and reports pass/fail results along with an overall score.

Code

def agent(query):
  # Simulated AI agent responses for demonstration
  responses = {
    "capital of France": "The capital of France is Paris.",
    "Django language": "Django is a web framework written in Python.",
    "Linux creator": "Linux was created by Linus Torvalds."
  }
  return responses.get(query, "I don't know the answer.")

def evaluate(agent, tests):
  passed = 0
  total = len(tests)
  for i, (query, keywords) in enumerate(tests, 1):
    response = agent(query)
    # Check if all keywords appear case-insensitively in the response
    if all(keyword.lower() in response.lower() for keyword in keywords):
      print(f"Test {i}: PASS - '{query}'")
      passed += 1
    else:
      print(f"Test {i}: FAIL - '{query}'")
      print(f"  Expected keywords: {keywords}")
      print(f"  Agent response: {response}")
  score = passed / total * 100
  print(f"\nFinal score: {passed}/{total} tests passed ({score:.1f}%)")

# Define test cases: query mapped to list of expected keywords
test_cases = [
  ("capital of France", ["Paris"]),
  ("Django language", ["Python"]),
  ("Linux creator", ["Linus", "Torvalds"]),
  ("unknown question", ["Mars"]),    # This will fail: the fallback "I don't know the answer." lacks the keyword
  ("capital of France", ["paris"])   # Case-insensitive check
]

evaluate(agent, test_cases)
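The function above only prints results; for automation it helps to return the score as a value so a script can gate deployment on it. Here is a minimal sketch of that variation (the `evaluate_with_threshold` name and the 80% default cutoff are assumptions for illustration, not part of the original framework):

```python
def agent(query):
  # Same simulated agent as above, trimmed to one known query
  responses = {"capital of France": "The capital of France is Paris."}
  return responses.get(query, "I don't know the answer.")

def evaluate_with_threshold(agent, tests, threshold=80.0):
  # Count tests where every expected keyword appears (case-insensitively)
  passed = sum(
    1 for query, keywords in tests
    if all(k.lower() in agent(query).lower() for k in keywords)
  )
  score = passed / len(tests) * 100
  # Fail loudly if the suite scores below the deploy threshold
  if score < threshold:
    raise SystemExit(f"Eval score {score:.1f}% is below {threshold}%")
  return score

score = evaluate_with_threshold(agent, [("capital of France", ["Paris"])])
print(f"Score: {score:.1f}%")
```

A non-zero exit from `SystemExit` makes the check easy to drop into a shell pipeline or CI step.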

Key Points

  • Define test cases with expected keywords to verify agent responses.
  • Run the agent on each query and check for keyword presence case-insensitively.
  • Report pass or fail for each test along with detailed feedback on mismatches.
  • Calculate an overall score to track agent performance over time.
  • Use this framework to catch regressions before deploying your AI agent.
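The regression-catching idea in the last point can be wired into a standard test runner. Below is a sketch using Python's built-in `unittest`; the inlined stub agent and the test names are illustrative assumptions — in practice you would import your real agent instead:

```python
import unittest

def agent(query):
  # Stand-in for your real agent; import yours in a real test module
  responses = {
    "capital of France": "The capital of France is Paris.",
    "Linux creator": "Linux was created by Linus Torvalds."
  }
  return responses.get(query, "I don't know the answer.")

class AgentKeywordTests(unittest.TestCase):
  CASES = [
    ("capital of France", ["Paris"]),
    ("Linux creator", ["Linus", "Torvalds"]),
  ]

  def test_keywords_present(self):
    for query, keywords in self.CASES:
      response = agent(query)
      for keyword in keywords:
        # Case-insensitive containment check, as in the framework above
        self.assertIn(keyword.lower(), response.lower())
```

Run it with `python -m unittest` to get a pass/fail report that CI systems understand out of the box.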