Build AI Apps with Python: Safe AI Agents — Input Validation and Output Filtering | Episode 21
Video: Build AI Apps with Python: Safe AI Agents — Input Validation and Output Filtering | Episode 21 by Taught by Celeste AI - AI Coding Coach
Student code: github.com/GoCelesteAI/build-ai-apps-python/tree/main/episode21 Three layers — block bad input, redact bad output, lock down which tools the model can call.
Once an AI agent is exposed to real users — let alone real tools — you have to think about guardrails. Not in the abstract. In code. There are three places where things can go wrong:
- Input — the user asks for something they shouldn't (or something poisonous: a prompt injection, a request for harm).
- Output — the model returns something it shouldn't (sensitive data, harmful content, leaked credentials).
- Action — the model tries to use a tool it shouldn't.
Each of those failure modes deserves its own guardrail. None of the three is a single line of code that "fixes safety." Together they form a defence-in-depth pattern that keeps the agent useful while preventing the easy failure modes.
This episode is the basic version of all three. Real production guardrails layer in more sophisticated tools — content classifiers, rate limiters, audit logs, human-in-the-loop approval — but the architecture is what we cover today.
What we're building
Three guardrail functions:
check_input(text)— refuses requests containing blocked keywords or that are too long.check_output(text)— redacts sensitive patterns (SSNs, credit-card numbers, emails) before showing the response.check_tool(tool_name)— confirms a tool is on an allowlist before executing it.
We wire the first two into a guarded_agent(question) function, then test it against:
- A normal question (passes both checks).
- A blocked-keyword question (input guard fires).
- A question whose answer contains an SSN and an email (output guard fires).
- A demo of the tool allowlist (showing which tools pass and which don't).
Input guardrail: keyword denylist
BLOCKED_TOPICS = ["hack", "exploit", "steal", "weapon", "illegal"]
def check_input(text):
lower = text.lower()
for word in BLOCKED_TOPICS:
if word in lower:
return False, f"Blocked: contains '{word}'"
if len(text) > 500:
return False, "Blocked: input too long (max 500 chars)"
return True, "OK"
A keyword denylist is the simplest input guardrail. It's also the crudest — words can match in legitimate contexts ("how do I hack on this open-source project?" gets falsely blocked), and clever phrasing slips past simple lists.
For production, you'd layer in:
- A small classifier model (a fine-tuned BERT, a small Claude call) that decides if the request is safe.
- Rate limits per user.
- Audit logs.
- Length caps and rate caps to prevent denial-of-service.
- Prompt-injection detection — patterns like "ignore previous instructions" or hidden instructions inside user-supplied content.
The goal of an input guardrail isn't zero false positives or false negatives. It's cheap, fast, transparent rejection of the obvious cases, leaving the model free to handle everything else. The keyword list is fine as a tutorial-grade first line of defence; just don't ship it as your only line of defence.
Output guardrail: regex redaction
SENSITIVE_PATTERNS = [
r"\b\d{3}-\d{2}-\d{4}\b", # SSN
r"\b\d{16}\b", # Credit card
r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", # Email
]
def check_output(text):
filtered = text
for pattern in SENSITIVE_PATTERNS:
filtered = re.sub(pattern, "[REDACTED]", filtered)
changed = filtered != text
return filtered, changed
Regex patterns for things that look like sensitive data. Each match gets replaced with [REDACTED]. The function returns the filtered text and a flag saying whether any redactions happened, so the caller can log it.
The reason output guardrails matter even when the model is "behaving":
Models accidentally surface training data or context window contents. If your system prompt includes a database key (don't!), a determined user can sometimes get the model to leak it. Redacting from the output is a backstop.
Tools return sensitive data. A tool that queries a database might pull back fields you intended to filter. The agent might quote that data verbatim in its answer. The output guard catches it.
Models hallucinate plausibly-formatted secrets. A model asked to "give an example email address" will produce something that looks like a real email. Sometimes that's fine; sometimes you want it redacted (e.g., in a customer-support context where any email-shaped string is a privacy risk).
Production output guardrails go beyond regex: classifiers for PII, profanity, brand-safety violations, code that could be malicious. Same architectural principle — postprocess before display.
Tool guardrail: allowlist
ALLOWED_TOOLS = ["search", "calculate"]
def check_tool(tool_name):
if tool_name in ALLOWED_TOOLS:
return True, "OK"
return False, f"Blocked: tool '{tool_name}' not in allowlist"
The principle here is allowlist, not denylist. Don't try to enumerate the tools you forbid. Enumerate the tools you allow.
This is the difference between "deny if matches a bad list" and "deny unless matches a good list." The first fails open — anything you forgot to list is allowed. The second fails closed — anything you forgot to list is denied. For tools that touch reality (file system, network, database, external API), always allowlist.
In our test:
search: ALLOWED — OK
calculate: ALLOWED — OK
delete_file: BLOCKED — Blocked: tool 'delete_file' not in allowlist
send_email: BLOCKED — Blocked: tool 'send_email' not in allowlist
delete_file and send_email are blocked because they aren't on the allowlist. They could be added with a deliberate decision (and probably some additional safeguards, like requiring human approval for destructive actions).
Wiring it together
def guarded_agent(question):
# Step 1: Input guardrail
allowed, reason = check_input(question)
if not allowed:
print(f"INPUT BLOCKED: {reason}")
return
# Step 2: Call Claude
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=200,
system="You are a helpful assistant. Answer questions concisely.",
messages=[{"role": "user", "content": question}],
)
raw_output = response.content[0].text
# Step 3: Output guardrail
filtered, was_filtered = check_output(raw_output)
print(f"\nAnswer: {filtered}")
Three blocks. Input check first — if it fails, we never spend a token on the API call. Then the model. Then output check before display.
Not every guardrail produces a block. Some redact (output), some warn (a soft yellow flag for the user), some delay (queue the request for human review). The pattern is the same: pre-call check, model call, post-call check.
Defence in depth
No single guardrail catches everything. The real safety property comes from layering.
- A user might bypass the keyword filter by misspelling. The output guard might still catch the response if it contains sensitive data.
- A clever prompt injection might trick the model. The tool allowlist still prevents it from calling destructive tools.
- The output guard might miss a novel pattern. Logging and human review catch it later.
Production agents have many guardrails — half a dozen or more — each catching a different failure mode. Today we built the three most fundamental.
What we didn't build
For honesty, here's what production agents add on top of today's pattern:
- Rate limiting per user, per IP, per token cost.
- Cost caps that disable the agent for a user who's spent more than $X this month.
- Human-in-the-loop approval for destructive tools (the model can propose
delete_file, but a human must confirm). - Audit logging of every input, output, and tool call. Required for compliance and incident response.
- Prompt-injection detection — explicit checks for "ignore previous instructions" and similar patterns inside user-supplied data (especially file contents, email bodies, and other untrusted input).
- Egress filtering — the agent's outputs are routed through additional classifiers before reaching the user.
If you're building a real product, plan for all of these. None of them are optional once a malicious user is in your audience.
Common mistakes
Treating one guardrail as enough. Defence in depth or you don't have defence.
Denylists for tools. Always allowlist. Anything not on the list is denied.
Skipping the input check when the call is "internal." Even internal callers can pass user-derived data. Validate at every trust boundary.
Logging the raw input/output without redaction. Your logs become a privacy disaster. Apply the output guard before logging.
Putting secrets in the system prompt. Don't. The model can be coaxed into surfacing them.
Trusting the model to enforce its own constraints. "Do not output sensitive data" in the system prompt is helpful but insufficient. Always have an external check.
What's next
Next episode: evaluating agents. How do you tell if your agent is getting better? You write test cases. We'll build a tiny eval harness — a list of questions with expected keywords in the answer — and use it to score the agent. Once you have a measurable score, iteration becomes possible.
Recap
What we did today. Built three guardrail functions — input keyword block, output regex redaction, tool allowlist. Wired input and output guards into a guarded_agent and confirmed each one fires on the right kind of misuse. Discussed defence in depth and the long list of additional guardrails real production agents layer on.
You haven't shipped a safe agent. But you've built the architectural shape every safe agent has, and you understand what kinds of failures each layer catches.
Next episode: evaluating agents. See you in the next one.