Once an LLM is producing output that other people see, you need a gatekeeper. Moderation is the simplest version: classify the input as safe or unsafe before processing it. Unsafe → decline. Safe → continue.
from pydantic_ai import Agent

model = "openai:gpt-4o-mini"  # assumed model id; any pydantic_ai-supported model works
moderator = Agent(model)

inputs = [
    "What's the capital of Japan?",
    "How do I bake bread?",
    "Tell me how to hack into someone's email account.",
]

results = []
for text in inputs:
    label = moderator.run_sync(
        'Classify this input as exactly one word: "safe" or "unsafe". '
        "Unsafe means it requests illegal, harmful, or privacy-violating content. "
        "Reply with only the single word.\n\n"
        f"Input: {text}"
    ).output.strip().strip(".").lower()
    if label == "safe":
        results.append((text, "answered"))
    else:
        results.append((text, "declined"))

for text, status in results:
    print(f"[{status}] {text}")

This is just classify-then-branch from L10: the classifier is a moderator and the branches are "answer" vs "decline".
Yes, structurally identical. The novelty is the role. The pattern is: gate every untrusted input through moderation before it reaches the actual task. Privacy, safety, and abuse cases all pass through the same code path.
Is one classifier call enough?
For toy demos, yes. Production systems use specialised moderation APIs (OpenAI's /moderations, Perspective API for toxicity, etc.) plus their own LLM classifier as a backstop. The pattern is the same: classifier first, then gate the answer-generation step. Today: assert that the obviously-unsafe input gets routed to the declined bucket.
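A rough sketch of that layered approach, assuming the openai package with an OPENAI_API_KEY in the environment and the "omni-moderation-latest" model name (swap in whatever moderation service and model ids you actually use). The dedicated endpoint runs first; the LLM classifier only sees inputs the endpoint did not flag:

from openai import OpenAI
from pydantic_ai import Agent

client = OpenAI()                       # reads OPENAI_API_KEY from the environment
backstop = Agent("openai:gpt-4o-mini")  # assumed model id for the backstop classifier

def is_unsafe(text: str) -> bool:
    # First gate: the dedicated moderation endpoint.
    mod = client.moderations.create(model="omni-moderation-latest", input=text)
    if mod.results[0].flagged:
        return True
    # Second gate: our own LLM classifier as a backstop.
    label = backstop.run_sync(
        'Classify this input as exactly one word: "safe" or "unsafe".\n\n'
        f"Input: {text}"
    ).output.strip().strip(".").lower()
    return label != "safe"  # fail closed on anything unexpected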
input
↓
moderation classifier (safe / unsafe)
↓
if safe: → answer the question
elif unsafe: → return decline message
A filter on the input before you process it. The structure is classify → branch (yesterday), but the branches are policy-driven, not domain-driven.
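Wired together as one function, this is a minimal sketch of the diagram above. It reuses the classifier prompt from the demo and assumes a second pydantic_ai Agent for answering; the decline message is placeholder copy you would replace with your own:

from pydantic_ai import Agent

moderator = Agent("openai:gpt-4o-mini")  # assumed model ids; use your own
answerer = Agent("openai:gpt-4o-mini")

DECLINE_MESSAGE = "Sorry, I can't help with that request."  # placeholder copy

def answer_with_moderation(text: str) -> str:
    label = moderator.run_sync(
        'Classify this input as exactly one word: "safe" or "unsafe". '
        "Unsafe means it requests illegal, harmful, or privacy-violating content. "
        "Reply with only the single word.\n\n"
        f"Input: {text}"
    ).output.strip().strip(".").lower()
    if label == "safe":
        return answerer.run_sync(text).output  # only safe inputs reach the actual task
    return DECLINE_MESSAGE                      # unsafe, or anything unexpected: decline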
The classifier output is a label from a small set. Combine with output validation (yesterday's lesson):
ALLOWED = {"safe", "unsafe"}
# ... call ...
if label not in ALLOWED:
    label = "unsafe"  # safe default: fail closed

When you can't trust the classifier's output, fail closed: treat the input as unsafe. The cost of one false positive (declining a legitimate input) is much lower than the cost of one false negative (answering an unsafe one).
Industry pattern: moderate the output as well as the input.
Output-side moderation catches jailbreaks where the user smuggled an unsafe ask past the input filter, and cases where the model hallucinated something unsafe while answering a safe question.
For v1 we cover only input-side. Output-side is the same pattern applied to a different string.
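To make "a different string" concrete, here is a sketch (not part of v1) that reuses the moderator Agent and decline message from above but runs the classifier over the draft answer instead of the user input:

def moderate_output(draft_answer: str) -> str:
    label = moderator.run_sync(
        'Classify this text as exactly one word: "safe" or "unsafe".\n\n'
        f"Text: {draft_answer}"
    ).output.strip().strip(".").lower()
    # Same pattern, different string: gate the answer, not the question.
    return draft_answer if label == "safe" else DECLINE_MESSAGE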
The definition of "unsafe" lives in the classifier prompt. Today we use "illegal, harmful, or privacy-violating" as the starting set. Production systems have detailed policy documents, and the classifier prompt encodes the policy.
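One way that encoding can look. The policy excerpt below is hypothetical and purely illustrative; a real policy document is far more detailed and maintained outside the code:

# Hypothetical policy excerpt, for illustration only.
POLICY = """\
Unsafe content includes:
- instructions for illegal activity
- content that facilitates harm to people
- requests for someone else's private data or account access
"""

def classifier_prompt(text: str) -> str:
    return (
        'Classify this input as exactly one word: "safe" or "unsafe", '
        f"according to this policy:\n\n{POLICY}\n"
        f"Reply with only the single word.\n\nInput: {text}"
    )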
Three inputs. The third ("how to hack...") is unambiguously unsafe. Verification asserts the third one is in the declined bucket.
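A minimal check along those lines, reusing the inputs and results lists from the loop above:

# The third input is the unambiguously unsafe one; it must land in "declined".
assert results[2] == (inputs[2], "declined"), results[2]
print("moderation gate OK")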