Production rule: customer text often contains PII — emails, phone numbers, IDs. Sending it raw to a third-party LLM is a leak. Redact before send.
```python
import re

PATTERNS = {
    "EMAIL": r'[\w.+-]+@[\w-]+\.[\w.-]+',
    "PHONE": r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b',
    "SSN": r'\b\d{3}-\d{2}-\d{4}\b',
}

def redact(text):
    # Replace every match of each pattern with its bracketed label.
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f'[{label}]', text)
    return text

print(redact("Email me at jane@acme.com or call 555-123-4567"))
# → Email me at [EMAIL] or call [PHONE]
```

Regex catches all of it?
Common shapes, yes. Edge cases (international phone formats, unusual emails) leak through. Production stacks layer regex with NER models or commercial PII detectors. For first-line defense, regex is 80% of the value at 5% of the effort.
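To see what "leaks through" looks like, a quick check along these lines can help. It reuses the lesson's PHONE pattern; the sample strings are invented for illustration, and the sketch only reports which ones the pattern misses.

```python
import re

# Same US-centric phone pattern as in the lesson.
PHONE = re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b')

# Illustrative samples (made up for this sketch).
samples = [
    "555-123-4567",      # plain US format
    "555.123.4567",      # dots as separators
    "(555) 123-4567",    # parenthesized area code
    "+44 20 7946 0958",  # UK-style number
]

# Collect the samples the pattern fails to detect.
leaked = [s for s in samples if not PHONE.search(s)]
print(leaked)
# → ['(555) 123-4567', '+44 20 7946 0958']
```

Even a common US variant with a parenthesized area code slips past, which is why the regex layer is treated as a first line of defense only.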
And the model still understands the redacted text?
Usually. "Customer at [EMAIL] complained" → the model treats [EMAIL] as an opaque token but follows the rest. If the model needs to act on the email, you'd un-redact in your code after the model decides what to do.
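One way to support un-redacting later is to use numbered placeholders and keep a mapping on your side. The sketch below is an assumption about how that could look, not part of the lesson's API: `redact_reversible` and `restore` are hypothetical helper names, and the `send_email(...)` string stands in for whatever action the model returns.

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    "PHONE": re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'),
}

def redact_reversible(text):
    """Replace each match with a numbered placeholder and remember the original."""
    mapping = {}  # placeholder -> original value, kept only in your own code

    for label, pat in PATTERNS.items():
        def swap(match, label=label):
            placeholder = f'[{label}_{len(mapping) + 1}]'
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pat.sub(swap, text)
    return text, mapping

def restore(text, mapping):
    """Put the original values back into text produced after the model call."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

clean, mapping = redact_reversible("Reply to jane@acme.com or call 555-123-4567")
print(clean)
# → Reply to [EMAIL_1] or call [PHONE_2]

# Hypothetical model decision that refers to the placeholder:
print(restore("send_email(to='[EMAIL_1]')", mapping))
# → send_email(to='jane@acme.com')
```

The mapping never leaves your process, so the model only ever sees the placeholders.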
```python
import re

# Compile once at import time; the patterns are reused on every call.
PATTERNS = {
    "EMAIL": re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    "PHONE": re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'),
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
}

def redact(text):
    out = text
    for label, pat in PATTERNS.items():
        out = pat.sub(f'[{label}]', out)
    return out

def has_pii(text):
    return any(p.search(text) for p in PATTERNS.values())
```

Two public functions: `redact` rewrites; `has_pii` flags.
```python
# receive_user_message, log_pii_seen, Agent and model come from your application.
user_input = receive_user_message()
clean = redact(user_input)
if has_pii(user_input):
    log_pii_seen(clean)  # alert / metric; log only the redacted text
response = Agent(model).run_sync(clean).output
```

Redact at the boundary, before the LLM sees the text. The redacted version is what gets logged, what flows through prompts, and what is sent to third parties.
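To make the boundary idea concrete without a real model client, here is a self-contained sketch. `call_llm_safely` and `fake_llm` are hypothetical names introduced for illustration; the stand-in model simply records what it was sent, so you can confirm that no raw value crosses the boundary.

```python
import re

PATTERNS = {
    "EMAIL": re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    "PHONE": re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'),
    "SSN": re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
}

def redact(text):
    for label, pat in PATTERNS.items():
        text = pat.sub(f'[{label}]', text)
    return text

def call_llm_safely(llm, text):
    """Redact first, then hand only the clean text to the model callable."""
    return llm(redact(text))

# Stand-in for a real model client: it records every prompt it receives.
seen = []

def fake_llm(prompt):
    seen.append(prompt)
    return f"ack: {prompt}"

reply = call_llm_safely(fake_llm, "My SSN is 123-45-6789")
print(seen)
# → ['My SSN is [SSN]']
```

Routing every model call through one wrapper like this keeps the redaction step from being skipped by accident.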
| Type | Regex sketch | Notes |
|---|---|---|
| Email | `[\w.+-]+@[\w-]+\.[\w.-]+` | RFC-compliant emails are much more complex; this catches 95%+ |
| US phone | `\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b` | International numbers don't match; multi-region support needs more patterns |
| US SSN | `\b\d{3}-\d{2}-\d{4}\b` | Strict format only; nine unformatted digits can also be an SSN and are not caught |
| Credit card | Luhn algorithm + format | Beyond plain regex; use a dedicated card-validation library |
| Names, addresses | NER required | Regex can't reliably tag names; those need a small model |
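The credit card row can be sketched as a regex that finds candidates plus a Luhn checksum that filters out digit runs which are not card numbers. This is an illustrative sketch, not a replacement for a dedicated library: `CARD_CANDIDATE`, `luhn_ok`, and `redact_cards` are names assumed here, and the 13 to 19 digit range is an assumption about typical card lengths.

```python
import re

# Candidate: 13-19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r'\b\d(?:[ -]?\d){12,18}\b')

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact_cards(text):
    def swap(match):
        digits = re.sub(r'[ -]', '', match.group(0))
        # Only redact candidates that pass the checksum.
        return '[CARD]' if luhn_ok(digits) else match.group(0)
    return CARD_CANDIDATE.sub(swap, text)

# 4111 1111 1111 1111 is a widely used test card number.
print(redact_cards("Card 4111 1111 1111 1111, order 1234567890123456"))
# → Card [CARD], order 1234567890123456
```

The checksum keeps order numbers and other long digit runs from being redacted as cards, at the cost of missing mistyped card numbers.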
For real PII compliance (GDPR, HIPAA), you'd combine several layers: regex as the first layer, backed by NER models or commercial PII detectors.
This lesson covers layer 1 only. The pattern composes: same `redact`/`has_pii` API, layered detectors inside.
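A minimal sketch of how that composition might look: each detector returns `(start, end, label)` spans, and `redact`/`has_pii` keep their signatures while consulting every detector. The detector interface and `name_list_detector` are assumptions for illustration; the latter is a toy stand-in for a real NER layer, not an actual model.

```python
import re

def regex_detector(label, pattern):
    """Build a detector that returns (start, end, label) spans for a regex."""
    compiled = re.compile(pattern)
    def detect(text):
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]
    return detect

def name_list_detector(text):
    """Toy stand-in for an NER layer: flags names from a fixed list."""
    spans = []
    for name in ("Jane Doe",):
        for m in re.finditer(re.escape(name), text):
            spans.append((m.start(), m.end(), "NAME"))
    return spans

# Layer 1 (regex) and a placeholder for layer 2, behind one interface.
DETECTORS = [
    regex_detector("EMAIL", r'[\w.+-]+@[\w-]+\.[\w.-]+'),
    regex_detector("PHONE", r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b'),
    name_list_detector,
]

def find_spans(text):
    return [span for detect in DETECTORS for span in detect(text)]

def has_pii(text):
    return bool(find_spans(text))

def redact(text):
    # Replace from right to left so earlier offsets stay valid.
    # Overlapping spans are not merged in this sketch.
    for start, end, label in sorted(find_spans(text), reverse=True):
        text = text[:start] + f'[{label}]' + text[end:]
    return text

print(redact("Jane Doe wrote from jane@acme.com"))
# → [NAME] wrote from [EMAIL]
```

Callers keep using `redact` and `has_pii` unchanged; adding a detector means appending to `DETECTORS`.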