You've learned re.search(), re.findall(), named groups, and re.sub(). Today we apply them to the exact shapes you'll encounter in real log data: timestamps, IPs, email addresses, and log line formats.
I've already got the IP pattern memorized — r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'. And I wrote the timestamp pattern yesterday: r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'. But the ops team's older logs have timestamps in ISO 8601 format with a T and Z: 2026-04-07T09:14:33Z. Different format.
The pattern adjustment is small — T instead of space, optional Z at the end:
import re
# ISO 8601 timestamp: 2026-04-07T09:14:33Z
ts_iso = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?')
# Legacy timestamp: 2026-04-07 09:14:33
ts_legacy = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')
# Match either format
ts_either = re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}Z?')

The character class [T ] matches either a T or a space. One pattern, two formats.
[T ] is a two-character class that matches either character. So I can write one compiled pattern that handles both the old and new timestamp format. The ops team has mixed logs from the migration period — this is exactly the problem I have right now.
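A quick sanity check of the combined pattern (the log lines here are made up, but they cover both separator styles):

```python
import re

# One pattern for both separator styles; Z? makes the trailing Z optional
ts_either = re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}Z?')

mixed = [
    "2026-04-07T09:14:33Z ERROR auth: Token expired",  # ISO 8601
    "2026-04-07 09:14:33 ERROR auth: Token expired",   # legacy
]

for line in mixed:
    print(ts_either.search(line).group())
# 2026-04-07T09:14:33Z
# 2026-04-07 09:14:33
```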
Email addresses come up less often in server logs but frequently in application logs — user actions, auth events, notification sends. A practical email pattern isn't trying to cover every RFC 5322 edge case:
import re
# Practical: catches 99%+ of real email addresses in logs
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+')
log_line = "Auth failed for user maya.patel+ops@company.internal from 10.0.0.1"
emails = email_pattern.findall(log_line)
print(emails)  # ['maya.patel+ops@company.internal']

[\w.+-]+ — word characters, dots, plus signs, and hyphens before the @. [\w-]+ for the domain. \.[\w.]+ for the TLD, which can itself contain dots for things like .co.uk or .internal. That covers everything in the ops team's logs.
And when a single log line contains multiple extractable fields, named groups with .groupdict() give you structured output directly:
import re
ERROR_EVENTS = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?)'
    r'.*?(?P<level>ERROR|CRITICAL)'
    r'.*?(?P<service>[\w-]+):'
)
logs = [
"2026-04-07T09:14:33Z ERROR auth: Token expired",
"2026-04-07T09:15:01Z INFO api: Request received",
"2026-04-07T09:16:22Z CRITICAL db: Connection pool exhausted",
]
for line in logs:
    m = ERROR_EVENTS.search(line)
    if m and m.group("level") in ("ERROR", "CRITICAL"):
        print(m.groupdict())

The .*? between fields is non-greedy — it skips the minimum number of characters needed to reach the next part of the pattern. So the pattern doesn't accidentally consume the service name while looking for the level.
Exactly. You're reading regex patterns and explaining non-greedy behavior. Three days ago you called it keyboard noise.
I still think the syntax is a crime. But I understand the crime now, and I can commit it when I need to. That's the important part.
One practical pattern you'll need for the analyzer — extracting HTTP status codes from access logs:
import re
# Apache/nginx access log: "GET /api/users HTTP/1.1" 200 342
access_pattern = re.compile(
    r'"(?P<method>GET|POST|PUT|DELETE|PATCH) (?P<path>/[^"]*) HTTP/\d+\.\d+"'
    r' (?P<status>\d{3})'
    r' (?P<size>\d+|-)'
)

[^"]* — any character that's not a double-quote, zero or more times. So it captures everything between the opening quote and the HTTP version, which is the full path including query strings. Smart. The | in the size field matches either digits or - for missing content-length.
Correct. The character class negation [^...] is one of the most useful tools for parsing delimited text — instead of saying what you want, you say what you don't want. It's cleaner than .+? followed by a lookahead in most cases.
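As a sketch, here is that access pattern run against a sample nginx-style line; the line and its fields are illustrative, not taken from real logs:

```python
import re

# Same access-log pattern as above
access_pattern = re.compile(
    r'"(?P<method>GET|POST|PUT|DELETE|PATCH) (?P<path>/[^"]*) HTTP/\d+\.\d+"'
    r' (?P<status>\d{3})'
    r' (?P<size>\d+|-)'
)

# Illustrative access-log line, including a query string in the path
line = '10.0.0.1 - - [07/Apr/2026:09:14:33 +0000] "GET /api/users?page=2 HTTP/1.1" 200 342'

m = access_pattern.search(line)
print(m.groupdict())
# {'method': 'GET', 'path': '/api/users?page=2', 'status': '200', 'size': '342'}
```

Note that the query string survives: [^"]* stops only at the closing quote, and backtracking hands the " HTTP/1.1" suffix to the literal part of the pattern.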
I've got the eight metacharacters, named groups, substitution, non-greedy, and character class negation. I think I actually understand regex now. I don't love it. But I understand it.
Understanding is enough. Love is optional. Tomorrow we take a break from regex and look at the string module — template strings, text constants, and formatting utilities. Less powerful than regex, much more readable when you don't need pattern matching.
Real log files contain a consistent set of structured patterns: timestamps in one or two formats, IPv4 addresses, HTTP methods and status codes, user identifiers, and error codes. Learning to recognize and express these shapes in regex syntax — rather than trying to invent regex from first principles — covers 90% of log analysis work.
ISO 8601 timestamps (2026-04-07T09:14:33Z) and space-separated legacy timestamps (2026-04-07 09:14:33) differ only in the separator character. The character class [T ] handles both in a single pattern. For timestamps with fractional seconds (09:14:33.456), add (?:\.\d+)? — a non-capturing group made optional by the trailing ?. For timezone offsets (+05:30), add (?:[+-]\d{2}:\d{2}|Z)?.
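Putting those optional pieces together, a minimal sketch of the extended pattern (the sample timestamps are illustrative):

```python
import re

# Base timestamp plus optional fractional seconds and optional offset or Z
ts_full = re.compile(
    r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}'
    r'(?:\.\d+)?'              # optional fractional seconds
    r'(?:[+-]\d{2}:\d{2}|Z)?'  # optional timezone offset or Z
)

samples = [
    "2026-04-07T09:14:33Z",
    "2026-04-07T09:14:33.456+05:30",
    "2026-04-07 09:14:33",
]
for s in samples:
    print(ts_full.fullmatch(s) is not None)
# True
# True
# True
```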
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b matches any four-octet dotted-decimal address. The word boundaries prevent matching octets that are part of longer numbers. This pattern does not validate that each octet is 0-255 — 999.999.999.999 matches. For log analysis, where invalid IPs are the minority and you'd rather flag them than silently drop them, this is the right tradeoff. Stricter validation belongs in a separate step after extraction.
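The extract-then-validate split can be sketched like this; the sample text and the helper name is_valid_ipv4 are illustrative:

```python
import re

# Loose extraction: any four dotted groups of 1-3 digits
ipv4 = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def is_valid_ipv4(addr):
    # Second-pass validation: every octet must be in 0-255
    return all(0 <= int(octet) <= 255 for octet in addr.split("."))

text = "peers: 10.0.0.1, 999.999.999.999, 192.168.1.254"
candidates = ipv4.findall(text)
print(candidates)
# ['10.0.0.1', '999.999.999.999', '192.168.1.254']
print([ip for ip in candidates if is_valid_ipv4(ip)])
# ['10.0.0.1', '192.168.1.254']
```

The invalid address is still extracted, so it can be counted or flagged before the validation step drops it.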
[abc] matches any character in the set. [a-z] matches any lowercase letter. [^abc] matches any character NOT in the set. Character class negation is cleaner than non-greedy dot matching for parsing delimited content: [^"]* (everything except a double-quote) is more precise than .*?" (anything up to the next double-quote). In access log parsing, [^"]* for the request field is both faster and more correct.
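A small illustration of a negated class pulling every quoted field from a line (the log line is made up):

```python
import re

# Everything between a pair of double-quotes, however long
quoted = re.compile(r'"([^"]*)"')

line = '10.0.0.1 "GET /health HTTP/1.1" 200 2 "curl/8.5.0"'
print(quoted.findall(line))
# ['GET /health HTTP/1.1', 'curl/8.5.0']
```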
The RFC 5322 specification for valid email addresses is notoriously complex — the full grammar covers quoted strings, comments, and internationalized addresses. In practice, [\w.+-]+@[\w-]+\.[\w.]+ matches the overwhelming majority of email addresses that actually appear in server logs. The philosophy: write the simplest pattern that matches your actual data, not the most theoretically complete pattern for all possible inputs.
When a pattern spans multiple fields separated by variable-length content, .*? (non-greedy dot-star) matches as little as possible before the next anchor. .* (greedy) matches as much as possible, potentially skipping over intermediate fields. For log lines where field order is fixed but field content length varies, .*? between anchors extracts exactly the fields you want without consuming the separators.
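A minimal demonstration of the difference, using a made-up line that contains two quoted fields:

```python
import re

line = '"GET /a HTTP/1.1" 200 5 "GET /b HTTP/1.1" 404 0'

# Greedy: .* runs to the LAST quote, swallowing both fields and the bytes between
print(re.search(r'"(.*)"', line).group(1))
# GET /a HTTP/1.1" 200 5 "GET /b HTTP/1.1

# Non-greedy: .*? stops at the FIRST closing quote
print(re.search(r'"(.*?)"', line).group(1))
# GET /a HTTP/1.1
```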