Yesterday you wrote five separate re.search() calls to extract five fields from a log line. Each one is independent and correct. Today we make it a single sweep with named groups.
Named groups — so instead of .group(1) for the timestamp and .group(2) for the level, I can do .group("timestamp") and .group("level")? The parentheses get a label?
Exactly. The syntax is (?P<name>pattern):
import re

LOG_PATTERN = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'
    r' (?P<level>DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL)'
    r' \[(?P<service>[\w-]+)\]'
    r' (?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
    r' - (?P<message>.+)$'
)
log_line = "2026-04-07 09:14:33 ERROR [auth] 192.168.1.42 - Token expired"
match = LOG_PATTERN.search(log_line)
if match:
    print(match.group("timestamp"))  # 2026-04-07 09:14:33
    print(match.group("level"))      # ERROR
    print(match.groupdict())         # all five fields as a dict
.groupdict() gives me a Python dict directly? With the group names as keys? That's the return value I was building manually — now the match object builds it for me.
One sweep through the string, five labeled captures, one dict out. The pattern is split across several adjacent string literals for readability — Python concatenates adjacent string literals automatically. And re.compile() pre-parses the pattern so it doesn't re-parse on every log line.
For substitution — re.sub(). When I want to redact IP addresses from logs before sharing them with an external team, I replace every IP with a placeholder?
Exactly. re.sub() replaces every match:
import re
def redact_ips(log_text: str) -> str:
    ip_pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
    return re.sub(ip_pattern, "[REDACTED]", log_text)
log = "Connection from 192.168.1.42 failed, retry from 10.0.0.1"
print(redact_ips(log))
# Connection from [REDACTED] failed, retry from [REDACTED]
I've been doing this with a loop and str.replace(), but that required knowing all the IP addresses in advance. re.sub() finds and replaces any IP matching the pattern — I don't know the IPs ahead of time and I don't need to.
And the replacement can be a function, not just a string. You can transform each match individually — anonymize to a counter, truncate to the first two octets, anything. The replacement function receives the match object:
import re
counter = {}
def anonymize_ip(match):
    ip = match.group(0)
    if ip not in counter:
        counter[ip] = f"IP_{len(counter) + 1}"
    return counter[ip]
log = "192.168.1.42 -> 10.0.0.1, then 192.168.1.42 again"
print(re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', anonymize_ip, log))
# IP_1 -> IP_2, then IP_1 again
The same IP gets the same label throughout the log. Consistent anonymization without knowing the IPs up front. That's actually useful for sharing logs with vendors — they see patterns in the IP behavior without seeing the real addresses.
You just described a real security workflow. The regex finds the shape. The replacement function handles the logic. Today's problem puts both together: extract fields with named groups, apply substitutions to redact sensitive data, return the structured result.
One question — what if the log line doesn't fully match the combined pattern? Yesterday with five separate searches, each field degraded independently. With one big pattern, if one piece is wrong the whole match fails.
That's the tradeoff. Single-pattern is faster and gives you .groupdict() directly. Multi-search is more resilient to partial matches. For a controlled log format where you know the structure, single-pattern wins. For genuinely variable-format logs, per-field searches are safer. Choose based on how much you trust the format.
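The resilient multi-search approach can be sketched like this — a hypothetical fallback parser (the field names and patterns here are illustrative, not from the lesson's exercise) where each field degrades independently instead of the whole match failing:

```python
import re

# Per-field patterns: a malformed timestamp doesn't cost us the IP or level.
FIELD_PATTERNS = {
    "timestamp": re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}'),
    "level": re.compile(r'\b(?:DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL)\b'),
    "ip": re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'),
}

def parse_resilient(line: str) -> dict:
    # Each field is searched independently; missing fields become None.
    result = {}
    for name, pat in FIELD_PATTERNS.items():
        m = pat.search(line)
        result[name] = m.group(0) if m else None
    return result

broken = "??? ERROR [auth] 192.168.1.42 - Token expired"  # garbled timestamp
print(parse_resilient(broken))
# {'timestamp': None, 'level': 'ERROR', 'ip': '192.168.1.42'}
```

Note the cost: three passes over the string instead of one, and no guarantee the fields came from the same positions in the line.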
Named groups are the feature that transforms regex from a search tool into a data extraction tool. The distinction matters: finding a timestamp in a log line is a search problem. Extracting the timestamp, level, service, IP, and message simultaneously and returning them as a labelled dict is a data extraction problem. Named groups solve the latter cleanly.
The (?P<name>...) syntax labels a capturing group with a name. On a successful match, match.group("name") retrieves the captured text, and match.groupdict() returns all named groups as a single dict — exactly the shape you'd want to pass to the rest of your processing pipeline. Named groups are also self-documenting: (?P<timestamp>...) is readable in a way that (...) with a comment is not.
Sometimes you need grouping for alternation or repetition without capturing: (?:ERROR|WARNING) groups the alternation without creating a group number. This avoids polluting .group() indices with structural groups you don't need. Named groups capture; non-capturing groups (?:...) group without capturing.
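The effect on group numbering is easy to see side by side — a minimal sketch comparing a capturing and a non-capturing alternation:

```python
import re

# With a capturing group, the alternation occupies group 1.
capturing = re.search(r'(ERROR|WARNING): (\d+)', "ERROR: 42")
print(capturing.groups())      # ('ERROR', '42') — two numbered groups

# With (?:...), the alternation still groups but creates no number.
non_capturing = re.search(r'(?:ERROR|WARNING): (\d+)', "ERROR: 42")
print(non_capturing.groups())  # ('42',) — only the digits are captured
print(non_capturing.group(1))  # '42' — the code is group 1, not group 2
```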
re.sub(pattern, repl, string) replaces every match of pattern in string with repl. When repl is a string, it supports backreferences: \1 inserts the first group. When repl is a callable, it receives the match object for each replacement and returns the replacement string. This makes re.sub() capable of context-aware replacement — anonymizing IPs consistently, normalizing timestamp formats, redacting patterns based on surrounding context.
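String replacements also work with named groups via \g<name>. A small sketch (the date-normalization scenario is illustrative, not from the lesson's exercise) reordering US-style dates to ISO format:

```python
import re

text = "Deployed on 04/07/2026, rolled back on 04/09/2026"

# \g<name> in the replacement string inserts the named group's capture.
iso = re.sub(
    r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})',
    r'\g<year>-\g<month>-\g<day>',
    text,
)
print(iso)
# Deployed on 2026-04-07, rolled back on 2026-04-09
```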
Quantifiers are greedy by default: .* matches as many characters as possible. Add ? to make them non-greedy: .*? matches as few characters as possible. For log messages that end with a pattern — r'ERROR.*error_code: (\d+)' — greedy matching can overshoot into the next log line if you're processing multi-line text. Non-greedy is the safer default for log parsing.
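The overshoot is easiest to see with two entries in one string. A minimal sketch — note it uses re.DOTALL so that . crosses the newline, which is what makes the greedy version run past the first entry:

```python
import re

text = ("ERROR something failed error_code: 17\n"
        "ERROR another failure error_code: 99")

# Greedy: .* consumes as much as possible, then backtracks to the
# LAST occurrence of "error_code: " — the second entry's code.
greedy = re.search(r'ERROR.*error_code: (\d+)', text, re.DOTALL)
print(greedy.group(1))  # 99 — overshot into the second entry

# Non-greedy: .*? consumes as little as possible, stopping at the
# FIRST occurrence of "error_code: ".
lazy = re.search(r'ERROR.*?error_code: (\d+)', text, re.DOTALL)
print(lazy.group(1))    # 17 — the first entry's code
```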
re.compile(pattern) returns a compiled pattern object with .search(), .findall(), .sub(), and .match() methods. Compilation parses the pattern string once. The module-level functions (re.search(), re.sub()) do cache recently compiled patterns internally, but every call still pays a cache lookup; a pre-compiled pattern skips that overhead entirely. For a loop over 100,000 log lines the difference is measurable, typically on the order of tens of percent for simple patterns. Compile patterns that are used more than once.
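You can measure the difference yourself — a rough benchmark sketch using the log line from earlier (exact numbers depend on your machine and Python version, so treat the output as indicative only):

```python
import re
import timeit

pattern = r'(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
compiled = re.compile(pattern)
line = "2026-04-07 09:14:33 ERROR [auth] 192.168.1.42 - Token expired"

# Module-level re.search: pattern-cache lookup on every call.
uncompiled_t = timeit.timeit(lambda: re.search(pattern, line), number=100_000)
# Pre-compiled pattern: no lookup, straight to matching.
compiled_t = timeit.timeit(lambda: compiled.search(line), number=100_000)

print(f"module-level: {uncompiled_t:.3f}s  compiled: {compiled_t:.3f}s")
```

Both forms return identical matches; the only difference is where the parsing and cache-lookup cost is paid.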