Today we learn regular expressions. Fair warning: the syntax looks like someone's cat walked across the keyboard.
That bad?
Here's a pattern that matches an IP address: r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
What. The \d is a digit, I can guess that. The {1,3} means one to three digits? And \. is a literal dot because a bare . means any character?
Exactly right. Think of the re module like a metal detector. You describe the shape of what you're looking for — four groups of one-to-three digits separated by literal dots — and re sweeps through the text finding every match. re.findall() returns all of them:
import re
log_line = "Connection from 192.168.1.42 failed, retry from 10.0.0.1"
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', log_line)
print(ips)  # ['192.168.1.42', '10.0.0.1']
Two lines. TWO LINES. I wrote a thirty-line function to extract IPs from log files using split() and isdigit(). It broke on the first IP with a port number appended. Thirty lines, Sam.
And now it's two. Welcome to the standard library. The re module has four functions you'll use constantly. re.findall() returns all matches. re.search() finds the first match and returns a match object — or None if nothing is found:
import re
log_line = "[2026-04-07T09:14:33Z] ERROR auth: Token expired for 192.168.1.42"
match = re.search(r'ERROR|WARNING|INFO|DEBUG', log_line)
if match:
    print(match.group())  # ERROR
    print(match.start())  # 23 (position in string)
So re.search() returns a match object, and I call .group() for the actual text. If there's no match it returns None, so I can use it directly in an if statement. And the | is alternation: it matches any of those options?
| is alternation, yes. Here are the eight metacharacters that cover 90% of log parsing — \d (digit), \w (word character), \s (whitespace), . (any char), + (one or more), * (zero or more), {n,m} (between n and m), [] (character class). For structured log lines with a consistent format, here is the complete extraction pattern:
import re
def extract_log_fields(log_line: str) -> dict:
    timestamp = re.search(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})', log_line)
    level = re.search(r'\b(DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL)\b', log_line)
    service = re.search(r'\[([\w-]+)\]', log_line)
    ip = re.search(r'\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b', log_line)
    message = re.search(r' - (.+)$', log_line)
    return {
        "timestamp": timestamp.group(1) if timestamp else "",
        "level": level.group(1) if level else "",
        "service": service.group(1) if service else "",
        "ip": ip.group(1) if ip else "",
        "message": message.group(1) if message else "",
    }
Each call is completely independent: timestamp doesn't care about IP, and IP doesn't care about level. I can add or remove fields without touching the others. My thirty-line string-splitting function was one fragile block. Move one field and everything breaks. This is composable.
You just described why regex beats string-splitting for variable-format text. split() assumes fixed structure. Regex assumes only shape. Log formats change — the shapes stay. The \b around the level pattern is a word boundary — prevents ERROR from matching inside ERRORCODE.
Today's version uses one re.search() per field. Tomorrow you said we'd rewrite it as a single search with named groups — one sweep, five labeled evidence bags. I want to see that.
Tomorrow. Today: one pattern per field, empty string for missing matches, return a dict. The metal detector gets evidence bags tomorrow. Today you learn to use it.
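The "add or remove fields without touching the others" idea can also be sketched as a table of patterns plus a loop. This is a variation on the function above, not the lesson's own code: the names FIELD_PATTERNS and extract_fields are hypothetical, and the sample log lines are invented to fit the patterns.

```python
import re

# Hypothetical lookup table: one pattern per field, each with exactly one
# capturing group. The patterns are the same ones used in the lesson.
FIELD_PATTERNS = {
    "timestamp": r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})',
    "level": r'\b(DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL)\b',
    "service": r'\[([\w-]+)\]',
    "ip": r'\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b',
    "message": r' - (.+)$',
}


def extract_fields(log_line: str) -> dict:
    """Run every pattern independently; missing fields become ''."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, log_line)
        fields[name] = match.group(1) if match else ""
    return fields


# Invented sample line in an assumed format.
full = extract_fields("2026-04-07 09:14:33 [auth] ERROR 192.168.1.42 - Token expired")
print(full["service"], full["level"], full["message"])  # auth ERROR Token expired

# A line with only a level still works: the other fields come back empty.
partial = extract_fields("ERROR something went wrong")
print(partial["level"], repr(partial["ip"]))  # ERROR ''
```

Adding a field here means adding one dictionary entry; removing one means deleting an entry. The loop itself never changes.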
Regular expressions have a reputation for being cryptic. That reputation is deserved: the notation grew out of 1950s automata theory and the Unix text tools of the late 1960s, and its core has barely changed since. But the underlying model is both simple and powerful: describe the shape of what you're looking for, and the engine finds every instance in the text. For log analysis, where the data is unstructured text with embedded structured information, regex is the right tool.
re.search(pattern, string) finds the first match anywhere in the string and returns a match object, or None. re.findall(pattern, string) returns all non-overlapping matches as a list of strings; if the pattern has one capturing group the list holds that group's text, and with two or more groups it holds tuples. re.match(pattern, string) anchors to the start of the string, which is rarely what you want for log lines. re.sub(pattern, replacement, string) replaces all matches, which is useful for redacting IPs or normalizing timestamps.
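A side-by-side sketch of the four functions on one line of text may help. The log line and variable names below are made up for illustration.

```python
import re

# Invented sample line for illustration.
line = "2026-04-07 09:14:33 ERROR login failed from 192.168.1.42, retry from 10.0.0.1"
ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

# search: first match anywhere in the string.
first = re.search(ip_pattern, line)
print(first.group())  # 192.168.1.42

# findall with no groups: a list of whole-match strings.
all_ips = re.findall(ip_pattern, line)
print(all_ips)  # ['192.168.1.42', '10.0.0.1']

# findall with two groups: a list of tuples, one item per group.
pairs = re.findall(r'(\d{2}):(\d{2}):\d{2}', line)
print(pairs)  # [('09', '14')]

# match: only succeeds at position 0, and this line starts with a date.
at_start = re.match(r'ERROR', line)
print(at_start)  # None

# sub: replace every match, here to redact the addresses.
redacted = re.sub(ip_pattern, '[REDACTED]', line)
print(redacted)
```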
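Always use raw strings for regex patterns: r'\d+' instead of '\d+'. Without the r prefix, Python applies its own backslash escapes before the regex engine ever sees the pattern, so '\b' becomes a single backspace character instead of a word boundary. '\d+' happens to survive because \d is not a recognized string escape, but recent Python versions flag it with a warning. With r'\d+', Python passes the characters \d+ unchanged to the regex engine. One rule: regex patterns are raw strings.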
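The difference is easiest to see with \b, which is a valid escape in both Python strings and regex but means different things in each. A small sketch, with an invented sample string:

```python
import re

# In a normal string, \b is Python's escape for backspace (one character).
print(len('\b'))   # 1
# In a raw string, it stays as two characters: a backslash and the letter b.
print(len(r'\b'))  # 2

text = "ERROR in auth"

# Without r, the regex engine receives literal backspace characters
# around ERROR, which do not occur in the text.
plain_result = re.search('\bERROR\b', text)
print(plain_result)  # None

# With r, the engine receives \b and treats it as a word boundary.
raw_result = re.search(r'\bERROR\b', text)
print(raw_result.group())  # ERROR
```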
When re.search() finds a match, it returns a re.Match object. .group(0) (or .group()) returns the entire match. .group(1) returns the first capturing group (...). .group(2) returns the second, and so on. .start() and .end() give the indices of the match in the original string. Always check if match: before calling .group() — calling .group() on None raises AttributeError.
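A short sketch of those match-object methods, using an invented key=value line:

```python
import re

line = "user=alice status=403"

m = re.search(r'status=(\d+)', line)
if m:
    print(m.group(0))          # status=403  (the entire match)
    print(m.group(1))          # 403         (first capturing group)
    print(m.start(), m.end())  # 11 21       (indices into line)
    # Slicing the original string with start/end gives back the match.
    print(line[m.start():m.end()] == m.group(0))  # True

# No five-digit run in the line, so search returns None.
missing = re.search(r'\d{5}', line)
code = missing.group() if missing else ""
print(repr(code))  # ''
```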
The \b metacharacter matches the zero-width position between a word character and a non-word character, or between a word character and the start or end of the string. re.search(r'\bERROR\b', 'ERRORCODE') returns None because ERROR in ERRORCODE has no word boundary after the R. This is essential for log level matching: without \b, INFO matches inside INFORMATION.
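A few quick checks of where \b does and does not find a boundary. The test strings are invented for illustration.

```python
import re

# No boundary between R and C, so the bounded pattern fails...
bounded = re.search(r'\bERROR\b', 'ERRORCODE')
print(bounded)  # None

# ...while the unbounded pattern happily matches inside the longer word.
unbounded = re.search(r'ERROR', 'ERRORCODE')
print(unbounded.group())  # ERROR

# Brackets are non-word characters, so [INFO] has a boundary on both sides.
bracketed = re.search(r'\bINFO\b', '[INFO] service started')
print(bracketed.group())  # INFO

# INFO inside INFORMATION is rejected.
inside = re.search(r'\bINFO\b', 'INFORMATION updated')
print(inside)  # None

# The underscore counts as a word character, so there is no boundary
# between R and _ here.
underscore = re.search(r'\bERROR\b', 'ERROR_CODE')
print(underscore)  # None
```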
When a pattern is used in a loop over thousands of log lines, re.compile() pre-compiles the pattern into a regex object. compiled = re.compile(r'pattern'); compiled.search(line) skips the per-call pattern lookup that the module-level functions perform. Python already caches recently used patterns internally, so the speedup is usually modest rather than dramatic; the bigger win is a named, reusable pattern object. The compiled object has the same .search(), .findall(), .sub() methods as the module-level functions.
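A minimal sketch of the compile-once, use-many pattern, counting IPs across several lines. The name IP_RE and the sample lines are assumptions for illustration.

```python
import re
from collections import Counter

# Compile once, outside the loop. IP_RE is a hypothetical name.
IP_RE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

# Invented sample data standing in for a large log file.
log_lines = [
    "Connection from 192.168.1.42 failed",
    "Connection from 10.0.0.1 accepted",
    "Retry from 192.168.1.42 failed",
    "Heartbeat ok",
]

counts = Counter()
for line in log_lines:
    # Same method names as the module-level functions, minus the pattern argument.
    counts.update(IP_RE.findall(line))

print(counts.most_common())  # [('192.168.1.42', 2), ('10.0.0.1', 1)]
```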