The ops lead sent a question this morning: "What changed between last Tuesday's error log and today's?" How would you answer that right now?
Open both files, read them into lists, loop through side by side... but the line counts might be different if errors were added or removed. I'd need to track insertions and deletions separately. That's not a trivial algorithm.
It's not, and Python ships one. difflib is the standard library's sequence comparison module. Think of it as a proofreader comparing two drafts — it highlights what changed, what moved, what's similar. Here's the simplest form:
import difflib
old_errors = ["ERROR auth: Token expired", "ERROR db: Timeout"]
new_errors = ["ERROR auth: Token expired", "WARNING api: Slow response", "ERROR db: Timeout"]
diff = list(difflib.unified_diff(old_errors, new_errors,
                                 lineterm="",
                                 fromfile="tuesday.log",
                                 tofile="today.log"))
for line in diff:
    print(line)
unified_diff is the same format as git diff? With the ---, +++, @@ headers and +/- prefixes? I've been staring at git diffs for two years. I didn't know Python could generate them.
The unified diff format was designed for humans to read quickly — + lines are additions, - lines are deletions, context lines have no prefix. difflib.unified_diff() generates it for any two sequences of strings. The fromfile and tofile arguments label the headers. lineterm="" tells unified_diff not to append newline characters to the header and hunk lines it generates — the right choice when your input lines, like these, don't already end in newlines, so nothing doubles up when you print.
What if I don't want the diff format — I just want to know which lines are new, which were removed, and which stayed the same? The ops lead wants counts, not a visual diff.
difflib.SequenceMatcher is the underlying engine. It gives you the raw comparison operations:
import difflib
old = ["ERROR auth: Token expired", "ERROR db: Timeout", "INFO api: Started"]
new = ["ERROR auth: Token expired", "WARNING api: Slow", "ERROR db: Timeout"]
sm = difflib.SequenceMatcher(None, old, new)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    print(tag, old[i1:i2], "->", new[j1:j2])
The get_opcodes() method returns a list of tuples where tag is one of 'equal', 'replace', 'delete', or 'insert'. You walk through the operations to count additions, deletions, and unchanged lines.
So for the ops lead's question: I run SequenceMatcher on the two lists, count 'insert' operations for new error types, 'delete' for resolved ones, 'equal' for persistent issues. That's the summary he actually wants — not a visual diff.
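That counting pass might be sketched like this (the label names "unchanged"/"added"/"removed" are made up for the summary; the span widths, not just the number of opcodes, give the line totals):

```python
import difflib
from collections import Counter

old = ["ERROR auth: Token expired", "ERROR db: Timeout", "INFO api: Started"]
new = ["ERROR auth: Token expired", "WARNING api: Slow", "ERROR db: Timeout"]

counts = Counter()
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    if tag == "equal":
        counts["unchanged"] += i2 - i1
    elif tag == "insert":
        counts["added"] += j2 - j1
    elif tag == "delete":
        counts["removed"] += i2 - i1
    else:  # 'replace' is a removal and an addition at once
        counts["removed"] += i2 - i1
        counts["added"] += j2 - j1

print(dict(counts))
```

Note that a 'replace' span can cover different numbers of lines on each side, which is why the two slices are counted separately.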
Exactly. And difflib works on any sequences — not just lists of strings. You can compare character-by-character within a line:
import difflib
old_line = "ERROR [auth] 192.168.1.42: Token expired for maya.patel"
new_line = "ERROR [auth] 10.0.0.1: Token expired for ali.hassan"
sm = difflib.SequenceMatcher(None, old_line, new_line)
print(f"Similarity: {sm.ratio():.0%}")  # how similar the two strings are
sm.ratio() gives a similarity score between 0 and 1? So two log lines from the same error type but different users would have high similarity — same pattern, different details.
Correct. 1.0 is identical. 0.0 is completely different. For grouping similar error messages — "these twenty errors are all the same pattern, just different users" — difflib.get_close_matches() does fuzzy matching:
import difflib
error_types = ["Token expired", "Token invalid", "Token missing", "Connection refused"]
query = "Toekn expired" # typo
matches = difflib.get_close_matches(query, error_types, n=1, cutoff=0.6)
print(matches)  # ['Token expired']
get_close_matches even handles typos. That's useful for normalizing the ops team's hand-written error categories — they're inconsistent about capitalization and spelling and I have to group them manually. This could automate that.
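A normalization pass along those lines might look like this (a sketch — the canonical list and messy labels are invented; lowercasing both sides first is one way to keep capitalization from dragging the score down):

```python
import difflib

canonical = ["Token expired", "Connection refused", "Disk full"]
messy = ["token Expired", "conection refused", "DISK FULL", "Segfault"]

lowered = [c.lower() for c in canonical]
normalized = {}
for label in messy:
    # compare case-insensitively so capitalization doesn't hurt the score
    matches = difflib.get_close_matches(label.lower(), lowered, n=1, cutoff=0.6)
    if matches:
        # map back from the lowercased match to the canonical spelling
        normalized[label] = canonical[lowered.index(matches[0])]
    else:
        normalized[label] = None  # nothing close enough; needs manual review

print(normalized)
```

Labels with no close match fall out as None rather than being forced into the wrong bucket — usually the behavior you want for hand-written categories.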
It handles the fuzzy matching you'd otherwise do with regex or Levenshtein distance. Week 2 is done after today — you know re, string, and difflib. All three deal with text, but at different levels: regex finds patterns, string handles formatting, difflib compares sequences. Together they cover everything the ops team's raw text data throws at you.
The diff problem — given two sequences, describe how one becomes the other using the minimum number of insertions and deletions — is a classical computer science problem. Python's difflib module implements the Ratcliff/Obershelp algorithm, which emphasizes common subsequences rather than strict edit distance. The result is human-readable diffs that prioritize blocks of unchanged content.
difflib.unified_diff(a, b, fromfile, tofile, n=3) generates the standard unified diff format used by git diff, diff -u, and virtually every version control system. The n parameter controls context lines — the unchanged lines shown around each change block. n=0 shows only changed lines, no context. n=3 is the convention. The output is a generator of strings; convert to a list or join with newlines for display.
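The effect of n is easy to see by diffing the same change twice (a sketch with synthetic lines; with one changed line, n=3 surrounds it with three context lines on each side, while n=0 emits only the headers, the hunk marker, and the +/- pair):

```python
import difflib

old = [f"line {i}" for i in range(10)]
new = old.copy()
new[5] = "line five CHANGED"

full = list(difflib.unified_diff(old, new, lineterm=""))       # default n=3
tight = list(difflib.unified_diff(old, new, n=0, lineterm=""))  # changes only

print(len(full), len(tight))
for line in tight:
    print(line)
```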
difflib.SequenceMatcher(isjunk, a, b) is the underlying comparison engine. The isjunk parameter is a function that returns True for elements that should be ignored when finding matches (whitespace, blank lines); None means nothing is treated as junk. Key methods: .ratio() returns a float from 0.0 to 1.0 representing similarity. .get_matching_blocks() returns a list of (i, j, size) triples describing matching subsequences. .get_opcodes() returns the edit sequence as a list of (tag, i1, i2, j1, j2) tuples.
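A junk filter in action, treating spaces as ignorable when anchoring matches (this mirrors the example in the standard-library docs):

```python
import difflib

a = "private Thread currentThread"
b = "private volatile Thread currentThread"

# spaces are junk: runs of whitespace won't be used to anchor matching blocks
sm = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)

for block in sm.get_matching_blocks():
    print(block)  # Match(a=i, b=j, size=n) triples
```

One detail worth knowing: the returned list always ends with a zero-length sentinel, Match(a=len(a), b=len(b), size=0), so loops over it terminate cleanly.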
The four opcode tags represent the complete vocabulary of sequence transformation: 'equal' means a[i1:i2] == b[j1:j2]. 'replace' means a[i1:i2] was replaced by b[j1:j2]. 'delete' means a[i1:i2] was removed. 'insert' means b[j1:j2] was added. Walking the opcode list lets you count changes, filter by type, or reconstruct either sequence from the other.
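The reconstruction claim is easy to demonstrate: rebuilding b from a takes the a-side slice for 'equal', the b-side slice for 'replace' and 'insert', and nothing for 'delete' (a sketch reusing the log lists from earlier):

```python
import difflib

old = ["ERROR auth: Token expired", "ERROR db: Timeout", "INFO api: Started"]
new = ["ERROR auth: Token expired", "WARNING api: Slow", "ERROR db: Timeout"]

rebuilt = []
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, old, new).get_opcodes():
    if tag == "equal":
        rebuilt.extend(old[i1:i2])        # unchanged: either side works
    elif tag in ("replace", "insert"):
        rebuilt.extend(new[j1:j2])        # new content comes from b
    # 'delete' contributes nothing when rebuilding the new sequence

print(rebuilt == new)
```

Swapping which side each tag draws from reconstructs a instead, since the opcodes describe the transformation in both directions.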
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6) returns the n closest matches from possibilities with similarity above cutoff. This is fuzzy matching without installing a third-party library. It handles typos, abbreviated names, and inconsistent capitalization. The cutoff of 0.6 is the standard threshold; lower values allow more dissimilar matches.
SequenceMatcher is O(n²) in the worst case. For large sequences — comparing two 10,000-line log files — it may be slow. The practical limit is a few hundred lines for interactive use. For larger files, generate diffs with n=0 to limit output (unified_diff() and context_diff() both accept it), or compare summaries (error type counts) rather than raw lines.
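For the large-file case, comparing per-type counts instead of raw lines sidesteps the quadratic matcher entirely. A sketch — the "error type is everything before the first colon" rule is an assumption about this log format, not a general one:

```python
from collections import Counter

def error_counts(lines):
    # assumed format: the error type is everything before the first ':'
    return Counter(line.split(":", 1)[0] for line in lines)

old = ["ERROR auth: Token expired", "ERROR auth: Token expired", "ERROR db: Timeout"]
new = ["ERROR auth: Token expired", "ERROR net: Unreachable"]

before, after = error_counts(old), error_counts(new)
new_types = after - before   # types whose counts grew
resolved = before - after    # types whose counts shrank

print(dict(new_types), dict(resolved))
```

Counter subtraction drops non-positive entries, so each result lists only the types that actually moved in that direction — and it runs in linear time regardless of file size.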