Capstone. You've built every component. Today you wire them into a complete CLI log analyzer. Before we start — what modules do you expect you'll need?
Let me think through the pipeline. json to parse the log entries. re for matching error patterns in the messages. datetime and timedelta to filter by date range. collections.Counter to count error types and find the most common ones. argparse for the CLI interface — log file path, date range, minimum level, output format. logging so the tool itself logs what it's doing rather than printing to stdout. And pathlib to handle the file path argument cleanly.
Seven stdlib modules without looking anything up. Three weeks ago you were copy-pasting import json from Stack Overflow.
To be fair, I still look up strftime vs strptime every time. Those names are crimes against humanity.
Crimes I've committed for fifteen years. Here's the architecture we're building:
import argparse, json, re, logging
from datetime import datetime, timedelta
from collections import Counter
from pathlib import Path

def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Analyze server log files.")
    parser.add_argument("log_file", type=Path, help="Path to JSON log file")
    parser.add_argument("--hours", type=int, default=24, help="Look back N hours")
    parser.add_argument("--level", default="ERROR",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"])
    parser.add_argument("--pattern", default=None, help="Regex pattern to match")
    parser.add_argument("--verbose", action="store_true")
    return parser

type=Path in argparse — it converts the string argument to a Path object directly. So args.log_file is already a Path, not a string. I can call args.log_file.exists() and args.log_file.open() immediately.
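Maya's point can be sketched concretely. A minimal demo, with a temporary file standing in for a real log (the temp-file setup is illustrative, not part of the lesson code):

```python
import argparse
import tempfile
from pathlib import Path

parser = argparse.ArgumentParser(description="Analyze server log files.")
parser.add_argument("log_file", type=Path, help="Path to JSON log file")

# A temporary file stands in for a real log so the demo has something to parse.
with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as tmp:
    log_path = tmp.name

args = parser.parse_args([log_path])

# argparse already converted the string argument to a Path object:
assert isinstance(args.log_file, Path)

# argparse does NOT check that the path exists; validate after parsing.
if not args.log_file.exists():
    parser.error(f"log file not found: {args.log_file}")
```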
type= in argparse accepts any callable — including Path. A path that doesn't exist won't fail at parse time (argparse doesn't check existence), but your code can validate after parsing. The analysis pipeline itself:
def analyze_logs(log_file: Path, hours: int, level: str, pattern: str | None) -> dict:
    logger = logging.getLogger("log-analyzer")
    cutoff = datetime.utcnow() - timedelta(hours=hours)
    compiled = re.compile(pattern) if pattern else None
    entries, skipped = [], 0
    for line in log_file.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        ts_str = entry.get("timestamp", "")
        try:
            ts = datetime.fromisoformat(ts_str.replace("Z", ""))
        except ValueError:
            skipped += 1
            continue
        if ts < cutoff:
            continue
        if entry.get("level", "") != level:
            continue
        if compiled and not compiled.search(entry.get("message", "")):
            continue
        entries.append(entry)
    logger.info(f"Parsed {len(entries)} matching entries, skipped {skipped}")
    return {
        "total": len(entries),
        "skipped": skipped,
        "top_services": Counter(e.get("service", "unknown") for e in entries).most_common(5),
        "entries": entries,
    }

The whole pipeline in one function. json.loads() with try/except for each line. datetime.fromisoformat() for the timestamp. Date-range filter with < cutoff. Level filter with string equality. Optional regex match with the compiled pattern. Counter.most_common(5) for the service frequency. Every tool from the last four weeks, in order.
And the function is testable without the filesystem — you can call it with a mock Path or test each stage independently. That's the architecture you described in Week 1 when you said "compute first, act second."
The main() function would parse args, configure logging based on --verbose, call analyze_logs(), and format the output. The tool doesn't know how the output is used — it just returns a dict. The main() function decides whether to print JSON or a human-readable report.
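That output-formatting decision can be sketched as a small helper. format_report is a hypothetical name, not from the lesson code; it assumes the dict shape analyze_logs returns:

```python
import json

def format_report(summary: dict, as_json: bool = False) -> str:
    """Render the analysis dict as JSON or as a human-readable report."""
    if as_json:
        return json.dumps(summary, indent=2)
    lines = [f"Matching entries: {summary['total']} (skipped {summary['skipped']})"]
    for service, count in summary["top_services"]:
        lines.append(f"  {service}: {count}")
    return "\n".join(lines)

# Example summary dict in the shape analyze_logs() produces.
summary = {"total": 3, "skipped": 1,
           "top_services": [("auth", 2), ("billing", 1)], "entries": []}
print(format_report(summary))
```

main() would pick as_json based on a flag, while analyze_logs() stays ignorant of presentation entirely.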
You just described separation of concerns without me asking. Analysis logic separate from I/O. Testable core, thin shell. Maya, that is software engineering, not scripting.
Today's problem: implement analyze_logs() as a pure function that takes log text (not a file path) and the filter parameters. It returns the summary dict. No argparse, no logging setup — just the analysis core that the CLI wraps.
Pure function, no side effects, all the stdlib tools. Write it so the ops team can rely on it — handle malformed JSON, handle missing fields, handle empty results. The CLI wrapper is just scaffolding. This function is the product.
This is what Diane's been asking for since the start of the track. A tool that works. Not a script that requires me to be there. Four weeks ago I didn't know what pathlib was. Now I'm building a log analyzer with seven standard library modules.
And you built it. This is the capstone. One week from now, the ops team runs this from cron, Diane gets her error report by lunch, and you never have to manually parse a log file again. That's the entire point of the standard library — Python ships with the toolkit. You just learned to use it.
The capstone lesson represents a specific transition: from using Python to building tools with Python. The difference is architectural — a tool has an explicit interface (argparse), professional output (logging), error handling at every boundary (json.JSONDecodeError, ValueError), and a testable core that is separate from its I/O wrapper.
The most important design decision in any CLI tool is separating the analysis core from the I/O shell. analyze_logs(log_text: str, hours: int, level: str, pattern: str | None) -> dict is callable from tests, from other scripts, from a web endpoint, or from a Jupyter notebook — without subprocess, without files, without argparse. The CLI is just one way to provide inputs to this function.
This separation enables testing at the right level: unit tests for the analysis logic, integration tests for the CLI. The analysis logic tests don't need temp files or subprocess calls — they pass strings and check dicts.
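A sketch of what those string-in, dict-out tests look like. analyze_log_text here is a hypothetical, stripped-down pure core (level filter and service counts only), not the lesson's full implementation:

```python
import json
from collections import Counter

def analyze_log_text(log_text: str, level: str) -> dict:
    """Minimal pure core: parse JSON lines, filter by level, count services."""
    entries, skipped = [], 0
    for line in log_text.splitlines():
        if not line.strip():
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if entry.get("level") == level:
            entries.append(entry)
    return {
        "total": len(entries),
        "skipped": skipped,
        "top_services": Counter(e.get("service", "unknown")
                                for e in entries).most_common(5),
    }

# Test input is just a string -- no temp files, no subprocess.
sample = "\n".join([
    '{"level": "ERROR", "service": "auth", "message": "boom"}',
    'not json at all',
    '{"level": "INFO", "service": "auth", "message": "ok"}',
])
result = analyze_log_text(sample, "ERROR")
assert result == {"total": 1, "skipped": 1, "top_services": [("auth", 1)]}
```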
A production log analyzer must handle bad input at every stage. json.JSONDecodeError for malformed log lines. ValueError for unparseable timestamps. KeyError for missing required fields (use .get() with defaults). re.error for invalid regex patterns (validate at argparse time with a custom type function). Each boundary failure should be counted and logged, not silently swallowed — skipped: N in the output tells users their input has issues.
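The regex check mentioned above can be a small custom type function; raising argparse.ArgumentTypeError makes argparse print a standard "invalid value" error and exit. A sketch, assuming a --pattern option like the one in make_parser():

```python
import argparse
import re

def regex_type(value: str) -> str:
    """Reject invalid regexes at parse time instead of deep in the pipeline."""
    try:
        re.compile(value)
    except re.error as exc:
        raise argparse.ArgumentTypeError(f"invalid regex {value!r}: {exc}")
    return value

parser = argparse.ArgumentParser()
parser.add_argument("--pattern", type=regex_type, default=None)

# A valid pattern passes through unchanged:
args = parser.parse_args(["--pattern", r"timeout \d+ms"])

# An invalid pattern like "[" would make parse_args() print the error
# and exit with status 2 (argparse calls sys.exit via parser.error).
```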
The seven modules used in the capstone form a complete data pipeline: pathlib for file handling, json for parsing, datetime and timedelta for time filtering, re for pattern matching, collections.Counter for frequency analysis, argparse for the CLI interface, and logging for operational output. These seven modules handle every concern except network I/O and database access — which require additional libraries but use the same architectural patterns.
The standard library covered in this track — json, csv, pathlib, os, sys, shutil, re, string, difflib, datetime, math, statistics, random, collections, argparse, logging, pprint, timeit, glob, fnmatch — is roughly the set of modules that every Python developer uses regularly. Beyond these, the standard library has 200+ additional modules for network protocols, data compression, cryptography, HTML parsing, email, XML, concurrent execution, and more.
The instinct to check the standard library before pip-installing is the skill this track was designed to build. You have it now. Before the next pip install, ask: does Python ship this?