Yesterday timeit showed you that the loop you optimized twice was not the problem. Today you find out what was.
I ran timeit on build_summary_dict after our session. Two milliseconds. Practically instant. The script still runs for 28 seconds. The time is somewhere I am not looking.
That is the most important thing you could have discovered. The piece you suspected is fast. The bottleneck is elsewhere. This is not a saying — it is a statistical observation about how developers reason about performance. We suspect the loops. We are wrong, routinely. cProfile is how you stop guessing:
import cProfile
cProfile.run("process_inventory(records)", "profile.stats")
# or: python -m cProfile -o profile.stats inventory_script.py
cProfile instruments your entire execution, records every function call, and writes a stats file. You then read it:
import pstats
from pstats import SortKey
p = pstats.Stats("profile.stats")
p.sort_stats(SortKey.CUMULATIVE)
p.print_stats(10)
Every function call, timed. Sort by cumulative time, find the expensive chains. What does the output actually look like for my script?
Here is a representative sample:
ncalls tottime cumtime filename:lineno(function)
1 0.003 27.942 inventory_script.py:45(process_inventory)
4821 0.041 27.831 inventory_script.py:78(enrich_record)
4821 27.612 27.612 inventory_script.py:92(fetch_category_name)
4821 0.121 0.178 inventory_script.py:112(build_summary_dict)
What do you see?
fetch_category_name — 27.6 seconds. Called 4821 times. Once per record. And build_summary_dict — the function I was benchmarking — 0.178 seconds total. I was optimizing something that uses less than one percent of the runtime.
The doctor analogy: you were treating the patient's resting heart rate while their cholesterol was catastrophic. You cannot see that with a stethoscope. You need the blood panel.
Three hours. I spent three hours benchmarking build_summary_dict. The real problem was a database call inside a loop that I wrote in five minutes and never looked at again.
Every engineer who first runs cProfile on code they thought they understood has this moment. The bottleneck is where you were not looking. Open fetch_category_name. What does it do?
It queries the database. Every. Single. Call. There are only twelve categories. I am querying the same twelve categories 4,821 times.
The fix is not to optimize the query. Run it once before the loop. Twelve rows. Nearly five thousand lookups into a dict, each costing nanoseconds. Twenty-eight seconds becomes under one second. Now — tottime versus cumtime. When do you sort by tottime instead of cumtime?
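A minimal sketch of that fix. The real script's query and schema aren't shown, so CATEGORY_ROWS and the record shape here are hypothetical stand-ins:

```python
# Hypothetical stand-in for the twelve-row query result, abbreviated to three.
CATEGORY_ROWS = [(1, "hardware"), (2, "software"), (3, "services")]

# One query before the loop, loaded into a dict once...
category_names = dict(CATEGORY_ROWS)

records = [{"category_id": 2}, {"category_id": 1}, {"category_id": 2}]

# ...then a cheap dict lookup per record instead of a database round trip.
for record in records:
    record["category"] = category_names[record["category_id"]]
```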
cumtime includes everything a function called. tottime is just the function's own code. So sort by cumtime first to find the expensive chains, then by tottime within the chain to find which function inside it is actually doing the slow work.
Exactly. process_inventory has high cumtime and low tottime — the time is in what it calls. Follow the chain: enrich_record, then fetch_category_name. High tottime on fetch_category_name tells you that is where the work happens. Two sorts, full diagnosis.
What about the pstats API beyond sorting? I want to trace which functions are calling fetch_category_name — maybe it is not just enrich_record calling it.
print_callers and print_callees give you the call graph:
p.print_callers("fetch_category_name") # who calls this function and how many times
p.print_callees("process_inventory") # what this function calls
When a function shows up with high ncalls, print_callers shows you which paths are driving the volume. If three different modules all call fetch_category_name, the fix is not local — it is a shared cache that every module draws from.
So cProfile tells me which function is slow, and print_callers tells me who is causing it to be called so often. The diagnosis is complete: which function, at what cost, called by whom. Then timeit verifies that my fix actually made it faster before it ships.
That is the full workflow. Bring the inventory script tomorrow. One command, five seconds, a sorted table. The bottleneck announces itself.
cProfile is a deterministic profiler — it records every function call, not a statistical sample. Understanding the instrumentation model explains both its accuracy and its overhead.
The two time columns. tottime (total time) is the time spent in a function's own code, excluding time in functions it called. cumtime (cumulative time) is tottime plus all time in functions the profiled function called, recursively. For a function like process_inventory that mostly calls other functions, tottime is near zero and cumtime is the total cost of the call chain. For a leaf function like fetch_category_name that makes a database call, tottime and cumtime are nearly equal because it does not call much else. The diagnostic sequence: sort by cumtime to identify the expensive subtree, sort by tottime within that subtree to identify the specific function doing the slow work.
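A small self-contained demonstration of the two columns, using toy functions (leaf and wrapper are invented names, not from the inventory script). leaf does real CPU work, so its tottime and cumtime nearly match; wrapper mostly delegates, so its tottime stays near zero while its cumtime absorbs leaf's cost:

```python
import cProfile
import io
import pstats
from pstats import SortKey

def leaf():
    # Stand-in for the slow per-call work; the CPU time lands in leaf's tottime.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def wrapper():
    # Mostly delegates: tottime near zero, cumtime covers all three leaf calls.
    for _ in range(3):
        leaf()

pr = cProfile.Profile()
pr.enable()
wrapper()
pr.disable()

out = io.StringIO()
stats = pstats.Stats(pr, stream=out).sort_stats(SortKey.CUMULATIVE)
stats.print_stats()
print(out.getvalue())
```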
ncalls and the per-call cost. ncalls is the call count. Dividing tottime by ncalls gives the per-call cost. When you see ncalls=4821 and tottime=27.6, the per-call cost is 5.7 milliseconds. The fix — query once, cache the results, look up from a dict — reduces the per-call cost from 5.7ms to a few nanoseconds. The total savings is (5.7ms - 0.000005ms) × 4821 ≈ 27.5 seconds. That arithmetic is only visible when you have ncalls in front of you.
cProfile overhead and when to use pstats filtering. cProfile adds overhead per function call — roughly 1-5 microseconds per call depending on the platform. For functions called millions of times, this adds up and can shift which function appears as the bottleneck. In practice, the overhead is rarely large enough to misidentify a genuine bottleneck, but the absolute numbers should be treated as approximations. The pstats.Stats API supports filtering: p.print_stats("your_module") shows only functions from your code, excluding stdlib and third-party library calls from the view. p.print_callers("fetch_category_name") shows which functions are calling the bottleneck and how many times each caller is responsible for.
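A sketch of the filtering workflow end to end. The functions hot and main are invented for the example; in real use you would point the string restriction at your own module name:

```python
import cProfile
import io
import pstats
from pstats import SortKey

def hot():
    # Invented workload so the profile has something to show.
    return sum(i * i for i in range(100_000))

def main():
    for _ in range(5):
        hot()

pr = cProfile.Profile()
pr.enable()
main()
pr.disable()
pr.dump_stats("profile.stats")   # same file format cProfile.run(..., "profile.stats") writes

out = io.StringIO()
p = pstats.Stats("profile.stats", stream=out)
p.sort_stats(SortKey.TIME)
p.print_stats("hot")             # restriction: only rows whose name matches "hot"
p.print_callers("hot")           # which callers drive hot()'s ncalls
print(out.getvalue())
```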
The N+1 query pattern. Querying a database once per loop iteration is the most common performance issue that cProfile exposes in web and data applications. The fix is always the same: identify the set of distinct keys needed, query once for all of them, load the results into a dict, and replace the per-iteration query with a dict lookup. This pattern — query in bulk, cache in memory, look up locally — is how every ORM's select_related/prefetch_related feature works under the hood.
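The three steps above can be sketched as follows. FAKE_DB and fetch_names_bulk are stand-ins for a real table and a single IN-clause query:

```python
FAKE_DB = {1: "hardware", 2: "software", 3: "services"}   # stand-in for the categories table

def fetch_names_bulk(ids):
    # One round trip: SELECT id, name FROM categories WHERE id IN (...)
    return {i: FAKE_DB[i] for i in ids}

def enrich(records):
    distinct = {r["category_id"] for r in records}   # 1. collect the distinct keys
    names = fetch_names_bulk(distinct)               # 2. query once for all of them
    for r in records:
        r["category"] = names[r["category_id"]]      # 3. per-iteration dict lookup
    return records

rows = enrich([{"category_id": 2}, {"category_id": 2}, {"category_id": 1}])
```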