Threading: Parallel Work, the GIL, and When Threads Actually Help
Master threading.Thread and the Global Interpreter Lock. Learn when threads speed up I/O-bound work and why CPU-bound code stays single-threaded despite multiple threads.
Your message said the order processing pipeline is taking 8 seconds to process a batch of 10 orders, and it crashes under load. That's the thing I've been staring at all week.
That is exactly the thing. The pipeline fetches order details from a payment API, validates each order's data, applies discounts, and writes the results to Postgres. It does all of that one order at a time. While waiting for the payment API to respond — which takes roughly 500 milliseconds per order — the entire pipeline sits idle. 10 orders times 500 milliseconds is 5 seconds right there. Just waiting.
So while the code is asking the API "hey, what is the status of ORD-001," nothing else is happening. Not validation, not discount logic, nothing. Just... waiting.
Nothing else is happening in your pipeline. The CPU is not busy. The code is not computing. The process is blocked on I/O — input/output, waiting for the network. This week is about fixing that. Today: threading. Tomorrow: what to do when the work is NOT I/O.
On Day 3 we talked about dunders and how __iter__ and __len__ let Python see inside your object. Does threading have dunders too, or is it something totally different?
Good instinct to connect backward. Threading does not have dunders — it is a library, not a protocol. But the mental model is similar: you are wiring your code into Python's threading machinery. Today you will see what that wiring looks like and when it is worth the effort.
Let me start with the simplest possible case. Forget the pipeline for a moment. Imagine a kitchen during service. One chef. Multiple orders arrive. The chef has to:
- Start the risotto
- Wait 10 minutes for it to cook
- Start the salmon
- Wait 8 minutes for it to cook
- Plate and serve
Done: 18 minutes for two dishes.
Or the chef could have a helper.
Exactly. Now we have two chefs, but one stove. The first chef starts the risotto and hands it to the second chef. "Watch this, I have to do something else." The first chef starts the salmon. While the first chef is working on the salmon, the second chef is watching the risotto cook. Neither chef is cooking at the same instant — the stove can only do one thing at a time. But the kitchen is never idle.
Same kitchen. Same stove. But the work overlaps, because both chefs can start tasks and then do something else while the stove is busy.
The stove is shared. Both chefs need it eventually. But while one chef is waiting for the stove, the other chef can prep or work on something that does not need the stove.
That is the entire mental model. The stove is the Global Interpreter Lock — the GIL. Every Python thread shares it. Multiple threads can exist, but only one thread executes Python bytecode at a time. The GIL is held by the thread that is running. Any other thread has to wait.
BUT — and this is crucial — when a thread is waiting for I/O, it releases the GIL. The thread is blocked, sleeping, waiting for the network or disk. While it sleeps, other threads can acquire the GIL and run. That is how threading helps with I/O-bound work.
So the thread does not disappear while it is waiting? It is still there, still exists, but it is not... using the GIL?
Exactly right. The thread is blocked on I/O — waiting for the network. The operating system pauses that thread. Python releases the GIL. Another thread wakes up and acquires the GIL. When the I/O completes, the blocked thread wakes up and waits for the GIL to be free so it can run again.
The chef analogy still works: First chef is waiting for the risotto. The risotto is on the stove. The stove is not the GIL anymore — the stove is the OS, waiting for heat. The first chef steps away from the kitchen entirely. The second chef grabs the stove and starts the salmon. When the risotto timer goes off, the first chef comes back and one of them will use the stove next.
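You can watch this happen with a tiny experiment. This is a hedged sketch — `time.sleep` stands in for blocked network I/O, since it releases the GIL the same way a socket read would. Two threads each "wait" for one second, and the total wall time comes out around one second, not two:

```python
import threading
import time

def blocked_on_io():
    # time.sleep releases the GIL, just like waiting on a socket would
    time.sleep(1)

t1 = threading.Thread(target=blocked_on_io)
t2 = threading.Thread(target=blocked_on_io)

start = time.time()
t1.start()
t2.start()
t1.join()   # Wait for both threads to finish
t2.join()
elapsed = time.time() - start

# Both threads sleep at the same time, so this is ~1.0 seconds, not ~2.0
print(f'Elapsed: {elapsed:.1f} seconds')
```

If the GIL were held during the sleep, the second thread could never start waiting until the first finished, and you would see two seconds instead.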
So threading works great for I/O? The threads are not actually running at the same time, but while one is stuck waiting, the other can do something useful. And the GIL is Python's way of saying "one bytecode instruction at a time, for safety."
You just summarized it perfectly. The GIL is Python's most controversial feature. It is also the reason your threaded code doesn't corrupt memory when two threads write to the same dict at the same time. You're welcome.
Okay, but how does threading actually... work? Like, code-wise. Do I have to do something special to create a thread?
You use threading.Thread. It is the simplest possible API:
import threading
import time

def fetch_order_status(order_id):
    # Simulate I/O — pretend we are waiting for the API
    time.sleep(1)
    return f'{order_id}: paid'

t = threading.Thread(target=fetch_order_status, args=('ORD-001',))
t.start()  # Thread begins running immediately
t.join()   # Wait for the thread to finish
print('Done')
Call start() and the thread runs in the background. Call join() and your code waits for it to finish. Dead simple.
So I give it a function and arguments, call start(), and Python spins up a new thread running that function. But if I want to run 10 functions at the same time, I create 10 threads?
Yes. But now the problem gets real: how do you collect the results? The thread runs independently. The function finishes. The result is trapped inside the thread with no way to get it out.
This is why threading is harder than it looks. Each thread needs to put its result somewhere that the main thread can see. A shared dictionary, a queue, a list — something mutable that multiple threads can write to. And that brings us to the next problem.
Okay, multiple threads writing to the same data structure. That sounds like a race condition waiting to happen.
It absolutely is, unless you protect it. This is where locks come in.
import threading
import time

results = {}
lock = threading.Lock()

def fetch_and_store(order_id):
    # Simulate I/O
    time.sleep(1)
    status = f'{order_id}: paid'
    # Critical section — protect with a lock
    lock.acquire()
    results[order_id] = status
    lock.release()

threads = []
for order_id in ['ORD-001', 'ORD-002', 'ORD-003']:
    t = threading.Thread(target=fetch_and_store, args=(order_id,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print(results)
The lock ensures that only one thread writes to results at a time. Before you modify shared data, you acquire the lock. After you are done, you release it. If another thread tries to acquire the lock while it is held, that thread blocks and waits.
So locks are the safety mechanism. Two threads cannot write to the dict simultaneously because the lock stops the second thread and makes it wait until the first one is done.
Exactly. And there is a cleaner syntax using with:
with lock:
    results[order_id] = status
The context manager acquires the lock on entry and releases it on exit. Even if an exception happens inside the block, the lock gets released. This is why context managers are so valuable in concurrent code — they guarantee cleanup.
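I mentioned a queue as another place to collect results. The standard library's queue.Queue is thread-safe on its own, so you can skip the explicit lock entirely for this pattern. A sketch — fetch_status here is a hypothetical stand-in for the real API call:

```python
import queue
import threading
import time

def fetch_status(order_id):
    # Hypothetical stand-in for a real API call
    time.sleep(0.1)
    return f'{order_id}: paid'

results_queue = queue.Queue()

def worker(order_id):
    # Queue.put is thread-safe — no explicit lock needed
    results_queue.put((order_id, fetch_status(order_id)))

threads = [threading.Thread(target=worker, args=(oid,))
           for oid in ['ORD-001', 'ORD-002', 'ORD-003']]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue on the main thread after all workers have joined
results = dict(results_queue.get() for _ in range(results_queue.qsize()))
print(results)
```

The queue does its own internal locking, which is why workers can call put concurrently without corrupting anything. The lock-protected dict and the queue are interchangeable here; the dict keeps the example minimal, the queue scales better to producer/consumer pipelines.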
Okay, so I create a threading.Lock, I use with lock: to protect shared data, I spawn threads with threading.Thread, and I call join() to wait for them. What is the catch? Why is threading not the default for everything with I/O?
Several catches. One: threading adds complexity. Your code is now multi-threaded. Bugs become harder to reproduce. Race conditions are subtle. You have to think about what is shared and what is not.
Two: threads have memory overhead. Each thread reserves its own stack — on Linux the default is roughly 8 megabytes of virtual address space. Create 100 threads and you've reserved around 800 megabytes of address space just for stacks. With your payment API taking 500 milliseconds per request, processing 100 orders simultaneously means 100 threads sitting idle most of the time.
Three: the thread scheduler is a blunt instrument. Python does not decide when to switch threads — the OS does. You can get context switching at unexpected moments, which is why the lock is essential.
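Here is the classic demonstration of why those unexpected switches matter: the unsynchronized counter. This sketch deliberately splits the increment into a separate read and write so there is a window for the OS to switch threads in between. When that happens, one thread overwrites the other's update and the final count comes out below 200000 — and whether it happens on any given run is unpredictable, which is exactly the point:

```python
import threading

counter = 0

def unsafe_increment(n=100_000):
    global counter
    for _ in range(n):
        # A read-modify-write that is NOT atomic: the OS may switch
        # threads between the read and the write, losing an update
        current = counter
        counter = current + 1

threads = [threading.Thread(target=unsafe_increment) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 200000; lost updates may leave it lower on some runs
print(counter)
```

Wrap the read-modify-write in `with lock:` and the count is 200000 every time. Bugs like this are the "harder to reproduce" kind: the code can pass a thousand test runs and still lose updates in production.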
But for the pipeline problem — 10 orders, 500 milliseconds each — threading would cut the time from 5 seconds of waiting down to 500 milliseconds. That is a 10x speedup for I/O-bound work.
That is exactly the calculation. 10 threads, each fetching one order from the API. All threads are blocked on I/O at roughly the same time. When all 10 requests complete, all 10 threads wake up and finish. The slowest thread determines the total time. If all 10 requests take 500 milliseconds, the whole batch finishes in roughly 500 milliseconds instead of 5 seconds.
That is the fix. That is the thing I am supposed to do to the pipeline.
That is part of the fix. Threading helps with I/O. But there are two problems in your pipeline: one is network I/O. The other is validation logic — order data checking, discount calculation. That is CPU work. CPU work does not release the GIL.
So if I thread the I/O part and it all completes fast, then the main thread has to do the CPU work — validation and discounts — for each of those 10 orders. Single-threaded. No speedup there.
Correct. Threading the I/O solves half the problem. The CPU-bound validation still runs one order at a time. You still have threads in the pipeline, but if you added validation inside each thread, you would not see the speedup you might expect because the threads are competing for the GIL while they do the CPU work.
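You can verify that claim yourself with a small benchmark sketch: a pure-Python busy loop run twice back to back, then the same total work split across two threads. Because both threads contend for the GIL, the threaded version takes about as long as the sequential one — sometimes slightly longer, from the switching overhead:

```python
import threading
import time

def busy_work(n=2_000_000):
    # Pure-Python CPU work — the running thread holds the GIL throughout
    total = 0
    for i in range(n):
        total += i
    return total

# Sequential: two runs back to back
start = time.time()
busy_work()
busy_work()
sequential = time.time() - start

# Threaded: two threads, same total amount of work
start = time.time()
t1 = threading.Thread(target=busy_work)
t2 = threading.Thread(target=busy_work)
t1.start()
t2.start()
t1.join()
t2.join()
threaded = time.time() - start

# No 2x speedup here — the GIL serializes the bytecode execution
print(f'Sequential: {sequential:.2f}s, Threaded: {threaded:.2f}s')
```

Contrast this with the sleep experiment from earlier: sleeping threads overlap, computing threads do not. That single difference is the whole decision rule for when threading helps.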
There is probably another solution for CPU-bound work.
There is. But that is tomorrow. Today: threading. Tomorrow: the option that actually parallelizes CPU work.
Let me show you the full pattern for threading the I/O part. A mock payment API, multiple orders, and a threaded fetch:
import threading
import time
from random import uniform

def mock_payment_api_call(order_id):
    """Simulate a 400-600ms API call."""
    time.sleep(uniform(0.4, 0.6))
    statuses = {'ORD-001': 'paid', 'ORD-002': 'pending', 'ORD-003': 'failed'}
    return statuses.get(order_id, 'unknown')

def fetch_order_statuses(order_ids):
    """Fetch statuses for all orders concurrently using threads."""
    results = {}
    lock = threading.Lock()
    threads = []

    def fetch_and_store(oid):
        status = mock_payment_api_call(oid)
        with lock:
            results[oid] = status

    for order_id in order_ids:
        t = threading.Thread(target=fetch_and_store, args=(order_id,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    return results

# Test it
start = time.time()
statuses = fetch_order_statuses(['ORD-001', 'ORD-002', 'ORD-003'])
print(f'Took {time.time() - start:.2f} seconds')
print(statuses)
Run this and you see the difference: three orders that would take roughly 1.5 seconds sequentially finish in roughly 500-600 milliseconds, because all three threads are sleeping simultaneously. Scale it to your 10 orders and the 5 seconds of sequential waiting collapses to roughly half a second.
The threads all start, all call the mock API, all block on sleep at basically the same time. The OS wakes all of them when their sleep times are up. Then they all finish almost at once. The whole thing takes as long as the slowest request.
You just described the exact behavior. That is the mental model: you are not making I/O faster. You are making the system not idle while I/O happens.
So for the next step I need to understand: what happens when I do validation and discounts inside each thread? Do I get the speedup, or does the GIL prevent it?
The GIL prevents it. That is tomorrow's problem. Today: threads and I/O. Tomorrow: when threads do not work and what to use instead.
I can apply this to the pipeline today. Threading the I/O, collecting results safely with a lock. That is a start.
That is a solid start. The I/O part of your pipeline will go from roughly 5 seconds to roughly 500 milliseconds, which cuts the batch from 8 seconds to around 3.5. The validation part still needs work. But 8 seconds down to 3.5 is not nothing. Measure it. Show the before and after. Then we tackle the CPU-bound work.