itertools: The Standard Library Pipeline You've Been Reinventing
Stop writing custom iterator chains. itertools.chain, islice, groupby, takewhile, dropwhile, product, combinations build lazy pipelines without the boilerplate.
I just realized something. I built a groupby function for order reporting last month. Then in a different file, I built it again for user analytics. And last week I wrote islice logic to get the top 10 items from a stream. Three times. Three different implementations of the same wheel.
The Python standard library wrote itertools so you would not have to reinvent these wheels. The number of for-loops I have seen in production that are just chain() and islice() in disguise is deeply troubling.
Wait — itertools has all of these?
It has chain, islice, groupby, takewhile, dropwhile, product, combinations, and a dozen or so more. All lazy. All efficient. All tested by thousands of developers. You have been writing them from scratch because you did not realize they existed.
Lazy means they do not compute until you iterate? Like generators?
Exactly. itertools functions return iterators. They do not materialize lists. You can chain 50 itertools operations together and they all happen in a single pass over the data — no intermediate lists, no memory bloat.
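To make that concrete, here is a toy pipeline (invented data) where three lazy steps run in a single pass, one item at a time. Nothing is computed until `list()` starts iterating:

```python
from itertools import chain, islice

# Merge two ranges, keep the evens, square them, take the first 3.
# Each step is an iterator; no intermediate list is ever built.
merged = chain(range(10), range(100, 110))
evens = (n for n in merged if n % 2 == 0)
squares = (n * n for n in evens)
first_three = islice(squares, 3)
print(list(first_three))  # [0, 4, 16]
```

Because `islice` stops after three items, the values from 6 onward are never squared at all.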
Okay, I need to see this. Show me the order reporting pipeline I wrote three times.
You had two sources of orders — API and database. You merged them, grouped by customer, and took the top 10 customers by order count. Here is what I bet you wrote:
def build_customer_report(source_a, source_b):
    # Merge two sources
    all_orders = []
    for order in source_a:
        all_orders.append(order)
    for order in source_b:
        all_orders.append(order)
    # Sort by customer_id for grouping
    all_orders.sort(key=lambda o: o['customer_id'])
    # Group by customer_id
    customer_orders = {}
    for order in all_orders:
        cid = order['customer_id']
        if cid not in customer_orders:
            customer_orders[cid] = []
        customer_orders[cid].append(order)
    # Take top 10 customers
    top_customers = sorted(customer_orders.items(), key=lambda x: len(x[1]), reverse=True)[:10]
    return [(cid, len(orders)) for cid, orders in top_customers]
Okay, yes, that is basically what I wrote. Why is that bad?
Three reasons. First: you are materializing lists and dicts. If you have a million orders, you create a million-element list and a dict with thousands of keys. Second: you sort once, then sort again over the grouped results. Third: the code is 20 lines and three separate loops. Now watch this:
from itertools import chain, islice, groupby
from operator import itemgetter

def build_customer_report(source_a, source_b):
    orders = chain(source_a, source_b)  # Merge lazily
    orders = sorted(orders, key=itemgetter('customer_id'))  # Sort by customer_id
    report = []
    for customer_id, group_iter in groupby(orders, key=itemgetter('customer_id')):
        count = sum(1 for _ in group_iter)  # Count orders in group
        report.append((customer_id, count))
    # Sort by count descending, take top 10
    report.sort(key=itemgetter(1), reverse=True)
    return list(islice(report, 10))
That is... cleaner? But I still do not see the lazy part. You still sort at some point.
True — sorting needs all the data. But before and after the sort, you are using lazy operations. chain merges the sources without creating an intermediate list. groupby groups without building a dict. islice takes the top 10 and stops rather than walking the whole result. Each operation is a small, focused piece.
What is chain exactly? Does it just concatenate the lists?
chain(source_a, source_b) returns an iterator that yields from source_a, then yields from source_b. No concatenation, no new list. It just remembers which source it is in and yields items one by one.
from itertools import chain

source_a = [1, 2, 3]
source_b = [4, 5, 6]
result = chain(source_a, source_b)
print(list(result))  # [1, 2, 3, 4, 5, 6]

# The iterator is lazy — it does not create [1, 2, 3, 4, 5, 6] upfront
# It yields 1, then 2, then 3, then 4, then 5, then 6
for item in chain(source_a, source_b):
    print(item)  # Prints one at a time
And groupby? You said it does not create a dict.
groupby is the most powerful and the most misunderstood. It requires input sorted by the grouping key. Then it yields (key, group_iterator) tuples. The group_iterator is lazily generated — it yields items from that group only. You do not materialize the entire group.
from itertools import groupby
from operator import itemgetter

orders = [
    {'customer_id': 'C001', 'amount': 50},
    {'customer_id': 'C001', 'amount': 75},
    {'customer_id': 'C002', 'amount': 100},
]

# groupby requires sorted input
for customer_id, group_iter in groupby(orders, key=itemgetter('customer_id')):
    print(f'Customer: {customer_id}')
    for order in group_iter:  # group_iter is an iterator for this customer
        print(f'  Order: {order["amount"]}')
Wait — group_iter is an iterator, not a list? So if I do not iterate over it, the next customer group starts?
Exactly. groupby assumes you consume each group before moving to the next key. If you skip the group_iter, you skip the entire group. This is the gotcha:
groups = {}  # Do NOT try to do this:
for customer_id, group_iter in groupby(orders, key=itemgetter('customer_id')):
    # If you do not iterate over group_iter here, it is invalidated
    # as soon as groupby advances to the next key.
    groups[customer_id] = group_iter  # This is broken

# Later:
for customer_id, group in groups.items():
    list(group)  # Empty! The iterator was already consumed when groupby moved on
So you cannot store the group_iter?
Not for later use. You can either:
- Consume it immediately (like we did with sum(1 for _ in group_iter))
- Convert it to a list: groups[customer_id] = list(group_iter)
But the whole point of groupby is to avoid creating those intermediate lists. If you convert to list, you lose the efficiency.
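If you genuinely need the groups around after the loop, the safe pattern is to materialize each group while groupby is still positioned on it. A minimal sketch, using the same made-up orders as above:

```python
from itertools import groupby
from operator import itemgetter

orders = [
    {'customer_id': 'C001', 'amount': 50},
    {'customer_id': 'C001', 'amount': 75},
    {'customer_id': 'C002', 'amount': 100},
]

# Safe: list(group) runs *before* groupby advances to the next key,
# so every group is captured intact.
groups = {
    cid: list(group)
    for cid, group in groupby(orders, key=itemgetter('customer_id'))
}
print(len(groups['C001']))           # 2
print(groups['C002'][0]['amount'])   # 100
```

You pay the memory cost of the lists, but only for the groups you actually keep.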
Okay, so chain merges sources lazily, and groupby groups without materializing groups. What about islice?
islice takes start, stop, and step from an iterator. It is like list slicing, but for iterators, and it does not support negative indices:
from itertools import islice

data = range(100)  # Imagine this is a million-item iterator

# Get items 5 through 9 (the stop index is exclusive)
slice_iter = islice(data, 5, 10)
print(list(slice_iter))  # [5, 6, 7, 8, 9]

# Get first 10 items
first_ten = islice(data, 10)
print(list(first_ten))  # [0, 1, 2, ..., 9]

# Get every 2nd item starting at 0
every_other = islice(data, 0, 100, 2)
print(list(every_other))  # [0, 2, 4, 6, ...]
So you can take top N items without iterating over all of them? Like, if the iterator has a million items and you do islice(iterator, 10), it only yields 10?
Exactly. islice stops after 10. The remaining 999,990 items are never computed. If your data source is a lazy database cursor or a paginated API, you only fetch the rows you actually consume, not all of them.
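You can verify the early stop yourself with a generator that records every item it actually produces (the `expensive_source` name and the `produced` counter are just for illustration):

```python
from itertools import islice

produced = []

def expensive_source():
    # Simulates a source where each item is costly to compute.
    for n in range(1_000_000):
        produced.append(n)  # record that this item was actually generated
        yield n

top_ten = list(islice(expensive_source(), 10))
print(top_ten)        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(produced))  # 10 -- the other 999,990 items were never computed
```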
Okay, this is the thing I kept rebuilding. I always did result[:10] on a list. But if the source is huge, you want islice to stop early.
You have found the core insight. itertools is about stopping early, processing once, and not materializing intermediate results.
What about takewhile and dropwhile?
takewhile yields items while a condition is true, then stops:
from itertools import takewhile
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
result = takewhile(lambda x: x < 5, data) # Yields 1, 2, 3, 4, then stops
print(list(result)) # [1, 2, 3, 4]
dropwhile is the opposite — it skips items while a condition is true, then yields the rest:
from itertools import dropwhile
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
result = dropwhile(lambda x: x < 5, data) # Skips 1, 2, 3, 4, then yields 5, 6, 7, 8, 9
print(list(result)) # [5, 6, 7, 8, 9]
So takewhile and dropwhile are for stopping or skipping based on a condition?
Yes. If your data is sorted and you want to take everything up to the first item that does not match, takewhile. If you want to skip a header section and then process the rest, dropwhile.
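The header-skipping case looks like this (the file contents here are invented):

```python
from itertools import dropwhile

# Hypothetical report file: comment header, then data rows.
lines = [
    '# generated 2024-01-01',
    '# source: orders',
    'C001,50',
    'C002,100',
]

# Skip leading comment lines, then yield everything that follows.
data_rows = dropwhile(lambda line: line.startswith('#'), lines)
print(list(data_rows))  # ['C001,50', 'C002,100']
```

Note that dropwhile only skips the *leading* run of matches. Once a line fails the condition, everything after it is yielded, even if a later line starts with '#'.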
And product and combinations?
These are for generating combinations of items:
from itertools import product, combinations

# product: the Cartesian product, every pairing across the input iterables
colors = ['red', 'blue']
sizes = ['S', 'M', 'L']
for combo in product(colors, sizes):
    print(combo)
# ('red', 'S'), ('red', 'M'), ('red', 'L'), ('blue', 'S'), ('blue', 'M'), ('blue', 'L')

# combinations: all unique subsets of length r
items = [1, 2, 3, 4]
for combo in combinations(items, 2):
    print(combo)
# (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)
So product is the Cartesian product? Like, every item from the first list paired with every item from the second?
Yes. And combinations is every unique pair (or triple, or n-tuple) from a list, without replacement and without regard to order. (1, 2) and (2, 1) are considered the same, so you only get (1, 2).
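If you do want the ordered pairs, itertools.permutations is the counterpart. A quick side-by-side:

```python
from itertools import combinations, permutations

items = [1, 2, 3]

# combinations: order does not matter, so (2, 1) never appears.
print(list(combinations(items, 2)))
# [(1, 2), (1, 3), (2, 3)]

# permutations: order matters, so both (1, 2) and (2, 1) appear.
print(list(permutations(items, 2)))
# [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]
```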
I have definitely written product logic before. Nested for-loops checking every combination.
Everyone has. itertools.product does it for you, lazily.
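A quick sanity check that one product call really is those nested loops, using made-up colors and sizes:

```python
from itertools import product

colors = ['red', 'blue']
sizes = ['S', 'M']

# The hand-written nested loops...
pairs_loops = []
for color in colors:
    for size in sizes:
        pairs_loops.append((color, size))

# ...produce exactly the same pairs, in the same order, as product.
pairs_product = list(product(colors, sizes))
print(pairs_loops == pairs_product)  # True
```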
All right. So the workflow is: use chain to merge sources, use groupby to group (after sorting), use islice to take top N, use product and combinations to generate all options. And they are all lazy.
That is the workflow. And they chain together:
from itertools import chain, islice, groupby
from operator import itemgetter

# Merge two sources, group by customer_id, take top 10 customers
orders = chain(source_a, source_b)
sorted_orders = sorted(orders, key=itemgetter('customer_id'))
grouped = groupby(sorted_orders, key=itemgetter('customer_id'))

# Build report with counts
report = (
    (customer_id, sum(1 for _ in group))
    for customer_id, group in grouped
)

# Sort by count, take top 10
top_customers = sorted(report, key=itemgetter(1), reverse=True)
result = list(islice(top_customers, 10))
This code is:
- Small and clear
- Lazy before the sorting steps
- No intermediate dicts or lists except where necessary
- Uses library functions, not custom logic
I am going to build the report function using itertools. Let me use chain, sorted, groupby, islice, and maybe takewhile or dropwhile to filter.
Perfect. That is the pattern. Tomorrow we move to functools — singledispatch and cache for memoization. functools is about specializing functions and caching results. Today you are building pipelines. Tomorrow you are tuning them.
I wrote these pipelines by hand three times. I should have just imported itertools the first time.
Now you know. Stop reinventing. The standard library is your friend.