Day 6 · ~18m

Python Generators and yield: Process Large Data Without Running Out of Memory

Learn what a Python generator is, how yield pauses and resumes a function, and when to use yield instead of return to process large datasets one item at a time.

student (focused)

I'm pretty happy with my Day 3 comprehension. Amir saw it and didn't rewrite it, which I'm counting as a win.

teacher (neutral)

Fair metric. Let me ask you something. That comprehension pulls paid orders from a list in memory. What's the list size in production right now?

student (thinking)

I mean... we have maybe 50,000 orders in the table?

teacher (serious)

Okay. Your manager asks you to run that same filter on the full order history export — 5 million rows. What happens when you build [order for order in orders if order["status"] == "paid"] on 5 million dicts?

student (curious)

It... builds a list with all of them in memory at once.

teacher (focused)

Right. Depending on the size of each dict, you're looking at gigabytes sitting in RAM while the list builds. On a small container — the kind Railway runs on — that's a crash. Your process gets killed by the OS before it finishes.

student (surprised)

I didn't think about that. The comprehension is building the whole result before I can even look at the first item.

teacher (neutral)

That's the tradeoff you didn't have to think about with 50,000 orders. The comprehension is like a prep cook who makes the entire batch before service starts. Useful when the batch is manageable. When it's 5 million — you need a different model.

student (curious)

What's the different model?

teacher (focused)

A generator. Instead of building the whole list upfront, a generator gives you one item at a time, on demand. The prep analogy: think of a chef in a small kitchen who makes one dish as each order arrives, instead of prepping everything before the doors open. The kitchen never gets overwhelmed because there's never more than one dish in progress at a time.

student (thinking)

Okay. But how does Python actually do that? If I call a function, it runs to completion and returns something. How does it give me one thing, stop, and wait?

teacher (serious)

That's the right question, and the answer is yield. Let me show you the simplest possible case first:

def count_up(n):
    i = 0
    while i < n:
        yield i
        i += 1

Call it the same way you'd call any function:

counter = count_up(3)
print(next(counter))  # 0
print(next(counter))  # 1
print(next(counter))  # 2

student (confused)

Wait. It doesn't return anything? It just... yields?

teacher (encouraging)

Here's the mental model: when Python hits yield, the function pauses. It hands you the value and freezes in place — variables, loop state, everything. When you ask for the next item, it resumes from exactly where it stopped. return ends the function. yield pauses it and saves its spot.
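
You can watch the pause happen. Here's a sketch with prints wrapped around the yield, plain Python, nothing else assumed:

```python
def chatty(n):
    i = 0
    while i < n:
        print(f"about to yield {i}")
        yield i                        # pauses here; i and the loop survive
        print(f"resumed after {i}")
        i += 1

gen = chatty(2)
next(gen)   # prints "about to yield 0", then freezes at the yield
next(gen)   # prints "resumed after 0", then "about to yield 1"
```

The second next() picks up on the line after the yield, not at the top of the function.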

student (struggling)

So the function is still alive? Between the yields?

teacher (focused)

Exactly. The function is suspended, not finished. Its local state — the value of i, where the loop is — all of that is preserved. Every call to next() wakes it up, runs until the next yield, and pauses again.

student (excited)

Oh. OH. So nothing runs until I ask for it? It's lazy.

teacher (excited)

Lazy evaluation — yes. When you write counter = count_up(3), nothing inside the function has run yet. Not a single line. The function body only executes as you call next(). That's why it can handle 5 million items without allocating 5 million items in memory — it only ever has one in flight at a time.
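
One way to convince yourself: put a print on the very first line of a generator and watch when it fires. A minimal sketch:

```python
def shout():
    print("body started!")
    yield "hello"

g = shout()        # nothing prints: the body hasn't run at all yet
value = next(g)    # NOW "body started!" prints, and value is "hello"
```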

student (focused)

So if I loop over a generator with for...

teacher (neutral)

Python calls next() for you behind the scenes on each iteration, and stops cleanly when the generator runs out of items. You almost never call next() manually:

for count in count_up(3):
    print(count)
# 0
# 1
# 2
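
"Stops cleanly" has a concrete mechanism, by the way: an exhausted generator raises StopIteration, and the for loop catches it for you. A sketch, reusing count_up from above:

```python
def count_up(n):        # same generator as before
    i = 0
    while i < n:
        yield i
        i += 1

counter = count_up(1)
print(next(counter))    # 0
try:
    next(counter)       # nothing left to yield
except StopIteration:   # this is the signal `for` swallows to end the loop
    print("exhausted")
```
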

student (thinking)

But I still have to be careful. If I do list(count_up(1_000_000)) I'm back to building a giant list in memory, right?

teacher (surprised)

I was going to bring that up as a gotcha. You just pre-empted it. Yes — converting a generator to a list defeats the whole point if the data is huge. Generators pay off when you process items one at a time and never need them all at once.
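
You can measure the difference with sys.getsizeof. Exact byte counts vary by Python version, but the shape of the comparison holds:

```python
import sys

def count_up(n):
    i = 0
    while i < n:
        yield i
        i += 1

gen = count_up(1_000_000)
materialized = list(range(1_000_000))   # what list(...) forces into existence

print(sys.getsizeof(gen))           # a few hundred bytes, no matter how big n is
print(sys.getsizeof(materialized))  # megabytes of list overhead alone
```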

student (amused)

So it's not a magic memory button. It's a "don't hoard the groceries" button.

teacher (amused)

That's going on a slide somewhere.

student (curious)

Okay. Let me try this on the actual problem — discount orders as they come in, without building the whole discounted list first.

teacher (focused)

Good. Here's a list comprehension version first, so you can see the side-by-side:

orders = [
    {"id": 101, "customer": "Alice Chen",   "total": 200.00, "status": "paid"},
    {"id": 102, "customer": "Bob Kumar",    "total": 80.00,  "status": "pending"},
    {"id": 103, "customer": "Carol Santos",  "total": 500.00, "status": "paid"},
]

# Comprehension — builds ALL discounted orders in memory at once:
discounted = [
    {**order, "total": round(order["total"] * (1 - 10/100), 2)}
    for order in orders
]

student (focused)

And the generator version — use yield instead of building a list?

teacher (neutral)

Exactly. Turn the comprehension inside-out into a function:

def generate_discounted_orders(orders, discount_pct):
    for order in orders:
        discounted_total = round(order["total"] * (1 - discount_pct / 100), 2)
        yield {**order, "total": discounted_total}

Now call it:

for discounted_order in generate_discounted_orders(orders, 10):
    print(discounted_order)
# {'id': 101, 'customer': 'Alice Chen', 'total': 180.0, 'status': 'paid'}
# {'id': 102, 'customer': 'Bob Kumar', 'total': 72.0, 'status': 'pending'}
# {'id': 103, 'customer': 'Carol Santos', 'total': 450.0, 'status': 'paid'}

student (thinking)

And with 5 million orders, this only ever has one order processed at a time while I iterate?

teacher (encouraging)

One order in memory at a time, yes. The rest are still sitting in whatever source you're reading from — a database cursor, a file, a network stream. The generator never asks for the next one until you need it.
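
File objects are a natural fit here because they're already lazy. This sketch assumes a made-up file layout, one "id,status" pair per line, just to show the shape:

```python
def paid_order_ids(path):
    # Hypothetical layout: each line is "order_id,status".
    with open(path) as f:
        for line in f:                # the file hands over one line at a time
            order_id, status = line.strip().split(",")
            if status == "paid":
                yield order_id        # still only one line in memory
```

Chain this into the discount generator and you have a pipeline that streams from disk to output without ever holding the whole export.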

student (curious)

I keep seeing yield from in Amir's code. Is that related?

teacher (focused)

Related but more advanced — it's for delegating from one generator to another, which gets into combining pipelines. Save that one. What matters now is the core pattern: yield pauses, resumes, and keeps state. That's the whole mechanism.

student (thinking)

One thing is nagging me. You said generators are useful for huge datasets. But our sprint ticket this week is for 12 orders in a batch job. Should I even bother?

teacher (serious)

A lot of people think generators are just for huge datasets. Actually, they're useful any time you want to process items one at a time — even for 10 items. If the processing step is expensive — an API call, a database write, a calculation you might short-circuit early — a generator lets you stop as soon as you have what you need. No point computing item 11 if you only needed 3.

student (curious)

Short-circuit — like if I only want the first paid order, I break out of the loop and the rest never process?

teacher (encouraging)

Exactly. With a list comprehension, everything processes before you even start iterating. With a generator, you control exactly how many items you consume. break out of the loop, and the remaining items are never touched.
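
Here's that control in action. The check is a stand-in for real validation, and the counter shows how much work the break saves:

```python
def validate(orders):
    for order in orders:
        # stand-in for an expensive check (API call, DB lookup, ...)
        yield order, order["total"] > 0

orders = [{"id": n, "total": 10.0} for n in range(1, 201)]
orders[2]["total"] = -5.0          # plant a bad order at position 3

checked = 0
for order, ok in validate(orders):
    checked += 1
    if not ok:
        print(f"first bad order: {order['id']}")   # first bad order: 3
        break

print(checked)   # 3 -- orders 4 through 200 were never touched
```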

student (excited)

That's actually useful for that order validation job. We scan through orders looking for the first one that fails — we don't need to validate all 200 after we find the problem.

teacher (proud)

That's a real-world use case, and you just derived it yourself.

student (focused)

Okay, one more thing before I try the challenge. The {**order, "total": discounted_total} syntax inside the yield — that's a dict spread? Copies all keys from order and overwrites total?

teacher (neutral)

Right. **order unpacks all key-value pairs, and "total": discounted_total overrides the total key. The original order dict is unchanged — you're yielding a new dict with the updated total.
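
Two lines prove the non-mutation, if you want to see it:

```python
order = {"id": 101, "total": 200.0, "status": "paid"}
updated = {**order, "total": 180.0}

print(updated)   # {'id': 101, 'total': 180.0, 'status': 'paid'}
print(order)     # {'id': 101, 'total': 200.0, 'status': 'paid'} -- untouched
```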

student (excited)

Clean. And it doesn't mutate the original orders list.

teacher (serious)

Which matters when the same orders list is being read by something else at the same time. Generators that yield new objects instead of mutating inputs are much safer in concurrent code — but that's a Week 3 conversation.

student (curious)

Before I go — what's next? Because I feel like there's a one-line version of this somewhere.

teacher (excited)

Day 7: generator expressions. Everything you can do with a comprehension, you can write as a generator in one line with parentheses instead of brackets. (order for order in orders if order["status"] == "paid") — that's a generator, not a list. Zero extra memory for the full result. You already know the hard part.