A 30-second script can crash and you re-run; no big deal. A 30-minute script that crashes at minute 25 is painful. The fix is checkpointing — after each item, write its status somewhere durable. On a re-run, read the checkpoints and skip what's already done.
For today's lesson the status store is a Google Sheet. The pattern stays the same with any persistent store (database, Redis, file).
```python
items = ["a", "b", "c", "d", "e"]

# Discover spreadsheet
ss = toolset.execute_action(Action.GOOGLESHEETS_SEARCH_SPREADSHEETS, {"query": ""})
files = ss.get("files", []) or ss.get("response_data", {}).get("files", [])
spreadsheet_id = files[0]["id"]

for item in items:
    # ... do the work for `item` ...
    # After success, checkpoint the status
    toolset.execute_action(Action.GOOGLESHEETS_SHEET_APPEND_GOOGLE_SHEET_ROW, {
        "spreadsheet_id": spreadsheet_id,
        "sheet_name": "Sheet1",
        "values": [[item, "done"]],
    })

print("all 5 processed")
```

If the script dies after item 3, the Sheet has rows for a, b, c. On re-run you read the existing rows, build a `seen` set, and skip them — the script picks up at d.
This sounds like dedup with extra steps?
Dedup with persistent state. In-memory dedup is gone after a crash. Sheet-backed status survives. Same logical pattern; different durability.
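To make the durability difference concrete, here is a minimal runnable sketch using a plain JSON file as the checkpoint store instead of a Sheet. The file name and helper functions are illustrative, not part of the lesson's toolset; the two `run()` calls simulate a crashed run followed by a resumed run.

```python
import json
import os

CHECKPOINT_FILE = "status.json"  # illustrative file name

def load_done() -> set:
    """Read the durable checkpoint file; empty set on a first run."""
    if not os.path.exists(CHECKPOINT_FILE):
        return set()
    with open(CHECKPOINT_FILE) as f:
        return {k for k, v in json.load(f).items() if v == "done"}

def write_status(item: str, status: str) -> None:
    """Record one item's status and flush it to disk."""
    statuses = {}
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            statuses = json.load(f)
    statuses[item] = status
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(statuses, f)

def run(items, fail_after=None):
    """Process items; optionally crash after `fail_after` items."""
    done = load_done()
    processed = []
    for item in items:
        if item in done:
            continue                    # checkpointed on a previous run
        if fail_after is not None and len(processed) >= fail_after:
            raise RuntimeError("simulated crash")
        processed.append(item)          # stand-in for the real work
        write_status(item, "done")      # checkpoint after success
    return processed

items = ["a", "b", "c", "d", "e"]
try:
    run(items, fail_after=3)            # first run dies after a, b, c
except RuntimeError:
    pass
resumed = run(items)                    # second run picks up at d
print(resumed)                          # → ['d', 'e']
os.remove(CHECKPOINT_FILE)              # clean up the demo file
```

Because the status lives in a file rather than in memory, the second `run()` sees what the first one finished even though the first one crashed.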
What about partial work mid-item? If the script dies mid-process for item d, what happens?
That's the next layer — step-level checkpointing. Make process(item) itself idempotent (so re-running it is safe) or break it into sub-steps each with their own checkpoint. For today's lesson: item-level checkpointing is enough. The point of the lesson is the durable-status pattern.
Long-running automations face two failure modes: a crash partway through that loses all progress, and a naive re-run that repeats work already done. Without checkpoints, any failure means starting from the beginning. With checkpoints, you resume from the last-known-good item.
```python
for item in items:
    # 1. Skip already-done
    if item in done_set:
        continue
    # 2. Do the work
    process(item)
    # 3. Checkpoint *after* success
    write_status(item, "done")
```

Three pieces. Notice the order: checkpoint after success. If you mark before, a crash mid-process leaves the item recorded as done while the work didn't complete — silent data loss.
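A short simulation makes the failure visible (the dicts and helper are illustrative stand-ins for the durable store and the real work). The status is written before the work; a crash in between means the re-run skips an item whose work never ran:

```python
status = {}   # stand-in for the durable status store
output = []   # stand-in for the real work's output

def run_mark_before(items, crash_on=None):
    for item in items:
        if status.get(item) == "done":
            continue
        status[item] = "done"        # WRONG: checkpoint before success
        if item == crash_on:
            raise RuntimeError("simulated crash")
        output.append(item)          # the actual work

try:
    run_mark_before(["a", "b", "c"], crash_on="b")
except RuntimeError:
    pass
run_mark_before(["a", "b", "c"])     # re-run skips "b": its work is lost
print(output)                        # → ['a', 'c']; "b" silently missing
```

Flip the two lines inside the loop (work first, then checkpoint) and the worst case becomes one repeated item instead of one lost item, which is why the work itself should be idempotent.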
```python
# Read all completed items into a set
rows = read_sheet_rows()
done_set = {row[0] for row in rows if row[1] == "done"}

# Now process only the not-yet-done
for item in items:
    if item in done_set:
        log("info", "skip", item=item, reason="already_done")
        continue
    process(item)
    write_status(item, "done")
```

A status column gives you more than "done". Common states:
| State | Meaning |
|---|---|
| pending | not yet processed |
| in_progress | started, not yet finished (set on entry, replaced on completion) |
| done | succeeded |
| failed | tried, raised; usually with an error column |
| skipped | deliberately not processed |
With in_progress you can detect crashed-mid-item — those rows are stuck in flight, telling you which item the previous run crashed on.
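A sketch of that detection, with a dict standing in for the durable store (the Sheet's status column in today's setup):

```python
status = {}  # stand-in for the durable store

def run(items, crash_on=None):
    for item in items:
        if status.get(item) == "done":
            continue
        status[item] = "in_progress"     # set on entry
        if item == crash_on:
            raise RuntimeError("simulated crash")
        # ... the real work would happen here ...
        status[item] = "done"            # replaced on completion

try:
    run(["a", "b", "c"], crash_on="b")
except RuntimeError:
    pass

# On the next run, rows still in flight show where the crash happened
stuck = [item for item, s in status.items() if s == "in_progress"]
print(stuck)  # → ['b']
```

What to do with a stuck row is a policy choice: retry it (safe if the work is idempotent), mark it failed for manual review, or retry only after some timeout.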
For higher-throughput systems (thousands per second), reach for a real queue (SQS, Redis Streams, RabbitMQ, Postgres SELECT FOR UPDATE SKIP LOCKED). Same pattern, different store.
If each process(item) is itself a multi-step operation (read → transform → write), you can checkpoint within the item:

```python
def process(item):
    if not is_step_done(item, "read"):
        data = read(item)
        save_intermediate(item, data)   # persist the intermediate too
        mark_step_done(item, "read")
    else:
        data = load_intermediate(item)  # resuming past the read step
    if not is_step_done(item, "write"):
        write(transform(data))
        mark_step_done(item, "write")
```

This is the production-grade shape. (Note that a resumed run skipping the read step still needs `data`, so the intermediate must be persisted as well; `save_intermediate` and `load_intermediate` are illustrative names.) For today's lesson: item-level is enough.