Building a Data Pipeline: Read CSV → Clean → Process → Write Report
Assemble 5 days of work into one function: read CSV, parse, clean, group by region, write report. The payoff.
28 days. We have loops, dicts, file I/O, string methods, filtering. That's... a lot.
That's a data engineer. You have every piece. Today they snap together.
Snap into what?
Your boss lands a CSV on your desk Monday morning. 6 sales records. Needs a report by lunch: how much each region sold, who closed each deal, status of each one. You have two hours.
...we wrote a CSV reader. Day 25. parse_csv().
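A minimal sketch of what that Day 25 reader might look like. The exact signature is an assumption; your version may be hand-rolled rather than built on the standard library's csv module:

```python
import csv

def parse_csv(path):
    """Read a CSV file into a list of dicts, one per row (illustrative sketch)."""
    with open(path, newline='') as f:
        return list(csv.DictReader(f))
```

Note that every value comes back as a string; cleaning them up is the next stage's job.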
Keep going.
Then we learned to clean messy data. Day 26: clean_record(). Title-cases names, converts amounts to floats, uppercases regions.
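Something like this, roughly. The field names come from the transcript; the exact cleaning rules are an assumption about what Day 26's version did:

```python
def clean_record(record):
    """Normalize one raw CSV row (illustrative sketch of the Day 26 cleaner)."""
    return {
        'name': record['name'].strip().title(),     # "  ana lopez " -> "Ana Lopez"
        'amount': float(record['amount']),          # "1200.50" -> 1200.5
        'region': record['region'].strip().upper(), # "west" -> "WEST"
        'status': record.get('status', '').strip(),
    }
```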
And?
We grouped records by a key back in... Day 21? That loop pattern where you build a dict of lists and append to each one.
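That Day 21 pattern, in isolation (the sample records here are made up for illustration):

```python
records = [
    {'name': 'Ana', 'region': 'WEST'},
    {'name': 'Bob', 'region': 'EAST'},
    {'name': 'Cy',  'region': 'WEST'},
]

by_region = {}
for record in records:
    region = record['region']
    if region not in by_region:
        by_region[region] = []  # first time we've seen this region
    by_region[region].append(record)

# by_region now maps 'WEST' to two records and 'EAST' to one
```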
You just laid out your entire pipeline. parse_csv, clean each record, group by region, write the report.
def run_pipeline(input_file, output_file):
    records = parse_csv(input_file)
    cleaned = [clean_record(r) for r in records if r.get('name')]

    # Group by region
    by_region = {}
    for record in cleaned:
        region = record['region']
        if region not in by_region:
            by_region[region] = []
        by_region[region].append(record)

    # Write report
    with open(output_file, 'w') as f:
        for region, sales in sorted(by_region.items()):
            total = sum(s['amount'] for s in sales)
            f.write(f"=== {region} === Total: ${total:,.2f}\n")
            for s in sales:
                f.write(f"  {s['name']}: ${s['amount']:,.2f} ({s['status']})\n")
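To watch the whole thing run end to end, here is a self-contained sketch. The parse_csv and clean_record bodies are stand-ins I'm assuming for the Day 25 and Day 26 versions; swap in your own, and the sample CSV rows are invented:

```python
import csv

def parse_csv(path):
    # Stand-in for the Day 25 reader (assumption)
    with open(path, newline='') as f:
        return list(csv.DictReader(f))

def clean_record(record):
    # Stand-in for the Day 26 cleaner (assumption)
    return {
        'name': record['name'].strip().title(),
        'amount': float(record['amount']),
        'region': record['region'].strip().upper(),
        'status': record.get('status', '').strip(),
    }

def run_pipeline(input_file, output_file):
    records = parse_csv(input_file)
    cleaned = [clean_record(r) for r in records if r.get('name')]
    by_region = {}
    for record in cleaned:
        region = record['region']
        if region not in by_region:
            by_region[region] = []
        by_region[region].append(record)
    with open(output_file, 'w') as f:
        for region, sales in sorted(by_region.items()):
            total = sum(s['amount'] for s in sales)
            f.write(f"=== {region} === Total: ${total:,.2f}\n")
            for s in sales:
                f.write(f"  {s['name']}: ${s['amount']:,.2f} ({s['status']})\n")

# Two messy sample rows, invented for the demo
with open('sales.csv', 'w') as f:
    f.write("name,amount,region,status\n"
            " ana lopez ,1200.50,west,closed\n"
            "bob kim,800,east,pending\n")

run_pipeline('sales.csv', 'report.txt')
print(open('report.txt').read())
```

Messy lowercase rows go in; a grouped, totaled, sorted report comes out. That's the whole Monday morning in four function calls.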
So run_pipeline() is just... calling those in order?
Write it. See if Monday morning gets solved.
I can actually do this. This is the thing that felt impossible on Day 10.
You're not the same coder you were on Day 10. Tomorrow we test what stuck—and I think you know the answer.