Day 27 · ~14m

Processing Files

Processing files line by line — stripping whitespace, splitting CSV-like data, and building data structures from raw text.

🧑‍💻

Yesterday I parsed a simple file — name and score on each line. But my real spreadsheet exports have headers, multiple columns, sometimes blank rows... How do I handle all that?

👩‍🏫

Real file processing follows a consistent pattern: read, split, clean, structure. Once you internalize this, you can handle any text-based data file. Let's work through a CSV — the most common data format you'll encounter in the real world:

name,math,science,english
Alice,92,88,95
Bob,85,72,80
Charlie,67,71,60

First line is the header — column names. Everything after is data. Here's the pattern:

content = zuzu.files.read("grades.csv")
lines = content.strip().split("\n")

header = lines[0].split(",")       # ["name", "math", "science", "english"]
data_lines = lines[1:]              # Everything after the header

students = []
for line in data_lines:
    values = line.split(",")
    student = {
        "name": values[0],
        "math": int(values[1]),
        "science": int(values[2]),
        "english": int(values[3])
    }
    students.append(student)

Now students is a list of dictionaries — the exact structure you built by hand in Week 3, except now the data came from a file.
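If you want to try the pattern outside this lesson, here's the same loop as a self-contained sketch — an inline string stands in for zuzu.files.read, which only exists on this platform:

```python
# Inline sample standing in for zuzu.files.read("grades.csv")
content = "name,math,science,english\nAlice,92,88,95\nBob,85,72,80\nCharlie,67,71,60"
lines = content.strip().split("\n")

students = []
for line in lines[1:]:          # skip the header row
    values = line.split(",")
    students.append({
        "name": values[0],
        "math": int(values[1]),
        "science": int(values[2]),
        "english": int(values[3]),
    })

print(students[1]["name"], students[1]["science"])  # Bob 72
```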

🧑‍💻

That's a lot of manual indexing. What if the file has 20 columns? I'm not writing values[0] through values[19].

👩‍🏫

Good instinct. Use zip() to pair headers with values automatically:

header = lines[0].split(",")
for line in lines[1:]:
    values = line.split(",")
    row = dict(zip(header, values))
    # row = {"name": "Alice", "math": "92", "science": "88", "english": "95"}

zip(header, values) pairs each column name with its value, and dict() turns those pairs into a dictionary. Works for 3 columns or 30 — the code doesn't change.

One catch: all values come out as strings. Convert numeric fields yourself:

for key in row:
    if key != "name":
        row[key] = int(row[key])
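Putting zip and the conversion together — again with an inline string in place of the platform file read — the header-driven version looks like this sketch:

```python
content = "name,math,science,english\nAlice,92,88,95\nBob,85,72,80"
lines = content.strip().split("\n")
header = lines[0].split(",")

students = []
for line in lines[1:]:
    # Pair each column name with its value, in header order
    row = dict(zip(header, line.split(",")))
    for key in row:
        if key != "name":       # every other column is numeric
            row[key] = int(row[key])
    students.append(row)

print(students[1])  # {'name': 'Bob', 'math': 85, 'science': 72, 'english': 80}
```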

🧑‍💻

What about messy data? Extra spaces, blank lines, the kind of thing that's always in real exports?

👩‍🏫

Clean as you go. .strip() is your best friend for file processing:

lines = content.strip().split("\n")
for line in lines:
    line = line.strip()          # Remove leading/trailing whitespace
    if not line:                 # Skip blank lines
        continue
    values = [v.strip() for v in line.split(",")]  # Clean each value

That list comprehension — [v.strip() for v in line.split(",")] — is the standard pattern for splitting and cleaning in one step. You'll write it hundreds of times in your career.
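Here's that cleaning pattern applied to a deliberately messy inline sample — extra spaces and a blank line, the kind of thing real exports contain:

```python
content = "name, math\nAlice , 92\n\n  Bob,85  \n"

rows = []
for line in content.strip().split("\n"):
    line = line.strip()          # remove leading/trailing whitespace
    if not line:                 # skip blank lines
        continue
    values = [v.strip() for v in line.split(",")]  # clean each value
    rows.append(values)

print(rows)  # [['name', 'math'], ['Alice', '92'], ['Bob', '85']]
```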

🧑‍💻

So I could read a file, process it with a loop, filter with a dictionary, and write the results? That's basically what I've been doing manually in Excel for three years.

👩‍🏫

That's exactly right. And not every file needs to become a list of dicts. Sometimes you process on the fly — counting or summing without storing everything:

total = 0
count = 0
for line in lines[1:]:  # Skip header
    values = line.strip().split(",")
    score = int(values[1])
    total += score
    count += 1

average = round(total / count, 1)

This streaming approach is memory-efficient — you never hold the whole dataset in memory at once. For files with millions of rows, that matters.
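Running that aggregation over the sample grades (inline string again, in place of the platform file read) gives:

```python
content = "name,math,science,english\nAlice,92,88,95\nBob,85,72,80\nCharlie,67,71,60"
lines = content.strip().split("\n")

total = 0
count = 0
for line in lines[1:]:              # skip the header row
    values = line.strip().split(",")
    total += int(values[1])         # the math column
    count += 1

average = round(total / count, 1)
print(average)  # (92 + 85 + 67) / 3 -> 81.3
```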

🧑‍💻

Can I combine this with the set and dictionary techniques from earlier this week?

👩‍🏫

Absolutely — and that's where it all comes together. Read a file, build a dictionary of lists, filter with comprehensions, aggregate with counting patterns:

# Column names after "name", taken from the header row
subjects = lines[0].split(",")[1:]   # ["math", "science", "english"]

# Build a dict: subject -> list of scores
subject_scores = {}
for line in lines[1:]:
    values = line.strip().split(",")
    for i, subject in enumerate(subjects):
        score = int(values[i + 1])
        if subject not in subject_scores:
            subject_scores[subject] = []
        subject_scores[subject].append(score)

# Now analyze: average per subject
for subject, scores in subject_scores.items():
    avg = round(sum(scores) / len(scores), 1)
    print(f"{subject}: {avg}")
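End to end against the sample file contents (inline here, since zuzu.files.read is platform-specific), the whole pipeline fits in a dozen lines:

```python
content = "name,math,science,english\nAlice,92,88,95\nBob,85,72,80\nCharlie,67,71,60"
lines = content.strip().split("\n")
subjects = lines[0].split(",")[1:]   # ["math", "science", "english"]

# subject -> list of scores
subject_scores = {}
for line in lines[1:]:
    values = line.strip().split(",")
    for i, subject in enumerate(subjects):
        if subject not in subject_scores:
            subject_scores[subject] = []
        subject_scores[subject].append(int(values[i + 1]))

# Average per subject, rounded to one decimal place
averages = {s: round(sum(v) / len(v), 1) for s, v in subject_scores.items()}
print(averages)  # {'math': 81.3, 'science': 77.0, 'english': 78.3}
```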

File to data structure to analysis. Every skill from the last three weeks — strings, loops, lists, dictionaries, sets — feeding into this one pipeline. Let's practice it.
