Processing Files
Processing files line by line — stripping whitespace, splitting CSV-like data, and building data structures from raw text.
Yesterday I parsed a simple file — name and score on each line. But my real spreadsheet exports have headers, multiple columns, sometimes blank rows... How do I handle all that?
Real file processing follows a consistent pattern: read, split, clean, structure. Once you internalize this, you can handle any text-based data file. Let's work through a CSV — the most common data format you'll encounter in the real world:
name,math,science,english
Alice,92,88,95
Bob,85,72,80
Charlie,67,71,60
First line is the header — column names. Everything after is data. Here's the pattern:
content = zuzu.files.read("grades.csv")
lines = content.strip().split("\n")
header = lines[0].split(",") # ["name", "math", "science", "english"]
data_lines = lines[1:] # Everything after the header
students = []
for line in data_lines:
    values = line.split(",")
    student = {
        "name": values[0],
        "math": int(values[1]),
        "science": int(values[2]),
        "english": int(values[3])
    }
    students.append(student)
Now students is a list of dictionaries — the exact structure you built by hand in Week 3. Except now the data came from a file.
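Once the data lands in that shape, everything you know about lists of dictionaries applies. A quick sketch, with the sample data inlined instead of read from a file:

```python
students = [
    {"name": "Alice", "math": 92, "science": 88, "english": 95},
    {"name": "Bob", "math": 85, "science": 72, "english": 80},
    {"name": "Charlie", "math": 67, "science": 71, "english": 60},
]

# Who has the highest math score?
best = max(students, key=lambda s: s["math"])

# Everyone who scored at least 70 in science
passed = [s["name"] for s in students if s["science"] >= 70]

print(best["name"])  # Alice
print(passed)        # ['Alice', 'Bob', 'Charlie']
```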
That's a lot of manual indexing. What if the file has 20 columns? I'm not writing values[0] through values[19].
Good instinct. Use zip() to pair headers with values automatically:
header = lines[0].split(",")
for line in lines[1:]:
    values = line.split(",")
    row = dict(zip(header, values))
    # row = {"name": "Alice", "math": "92", "science": "88", "english": "95"}
zip(header, values) pairs each column name with its value, and dict() turns those pairs into a dictionary. Works for 3 columns or 30 — the code doesn't change.
One catch: all values come out as strings. Convert numeric fields yourself:
for key in row:
    if key != "name":
        row[key] = int(row[key])
What about messy data? Extra spaces, blank lines, the kind of thing that's always in real exports?
Clean as you go. .strip() is your best friend for file processing:
lines = content.strip().split("\n")
for line in lines:
    line = line.strip()  # Remove leading/trailing whitespace
    if not line:         # Skip blank lines
        continue
    values = [v.strip() for v in line.split(",")]  # Clean each value
That list comprehension — [v.strip() for v in line.split(",")] — is the standard pattern for splitting and cleaning in one step. You'll write it hundreds of times in your career.
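Putting the cleaning steps together on a deliberately messy sample (stray spaces, a blank line), the whole loop might look like this:

```python
# Hypothetical messy export: extra spaces and an empty line in the middle
content = "name, math ,science,english\nAlice , 92, 88 ,95\n\n Bob,85 ,72,80\n"

lines = content.strip().split("\n")
header = [h.strip() for h in lines[0].split(",")]

rows = []
for line in lines[1:]:
    line = line.strip()
    if not line:  # Skip blank lines
        continue
    values = [v.strip() for v in line.split(",")]
    rows.append(dict(zip(header, values)))

print(len(rows))         # 2 (the blank line was skipped)
print(rows[1]["name"])   # Bob
```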
So I could read a file, process it with a loop, filter with a dictionary, and write the results? That's basically what I've been doing manually in Excel for three years.
That's exactly right. And not every file needs to become a list of dicts. Sometimes you process on the fly — counting or summing without storing everything:
total = 0
count = 0
for line in lines[1:]:  # Skip header
    values = line.strip().split(",")
    score = int(values[1])
    total += score
    count += 1
average = round(total / count, 1)
This is memory-efficient because you never build the full list of dicts. Pair it with line-by-line file reading and it scales to files with millions of rows.
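The same streaming idea works for any running statistic. A sketch with the lines inlined as a list (iterating over a real file's lines looks identical):

```python
lines = [
    "name,math,science,english",
    "Alice,92,88,95",
    "Bob,85,72,80",
    "Charlie,67,71,60",
]

total = 0
count = 0
best = None  # Highest math score seen so far

for line in lines[1:]:  # Skip header
    score = int(line.split(",")[1])
    total += score
    count += 1
    if best is None or score > best:
        best = score

average = round(total / count, 1)
print(average, best)  # 81.3 92
```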
Can I combine this with the set and dictionary techniques from earlier this week?
Absolutely — and that's where it all comes together. Read a file, build a dictionary of lists, filter with comprehensions, aggregate with counting patterns:
# Build a dict: subject -> list of scores
subjects = header[1:]  # ["math", "science", "english"]
subject_scores = {}
for line in lines[1:]:
    values = line.strip().split(",")
    for i, subject in enumerate(subjects):
        score = int(values[i + 1])
        if subject not in subject_scores:
            subject_scores[subject] = []
        subject_scores[subject].append(score)

# Now analyze: average per subject
for subject, scores in subject_scores.items():
    avg = round(sum(scores) / len(scores), 1)
    print(f"{subject}: {avg}")
File to data structure to analysis. Every skill from the last three weeks — strings, loops, lists, dictionaries, sets — feeding into this one pipeline. Let's practice it.
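For reference, here is the whole pipeline end to end on inlined sample data. The first line stands in for the file read; with a real file you'd get `content` from your file-reading call instead. This sketch also uses dict.setdefault(), a compact alternative to the "if key not in dict" check:

```python
content = "name,math,science,english\nAlice,92,88,95\nBob,85,72,80\nCharlie,67,71,60"

lines = content.strip().split("\n")
header = [h.strip() for h in lines[0].split(",")]
subjects = header[1:]

# Build a dict: subject -> list of scores
subject_scores = {}
for line in lines[1:]:
    values = [v.strip() for v in line.split(",")]
    for i, subject in enumerate(subjects):
        # setdefault creates the empty list the first time a subject appears
        subject_scores.setdefault(subject, []).append(int(values[i + 1]))

# Average per subject
averages = {
    subject: round(sum(scores) / len(scores), 1)
    for subject, scores in subject_scores.items()
}

print(averages)  # {'math': 81.3, 'science': 77.0, 'english': 78.3}
```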