Day 25 · ~13m

Parsing CSV Data: Split Lines into Fields

Split CSV lines into fields and parse them into dictionaries. Learn to strip whitespace and zip headers with values.

teacher (neutral)

Yesterday you read the CSV file. You got lines. But look at one:

'Alice Chen,1250.50,West,confirmed\n'  # readlines() keeps the trailing newline

You have 7 lines of text. But each line is still just one long string. 'Alice Chen,1250.50,West,confirmed' isn't four values — it's one. You need to split it.

The Problem: Strings vs. Data

student (confused)

Wait, I have the data. It's right there in the string.

teacher (encouraging)

You have text that looks like data. But in your code, it's just a string. You can't ask: "What's the amount?" You'd have to do messy string slicing. We need structure.

# This runs, but it's fragile:
line = 'Alice Chen,1250.50,West,confirmed'
amount = line[11:18]  # ❌ breaks as soon as a name has a different length

# This is what we want:
record = {'name': 'Alice Chen', 'amount': '1250.50', 'region': 'West', 'status': 'confirmed'}
amount = record['amount']  # ✓ clear, safe

Splitting with .split(',')

The solution is the .split() method. It breaks a string on a delimiter.

line = 'Alice Chen,1250.50,West,confirmed'
fields = line.split(',')
print(fields)
# ['Alice Chen', '1250.50', 'West', 'confirmed']
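
One detail worth checking before moving on: lines that come straight from readlines() still end in a newline, and .split(',') does not remove it. The sketch below assumes a raw line shaped like the one from the file and compares splitting before and after strip().

```python
# A raw line as readlines() returns it: the newline is still attached.
raw_line = 'Alice Chen,1250.50,West,confirmed\n'

# Splitting first leaves the newline stuck to the last field.
unstripped = raw_line.split(',')
print(unstripped)
# ['Alice Chen', '1250.50', 'West', 'confirmed\n']

# Stripping the line first gives four clean fields.
fields = raw_line.strip().split(',')
print(fields)
# ['Alice Chen', '1250.50', 'West', 'confirmed']
```

That is why the parsing code in this lesson calls strip() on a line before splitting it.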

Now we have a list. But there's a problem: the header row.

student (thinking)

The first line is different. It's the header. 'name,amount,region,status'. So I read that separately?

teacher (proud)

Exactly. The first line tells you what each field means. The rest are data.

Header-Driven Parsing

Here's the pattern:

  1. Read all lines from the file
  2. The first line becomes your headers (column names)
  3. Each remaining line becomes a record (a dict mapping header → value)

def parse_csv(filepath):
    with open(filepath) as f:
        lines = f.readlines()
    
    if not lines:
        return []
    
    # First line is headers
    header_line = lines[0].strip()  # strip() removes the trailing newline
    headers = header_line.split(',')
    
    records = []
    for line in lines[1:]:  # Skip the header, process the rest
        line = line.strip()  # Clean up whitespace
        if not line:  # Skip empty lines
            continue
        values = line.split(',')
        record = dict(zip(headers, values))
        records.append(record)
    
    return records

student (amused)

Wait, zip()? What's that doing?

teacher (focused)

zip() pairs up two lists element-by-element:

headers = ['name', 'amount', 'region', 'status']
values = ['Alice Chen', '1250.50', 'West', 'confirmed']

for h, v in zip(headers, values):
    print(f"{h}: {v}")
# name: Alice Chen
# amount: 1250.50
# region: West
# status: confirmed

# zip() creates tuples, dict() converts them to key-value pairs
record = dict(zip(headers, values))
# {'name': 'Alice Chen', 'amount': '1250.50', 'region': 'West', 'status': 'confirmed'}
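
One behavior of zip() to keep in mind: it stops at the shorter of the two lists, without raising an error. The sketch below uses a hypothetical row that is missing its last field (it is not from the lesson's data file) to show what the resulting dict looks like, plus one possible length check.

```python
headers = ['name', 'amount', 'region', 'status']
# Hypothetical row that is missing its status field.
short_values = ['Alice Chen', '1250.50', 'West']

# zip() stops after three pairs, so 'status' never makes it into the dict.
record = dict(zip(headers, short_values))
print(record)
# {'name': 'Alice Chen', 'amount': '1250.50', 'region': 'West'}
print('status' in record)
# False

# One possible guard: compare the lengths before zipping.
if len(short_values) != len(headers):
    print(f"expected {len(headers)} fields, got {len(short_values)}")
# expected 4 fields, got 3
```

On Python 3.10 and later, zip(headers, short_values, strict=True) raises a ValueError for mismatched lengths instead of truncating silently.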

The Whitespace Problem

Real CSV files are messy. Look at the test data:

name,amount,region,status
Alice Chen,1250.50,West,confirmed
Bob Kumar,340.50,East,pending
...
 Eve Williams,520.00,North,confirmed

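Eve's name has a leading space. If it stays in the value, you get ' Eve Williams' (with the space), and a lookup for 'Eve Williams' finds nothing. Stripping the whole line only cleans its two ends; whitespace next to a comma in the middle of the line survives, so each field needs its own strip().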

student (excited)

So I strip each field?

teacher (proud)

Yes! After you split, strip each value:

line = ' Eve Williams,520.00,North,confirmed'
values = [v.strip() for v in line.split(',')]
# ['Eve Williams', '520.00', 'North', 'confirmed']  ✓ space is gone

Now the dict has clean data.
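
To see why the per-field strip matters, here is a small sketch that builds the same record twice, once with only the line stripped and once with every field stripped. The messy line is a hypothetical example with spaces around the commas; it is not taken from sales.csv.

```python
headers = ['name', 'amount', 'region', 'status']
# Hypothetical messy line: spaces sit next to the commas, not only at the ends.
line = 'Bob Kumar , 340.50,East ,pending\n'

# Only the line is stripped: inner whitespace stays inside the values.
raw = dict(zip(headers, line.strip().split(',')))

# Every field is stripped after the split.
clean = dict(zip(headers, [v.strip() for v in line.strip().split(',')]))

print(raw['name'] == 'Bob Kumar')    # False
print(clean['name'] == 'Bob Kumar')  # True
print(clean)
# {'name': 'Bob Kumar', 'amount': '340.50', 'region': 'East', 'status': 'pending'}
```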

Your Challenge

Write parse_csv(filepath) that:

  1. Reads the file (you can reuse read_csv_lines() from yesterday, or inline it)
  2. Uses the first line as headers
  3. For each remaining line:
    • Strip the line
    • Skip empty lines
    • Split by comma
    • Strip each field
    • Zip with headers and create a dict
  4. Return a list of dicts

Test it with sales.csv. You should get 6 records: the header row is not a record, and the empty line is skipped. Each record is a dict, so you can access record['name'], record['amount'], etc.
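
If you want to check your function without touching sales.csv, one option is to write a tiny file to a temporary directory and parse that. The sketch below is only one possible solution, and yours may differ. The sample file is a made-up three-row stand-in that reuses rows shown earlier in this lesson, and the name sample.csv is an assumption.

```python
import os
import tempfile


def parse_csv(filepath):
    """One possible solution to the challenge; not the only valid one."""
    with open(filepath) as f:
        lines = f.readlines()
    if not lines:
        return []
    # First line holds the column names.
    headers = [h.strip() for h in lines[0].split(',')]
    records = []
    for line in lines[1:]:
        line = line.strip()
        if not line:  # skip empty lines
            continue
        values = [v.strip() for v in line.split(',')]
        records.append(dict(zip(headers, values)))
    return records


# Made-up sample: header, two rows, an empty line, and a row with a leading space.
sample = (
    'name,amount,region,status\n'
    'Alice Chen,1250.50,West,confirmed\n'
    'Bob Kumar,340.50,East,pending\n'
    '\n'
    ' Eve Williams,520.00,North,confirmed\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w') as f:
        f.write(sample)
    records = parse_csv(path)

print(len(records))        # 3
print(records[2]['name'])  # Eve Williams
```

The values are still strings at this point, so records[0]['amount'] is '1250.50', not a number.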

Next time: String cleaning. Stripping spaces is just the start. Next we handle case normalization, quoted fields, and other CSV gotchas that make the real world messy.