Python String Methods: strip, replace, upper, lower for Data Cleaning
Clean messy string data using Python's string methods. Learn strip, upper, lower, and title to normalize names, regions, and status fields.
Hold on. Before you say anything — I already know the problem.
OK, I'm listening.
Look at the records. 'West' and 'west' are the same region but Python doesn't know that. And the amounts... some have money and some don't, and some are probably broken. But the thing that's really bugging me is the spaces. Look:
records = parse_csv('sales.csv')
for r in records:
    print(f"Name: '{r['name']}' | Region: '{r['region']}' | Status: '{r['status']}'")
If you run that, you'll see it. Some names have leading or trailing spaces. 'Alice Chen' vs ' Alice Chen '. Same person. Different representations. That's the problem.
You're exactly right. This is why data cleaning is the first real job a programmer does. Once you parse the data, it's still messy. Welcome to every real-world dataset ever.
I didn't even get to set up the problem.
Maya beat you to it. This is what Week 4 is all about. Yesterday you parsed the CSV. Today you clean it.
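Here is a rough sketch of how you might confirm the problem before fixing anything. The rows below are made up for illustration and only stand in for what parse_csv('sales.csv') might return; the real file's contents are not shown in this lesson. repr() prints a string with its quotes and escape codes, and a set keeps one copy of each distinct value, so duplicates that differ only in spacing or case show up as extras.

```python
# Hypothetical sample rows; the real parse_csv() output may differ.
records = [
    {'name': 'Alice Chen', 'region': 'West'},
    {'name': ' Alice Chen ', 'region': 'west'},
    {'name': 'alice chen', 'region': 'WEST '},
]

# repr() shows the quotes and any hidden whitespace characters.
for r in records:
    print(repr(r['name']), repr(r['region']))

# A set keeps one copy of each distinct value.
raw_names = {r['name'] for r in records}
clean_names = {r['name'].strip().title() for r in records}

print(len(raw_names))    # 3
print(len(clean_names))  # 1
```

Three raw values collapse to one once they are stripped and title-cased, which is the whole point of the cleaning step.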
The Cleaning Pipeline
You've got six records from the CSV. They look right, but they're not. Here's what we need to fix:
- Names: Strip spaces, title-case them. ' alice chen ' becomes 'Alice Chen'
- Amounts: Convert to float. If empty or invalid, use 0.0
- Regions: Strip spaces, upper-case them. 'west' becomes 'WEST'
- Status: Strip spaces, lower-case them. 'Confirmed' becomes 'confirmed'
Python has string methods that do each of these. A string method is a function built into every string that you call with a dot:
' hello '.strip() # 'hello'
'hello'.upper() # 'HELLO'
'HELLO'.lower() # 'hello'
'hello world'.title() # 'Hello World'
strip() — Remove Leading and Trailing Whitespace
Imagine a name came from the CSV with spaces:
name = ' Alice Chen '
name.strip() # 'Alice Chen'
That's all it does. It removes spaces (and tabs, newlines, anything whitespace) from the beginning and end. It doesn't touch spaces inside the string.
text = ' hello world '
text.strip() # 'hello world' — spaces inside stay
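If you want to see the "anything whitespace" part for yourself, a small sketch like this one may help. It also shows two relatives of strip() that the lesson does not use, lstrip() and rstrip(), plus the optional argument that strips specific characters instead of whitespace. The sample strings are made up.

```python
messy = '\t  Alice Chen \n'

print(repr(messy.strip()))   # 'Alice Chen'
print(repr(messy.lstrip()))  # 'Alice Chen \n'
print(repr(messy.rstrip()))  # '\t  Alice Chen'

# Passing characters strips those characters from both ends instead of whitespace.
print('$1250.50$'.strip('$'))  # 1250.50
```

For the sales data, plain strip() is likely all you need; the one-sided versions are handy when only one end of the string is unreliable.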
So if I want to clean a name, I call .strip() first?
Yes, and then chain other methods on the result:
name = ' Alice Chen '
cleaned = name.strip().title() # 'Alice Chen'
This is called "chaining." Each method returns a string, so you can call another method on that result.
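To make chaining less magical, here is the same cleanup written one step at a time and then as a chain. It is only a sketch with a made-up name, but it shows that both routes give the same result and that the original string is never modified, because string methods return new strings.

```python
name = '  alice chen  '

step1 = name.strip()   # 'alice chen'
step2 = step1.title()  # 'Alice Chen'

chained = name.strip().title()

print(step2 == chained)  # True
print(repr(name))        # '  alice chen  ' (the original string is unchanged)
```

If you want to keep the cleaned value, you have to assign it to a variable; calling name.strip() on its own line throws the result away.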
upper() and lower() — Normalize Case
For regions, we want consistent casing. Whether the CSV says 'west', 'West', or 'WEST', we normalize to 'WEST':
region = 'west'
region.upper() # 'WEST'
region = 'West'
region.upper() # 'WEST'
For status, we normalize to lowercase:
status = 'Confirmed'
status.lower() # 'confirmed'
status = 'CONFIRMED'
status.lower() # 'confirmed'
So case normalisation sounds very corporate for 'make the letters the same size.'
That's exactly what it is. In a real company, you'd call it "standardization" in a meeting, but yeah — you're just making the letters line up so comparisons work.
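Here is a small sketch of why "making the letters line up" matters. The (region, amount) pairs are invented for illustration. Without normalizing, 'west', 'West', and ' WEST ' would each get their own total; with strip() and upper() they land under one key.

```python
# Hypothetical (region, amount) pairs; the values are made up.
sales = [('west', 100.0), ('West', 50.0), (' WEST ', 25.0), ('east', 10.0)]

print('West' == 'west')                  # False
print('West'.upper() == 'west'.upper())  # True

totals = {}
for region, amount in sales:
    key = region.strip().upper()  # normalize before using it as a dict key
    totals[key] = totals.get(key, 0.0) + amount

print(totals)  # {'WEST': 175.0, 'EAST': 10.0}
```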
title() — Capitalize Each Word
Names look better when each word is capitalized:
name = 'alice chen'
name.title() # 'Alice Chen'
Useful for people's names or job titles.
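One caveat worth knowing: title() treats every run of letters as a word, so it capitalizes after apostrophes and hyphens and lowercases everything else. The sketch below uses made-up inputs to show where that helps and where it can surprise you.

```python
print('alice chen'.title())    # Alice Chen
print('ANNA-MARIE'.title())    # Anna-Marie
print("they're here".title())  # They'Re Here
print('mcdonald'.title())      # Mcdonald
```

For the simple names in this lesson that behavior is fine, but real-world name data may need extra care.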
Putting It Together: clean_record()
Now we write a function that cleans all four fields:
def clean_record(record):
    # Names: strip whitespace, then title-case.
    name = record.get('name', '').strip().title()

    # Amounts: convert to float; empty or invalid values become 0.0.
    amount_str = record.get('amount', '').strip()
    try:
        amount = float(amount_str) if amount_str else 0.0
    except ValueError:
        amount = 0.0

    # Regions are upper-cased, statuses are lower-cased.
    region = record.get('region', '').strip().upper()
    status = record.get('status', '').strip().lower()

    return {
        'name': name,
        'amount': amount,
        'region': region,
        'status': status
    }
Let's trace through one record:
raw_record = {'name': ' alice chen ', 'amount': '1250.50', 'region': 'west', 'status': 'Confirmed'}
cleaned = clean_record(raw_record)
print(cleaned)
# {'name': 'Alice Chen', 'amount': 1250.5, 'region': 'WEST', 'status': 'confirmed'}
Notice the amount is now a float (1250.5, not '1250.50'). If the amount field is empty:
raw_record = {'name': 'Bob', 'amount': '', 'region': 'east', 'status': 'pending'}
cleaned = clean_record(raw_record)
print(cleaned['amount']) # 0.0
If it's invalid (non-numeric):
raw_record = {'name': 'Carol', 'amount': 'N/A', 'region': 'north', 'status': 'pending'}
cleaned = clean_record(raw_record)
print(cleaned['amount']) # 0.0 (safe default)
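If you want to poke at the amount logic on its own, one option is to pull it into a small helper. parse_amount below is a hypothetical name, not something from the lesson, and the test inputs are made up. It follows the same rule as clean_record(): empty or non-numeric text becomes 0.0.

```python
def parse_amount(amount_str):
    """Hypothetical helper: return a float, or 0.0 for empty or non-numeric text."""
    text = amount_str.strip()
    if not text:
        return 0.0
    try:
        return float(text)
    except ValueError:
        # float() rejects text such as 'N/A', '1,250.50', or '$99'.
        return 0.0

for raw in ['1250.50', ' 42 ', '', 'N/A', '1,250.50', '$99']:
    print(repr(raw), '->', parse_amount(raw))
```

Note that '1,250.50' and '$99' also fall back to 0.0, because float() does not understand thousands separators or currency symbols. That is safe for this exercise, but it can quietly hide real amounts, so in a real project you might prefer to log or flag those rows.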
Your Challenge
Write clean_record(record) that takes a dict (from parse_csv()) and returns a cleaned dict with:
- name: stripped and title-cased
- amount: converted to float (0.0 if empty or invalid)
- region: stripped and upper-cased
- status: stripped and lower-cased
Test it with a few records from parse_csv('sales.csv'):
records = parse_csv('sales.csv')
for record in records:
    cleaned = clean_record(record)
    print(cleaned)
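If you don't have sales.csv handy, here is one way to try the whole pipeline in a single file. Everything about the data is an assumption: SAMPLE_CSV is made-up text, and parse_rows is a hypothetical stand-in for parse_csv() built on the standard library's csv.DictReader. The clean_record() body repeats the version from this lesson so the sketch runs on its own.

```python
import csv
import io

# Made-up CSV text standing in for sales.csv.
SAMPLE_CSV = """name,amount,region,status
 alice chen ,1250.50,west,Confirmed
BOB LEE,,East ,PENDING
carol diaz,N/A, north,pending
"""

def parse_rows(text):
    """Hypothetical stand-in for parse_csv(): one dict per CSV row."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_record(record):
    name = record.get('name', '').strip().title()
    amount_str = record.get('amount', '').strip()
    try:
        amount = float(amount_str) if amount_str else 0.0
    except ValueError:
        amount = 0.0
    region = record.get('region', '').strip().upper()
    status = record.get('status', '').strip().lower()
    return {'name': name, 'amount': amount, 'region': region, 'status': status}

cleaned_records = [clean_record(r) for r in parse_rows(SAMPLE_CSV)]
for row in cleaned_records:
    print(row)
```

The first row should print as {'name': 'Alice Chen', 'amount': 1250.5, 'region': 'WEST', 'status': 'confirmed'}, matching the trace earlier in the lesson.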
You've got clean data now. But it's still just sitting in a list. Next lesson we write it back to a file — because clean data is only useful if you can get it out.
Next time: Writing CSV files with the cleaned data. Once you parse and clean, you export.