The archive system exports respondent IDs embedded in a JSON-like text dump — {"id": "R_001", ...} repeated for each row. You need just the IDs. How do you extract them without parsing the full JSON?
top_respondents_by_outcome from Day 24 showed r.get("id"). But if I only have a raw text dump with no proper JSON structure, I can't use json.loads. I'd need to search for the pattern "id": "R_..." directly in the string.
That's exactly what regular expressions do. re.findall with a pattern returns every match as a list. The pattern r'"id":\s*"([^"]+)"' finds every occurrence of "id": "some_value" and captures just the value:
import re
raw = '{"id": "R_001", "age": 29} {"id": "R_002", "age": 22}'
ids = re.findall(r'"id":\s*"([^"]+)"', raw)
# ids = ['R_001', 'R_002']
What does ([^"]+) mean in the pattern? The brackets look like a list but it can't be.
[^"]+ is a character class. The caret ^ inside brackets means "not". So [^"] matches any character except a double-quote. The + means "one or more". The parentheses () capture the match so findall returns just the captured group, not the whole match. Together: "one or more non-quote characters, captured":
import re
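def extract_respondent_ids(raw_text: str) -> list:
    # Capture the value of every "id" field in the dump.
    ids = re.findall(r'"id":\s*"([^"]+)"', raw_text)
    # clean_group_label is not defined in this section; it is assumed to come from an earlier lesson.
    cleaned = [clean_group_label(i) for i in ids]
    print(f"Extracted {len(cleaned)} IDs")
    return cleaned
So re.findall scans the entire text and returns all matches at once? No loop needed?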
findall handles the loop internally. You declare the pattern, not the iteration.
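To make "handles the loop internally" concrete, here is a rough hand-written equivalent using re.finditer. It is a comparison sketch only, reusing the sample string from earlier; the variable names ids_findall and ids_loop are made up for illustration.

```python
import re

raw = '{"id": "R_001", "age": 29} {"id": "R_002", "age": 22}'
pattern = r'"id":\s*"([^"]+)"'

# What findall gives you in a single call.
ids_findall = re.findall(pattern, raw)

# Roughly the same work written as an explicit loop:
# finditer yields one match object per non-overlapping match,
# and group(1) is the text captured by the parentheses.
ids_loop = []
for match in re.finditer(pattern, raw):
    ids_loop.append(match.group(1))

print(ids_findall)  # ['R_001', 'R_002']
print(ids_loop)     # ['R_001', 'R_002']
```

The loop version is worth reaching for only when you need more than the captured text, such as each match's position via match.start().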
Regex is the part of Python I've been afraid of. But this pattern just reads as "find everything that looks like an id field."
That instinct is right: regex is readable once you learn the handful of metacharacters. The two traps are greedy vs lazy matching (+ vs +?) and forgetting the r prefix on the pattern string; without it, Python processes backslashes as escape sequences before the regex engine sees them.
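The greedy vs lazy trap is easiest to see side by side. The sketch below uses a made-up sample line (the "name" field is hypothetical, not part of the archive dump) and compares three ways of capturing the id value.

```python
import re

# Hypothetical sample: two quoted values on one line.
text = '"id": "R_001", "name": "Ana"'

# Greedy: .+ takes as much as it can, so it runs to the LAST quote.
greedy = re.findall(r'"id": "(.+)"', text)

# Lazy: .+? takes as little as it can, so it stops at the FIRST closing quote.
lazy = re.findall(r'"id": "(.+?)"', text)

# Negated class: [^"]+ can never cross a quote, greedy or not.
negated = re.findall(r'"id": "([^"]+)"', text)

print(greedy)   # ['R_001", "name": "Ana']
print(lazy)     # ['R_001']
print(negated)  # ['R_001']
```

The negated class used throughout this lesson sidesteps the greedy problem, because the match is forced to stop at the next quote.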
re.findall(pattern, text) returns a list of all non-overlapping matches.
import re
ids = re.findall(r'"id":\s*"([^"]+)"', raw_text)

| Part | Meaning |
|---|---|
| `"id":` | literal string |
| `\s*` | zero or more whitespace |
| `"` | literal quote |
| `([^"]+)` | capture group: one or more non-quote chars |
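The same breakdown can live inside the pattern itself. This is an optional sketch using the re.VERBOSE flag, which lets a pattern span several lines with comments; the name ID_PATTERN is an illustrative choice, not something from the lesson.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and # starts a comment.
ID_PATTERN = re.compile(
    r'''
    "id":      # the literal text "id":
    \s*        # zero or more whitespace characters
    "          # the opening quote of the value
    ([^"]+)    # capture group: one or more non-quote characters
    "          # the closing quote of the value
    ''',
    re.VERBOSE,
)

raw = '{"id": "R_001", "age": 29} {"id": "R_002", "age": 22}'
print(ID_PATTERN.findall(raw))  # ['R_001', 'R_002']
```

It matches the same text as the one-line pattern; the only difference is readability.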
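re.search(pattern, text) returns the first match object (or None). re.findall returns a list of all matches as strings; when the pattern has one capture group, each string is the captured text rather than the whole match. Use search when you need the first occurrence; findall when you need all of them.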
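A small comparison of the two calls on the sample dump from earlier, including the no-match case. Treat it as a sketch; the string "no ids here" is just a stand-in for text without any id fields.

```python
import re

raw = '{"id": "R_001", "age": 29} {"id": "R_002", "age": 22}'
pattern = r'"id":\s*"([^"]+)"'

# search stops at the first match and returns a match object (or None).
first = re.search(pattern, raw)
if first is not None:
    print(first.group(0))  # whole match: "id": "R_001"
    print(first.group(1))  # captured group only: R_001

# findall keeps going and returns every captured value.
print(re.findall(pattern, raw))  # ['R_001', 'R_002']

# When nothing matches, search gives None and findall gives an empty list.
print(re.search(pattern, "no ids here"))   # None
print(re.findall(pattern, "no ids here"))  # []
```

Because search can return None, check the result before calling .group() on it.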
Always use r"pattern" for regex patterns — the r prevents Python from processing backslashes before the regex engine sees them.
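Here is a short illustration of what goes wrong without the r prefix. It uses \b (a word boundary in regex, but a backspace character in an ordinary Python string); the sample text and variable names are invented for the example.

```python
import re

text = "ids: R_001, R_002"

# Without r, Python turns \b into a single backspace character
# before the regex engine ever sees the pattern.
plain_pattern = "\bR_001\b"

# With r, the backslash survives and regex reads \b as a word boundary.
raw_pattern = r"\bR_001\b"

print(len("\b"), len(r"\b"))            # 1 2
print(re.findall(plain_pattern, text))  # []
print(re.findall(raw_pattern, text))    # ['R_001']
```

The plain version fails silently: it is a valid pattern, it just looks for backspace characters that are not in the text.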
Nadia received a raw text dump from the archive system containing respondent records as JSON-like strings. Write `extract_respondent_ids(raw_text)` that uses `re.findall` to extract all respondent ID values from patterns like `"id": "R_001"` and returns a list of cleaned ID strings.