Your thesis has an open-text question asking respondents to label their main concern. The responses arrive as messy strings like '{"theme": "workload", "detail": "too many exams"}'. Your advisor wants a list of just the theme labels. How do you extract them?
top_groups_by_score from yesterday handles structured fields. But free text is different — the themes are buried inside strings, not in named columns.
re.findall is the extractor. Give it a pattern, give it the text, get back a list of every match. r'"theme":\s*"([^"]+)"' captures the value between the quotes after "theme": — the ([^"]+) group means one or more characters that are not a quote:
import re
text = '{"theme": "workload", "detail": "exams"} {"theme": "commute"}'
themes = re.findall(r'"theme":\s*"([^"]+)"', text)
print(themes)  # ['workload', 'commute']
So in ([^"]+), the [^...] means "not these characters", and the parentheses capture the match?
Exactly right. [^"]+ is a negated character class: it matches one or more characters that are not ". Wrapping it in () captures just that part of the match, so re.findall returns a list of the captured values from the whole text:
import re

def extract_free_text_themes(raw_text: str) -> list:
    """Extract theme labels from JSON-like survey free-text responses."""
    themes = re.findall(r'"theme":\s*"([^"]+)"', raw_text)
    # clean_response_text is the Day 4 helper; it normalises each label.
    cleaned = [clean_response_text(t) for t in themes]
    print(f"Found {len(cleaned)} themes")
    return cleaned
I'm using clean_response_text from Day 4 in the comprehension, normalising every extracted theme before returning the list.
Three weeks of functions. The Day 4 cleaner is still pulling its weight on Day 25.
I can feed the full open-text column from Qualtrics into this and get back a clean list of theme labels ready for frequency analysis.
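For readers who do not have the Day 4 helper at hand, here is a minimal self-contained sketch of the same idea. The body of clean_response_text below is a hypothetical stand-in (strip and lowercase); the real Day 4 version may do more. The sample rows are invented for illustration and are not real survey data.

```python
import re

# Hypothetical stand-in for the Day 4 helper; the real one may differ.
def clean_response_text(text: str) -> str:
    return text.strip().lower()

THEME_PATTERN = re.compile(r'"theme":\s*"([^"]+)"')

def extract_free_text_themes(raw_text: str) -> list:
    """Extract and normalise theme labels from JSON-like free-text responses."""
    themes = THEME_PATTERN.findall(raw_text)
    return [clean_response_text(t) for t in themes]

# Invented rows standing in for an exported open-text column.
column = [
    '{"theme": "Workload", "detail": "too many exams"}',
    '{"theme":"commute "}',
    'no structured answer here',
]
all_themes = extract_free_text_themes("\n".join(column))
print(all_themes)  # ['workload', 'commute']
```

Joining the rows with newlines lets a single findall call scan the whole column; rows without a theme field simply contribute nothing to the result.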
Regex patterns are precise but brittle — a different formatting of the theme field will produce zero matches. Always test your pattern against at least three real samples from your actual data before relying on it in the pipeline.
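One way to act on this advice is a small pre-flight check that reports which samples the pattern misses. This is only a sketch: the helper name find_unmatched is hypothetical and the samples are invented, so swap in real rows from your own export.

```python
import re

THEME_PATTERN = re.compile(r'"theme":\s*"([^"]+)"')

def find_unmatched(samples: list) -> list:
    """Return the samples in which the theme pattern finds nothing."""
    return [s for s in samples if not THEME_PATTERN.search(s)]

# Invented samples; replace with real rows from your own data.
samples = [
    '{"theme": "workload"}',    # matches
    "{'theme': 'workload'}",    # single quotes: no match
    '{"Theme": "workload"}',    # capitalised key: no match
    '{"theme" : "workload"}',   # space before the colon: no match
]
missed = find_unmatched(samples)
print(f"{len(missed)} of {len(samples)} samples did not match")
```

Three of the four variants above look nearly identical to a human reader yet produce no match, which is the brittleness the paragraph describes.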
import re
matches = re.findall(r'pattern', text)
re.findall returns a list of all non-overlapping matches. With one capture group (...), it returns the captured content of each match instead of the whole match.
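The shape of what re.findall returns depends on how many capture groups the pattern contains. The sketch below reuses the lesson's sample text; the two-group pattern is an illustrative addition and is not part of the lesson's function.

```python
import re

text = '{"theme": "workload", "detail": "exams"} {"theme": "commute"}'

# No capture group: each item is the whole match.
whole = re.findall(r'"theme":\s*"[^"]+"', text)
print(whole)  # ['"theme": "workload"', '"theme": "commute"']

# One capture group: each item is just the captured text.
values = re.findall(r'"theme":\s*"([^"]+)"', text)
print(values)  # ['workload', 'commute']

# Two capture groups: each item is a tuple with one entry per group.
pairs = re.findall(r'"(\w+)":\s*"([^"]+)"', text)
print(pairs)  # [('theme', 'workload'), ('detail', 'exams'), ('theme', 'commute')]
```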
| Token | Meaning |
|---|---|
| `"theme":` | Literal text |
| `\s*` | Zero or more whitespace characters |
| `([^"]+)` | One or more non-quote characters (captured) |
Reach for regex when extracting structured fields (key: value patterns), matching phone or email formats, or parsing code. For simple splits and replaces, string methods are faster and clearer.
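To make that trade-off concrete, here is a short comparison. The sample strings are invented; the point is only that a fixed delimiter needs no pattern, while a value embedded in a key: value structure does.

```python
import re

# Simple, fixed delimiter: string methods are enough.
row = "workload; commute ; housing"
labels = [part.strip() for part in row.split(";")]
print(labels)  # ['workload', 'commute', 'housing']

tidy = "too_many_exams".replace("_", " ")
print(tidy)  # too many exams

# Value embedded in a key: value structure: a regex earns its keep.
raw = 'id=7 {"theme": "housing", "detail": "rent"}'
themes = re.findall(r'"theme":\s*"([^"]+)"', raw)
print(themes)  # ['housing']
```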
You have an open-text question in your survey where respondents label their main concern as a JSON-like string. You need to extract just the theme values for frequency analysis. Write `extract_free_text_themes(raw_text)` that uses `re.findall` to extract all theme labels from strings like `'{"theme": "workload"}'` and returns them as a cleaned list.