A regex (regular expression) is a tiny pattern language for matching shapes in strings. The simplest regex is a literal string — but the power comes from the special characters.
import re
re.findall(r"\d+", "a1 b22 c333") # ['1', '22', '333']
\d is a digit, + means one or more. r"..." is a raw string, so Python leaves the backslash alone instead of treating it as an escape.
All correct. r"\d+" reads: "one or more digit characters in a row." findall returns every non-overlapping match.
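Both claims can be checked in a few lines. This is a minimal sketch; the "aaaaa" input is a made-up sample chosen to show what "non-overlapping" means in practice.

```python
import re

# A raw string keeps the backslash as-is, so these two spellings
# describe exactly the same pattern.
raw_pattern = r"\d+"
escaped_pattern = "\\d+"
same_pattern = raw_pattern == escaped_pattern

# findall scans left to right and returns every non-overlapping match.
numbers = re.findall(raw_pattern, "a1 b22 c333")

# Non-overlapping: once "aa" is consumed, the next search starts after it,
# so five a's yield two pairs and one leftover.
pairs = re.findall(r"aa", "aaaaa")

print(same_pattern)  # True
print(numbers)       # ['1', '22', '333']
print(pairs)         # ['aa', 'aa']
```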
What's the difference between re.search and re.findall?
Different return shapes:
| Function | Returns | When to use |
|---|---|---|
| re.search(pat, text) | a Match object (or None) for the first match | "is the pattern in there at all?" |
| re.findall(pat, text) | a list of every match | "give me all of them" |
| re.match(pat, text) | Match only if pattern matches at the start | rarely what you want; prefer search with ^ anchor |
A Match object holds .group() for the matched text and .start() / .end() for positions.
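Here is a small side-by-side sketch of the three functions on one string. The sample text is invented for illustration; the calls themselves are the ones from the table.

```python
import re

text = "order 42 shipped, order 7 pending"  # made-up sample text

# search: a Match object for the first hit, or None if there is none.
first = re.search(r"\d+", text)
if first is not None:
    print(first.group(), first.start(), first.end())  # 42 6 8

# findall: plain strings, one per match.
all_numbers = re.findall(r"\d+", text)
print(all_numbers)  # ['42', '7']

# match: succeeds only if the pattern fits at position 0.
at_start = re.match(r"\d+", text)      # text starts with "order", so None
also_start = re.match(r"order", text)  # fits at the start, so a Match
print(at_start, also_start.group())    # None order
```

Because search and match can both return None, checking the result before calling .group() avoids an AttributeError.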
And the special characters — there's a whole language?
A small one. The pieces you'll use 90% of the time:
| Piece | Matches |
|---|---|
| \d | one digit |
| \w | one word character (letter, digit, underscore) |
| \s | one whitespace character |
| . | any single character (except a newline, by default) |
| + | one or more of the previous |
| * | zero or more of the previous |
| ? | zero or one of the previous |
| [abc] | any one of a, b, c |
| ^, $ | start, end of string |
More than enough for most parsing tasks.
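To make the table concrete, here is a rough sketch that uses each piece once. All input strings are made up for illustration.

```python
import re

# Each pattern below uses only pieces from the table above.
words = re.findall(r"\w+", "hello, big world")        # runs of word characters
three_letter = re.findall(r"c.t", "cat cot cut ct")   # . stands for exactly one character
spelling = re.search(r"colou?r", "favorite color")    # ? makes the u optional
abc_chars = re.findall(r"[abc]", "a cab")             # one of a, b, c at a time
whitespace = re.findall(r"\s", "a b\tc")              # a space and a tab
all_digits = re.search(r"^\d+$", "12345")             # anchored at both ends
not_all_digits = re.search(r"^\d+$", "123a")          # the trailing a breaks the match

print(words)            # ['hello', 'big', 'world']
print(three_letter)     # ['cat', 'cot', 'cut']
print(spelling.group()) # color
print(abc_chars)        # ['a', 'c', 'a', 'b']
print(whitespace)       # [' ', '\t']
print(all_digits is not None, not_all_digits is None)  # True True
```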
import re: the regex module
import re
re.search(pattern, text) # first match (or None)
re.findall(pattern, text) # all matches as list
re.sub(pattern, replacement, text) # replace all matches
Raw strings (r"...") for patterns
Regex patterns use backslashes liberally (\d, \w, \.). Plain Python strings interpret backslashes too: "\d" works by accident, but "\b" doesn't (\b is a backspace character to Python). Raw strings turn that off:
re.findall(r"\d+", text) # GOOD
re.findall("\d+", text) # RISKY: works only because \d is not a Python escape sequence
r"\d" # one digit 0-9
r"\D" # one NON-digit
r"\w" # one word char a-z A-Z 0-9 _
r"\W" # one NON-word char
r"\s" # one whitespace char space, tab, newline
r"\S" # one NON-whitespace char
r"." # any character (except newline)
r"[abc]" # one of a, b, c
r"[a-z]" # one lowercase letter
r"[^abc]" # one character that is NOT a, b, c
r"a+" # one or more 'a'
r"a*" # zero or more 'a'
r"a?" # zero or one 'a'
r"a{3}" # exactly 3 'a's
r"a{2,4}" # 2 to 4 'a's
Quantifiers attach to the previous element. \d+ means one or more digits.
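The quantifiers and character classes above can be tried out on a few made-up inputs. This is only a sketch; the sample strings are assumptions chosen so that each result is easy to check by eye.

```python
import re

sample = "a aa aaa aaaaa"  # made-up input: runs of 1, 2, 3 and 5 a's

plus = re.findall(r"a+", sample)              # each whole run
exactly_three = re.findall(r"a{3}", sample)   # the run of 5 yields one 'aaa', 2 left over
two_to_four = re.findall(r"a{2,4}", sample)   # the run of 5 yields 'aaaa', 1 left over

# The quantifier attaches to the previous element only: here the b, not "ab".
star = re.findall(r"ab*", "a ab abbb")
optional = re.findall(r"ab?", "a ab abbb")

# A character class counts as one element, so it can be quantified too.
lowercase_runs = re.findall(r"[a-z]+", "Hello World 42")
not_abc = re.findall(r"[^abc]+", "abcdefcab")

print(plus)            # ['a', 'aa', 'aaa', 'aaaaa']
print(exactly_three)   # ['aaa', 'aaa']
print(two_to_four)     # ['aa', 'aaa', 'aaaa']
print(star)            # ['a', 'ab', 'abbb']
print(optional)        # ['a', 'ab', 'ab']
print(lowercase_runs)  # ['ello', 'orld']
print(not_abc)         # ['def']
```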
r"^abc" # 'abc' at start of string
r"abc$" # 'abc' at end of string
r"^abc$" # whole string is exactly 'abc' (one trailing newline is also tolerated)
Parentheses capture a sub-match:
m = re.search(r"(\d+)-(\d+)", "call 555-1234 today")
m.group(0) # '555-1234' — the whole match
m.group(1) # '555' — first capture group
m.group(2) # '1234' — second capture group
re.findall with groups returns tuples:
re.findall(r"(\d+)-(\d+)", "555-1234 and 800-5555")
# [('555', '1234'), ('800', '5555')]
re.sub(r"\d+", "X", "a1 b22 c333") # 'aX bX cX'
The dot (.) doesn't match a newline by default. Pass flags=re.DOTALL if you need it to.
re.match only matches at start of string. Use re.search (with ^ if you want to anchor) to avoid surprise.
r"a.+b" against "axxxbxxxb" matches the whole string (greedy), not the shortest match. Add ? for non-greedy: r"a.+?b".
str.split is faster and clearer.
"abc" in text membership check.
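The gotchas are easy to reproduce. The following is a minimal sketch with invented inputs (the "id 42" string and the comma-separated sample are assumptions for illustration), placing each surprising case next to its fix.

```python
import re

# Greedy vs. non-greedy: .+ grabs as much as it can, .+? as little as it can.
greedy = re.search(r"a.+b", "axxxbxxxb").group()
lazy = re.search(r"a.+?b", "axxxbxxxb").group()

# The dot skips newlines unless re.DOTALL is passed.
without_flag = re.search(r"a.b", "a\nb")
with_flag = re.search(r"a.b", "a\nb", flags=re.DOTALL)

# re.match is anchored at the start; re.search looks anywhere.
match_result = re.match(r"\d+", "id 42")
search_result = re.search(r"\d+", "id 42")

# Plain string tools cover the simple cases without any pattern.
fields = "a,b,c".split(",")
contains = "abc" in "xxabcxx"

print(greedy, lazy)                         # axxxbxxxb axxxb
print(without_flag, with_flag is not None)  # None True
print(match_result, search_result.group())  # None 42
print(fields, contains)                     # ['a', 'b', 'c'] True
```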