Regular Expressions - search, findall, match & groups (Python Beginner Lesson #26)
Video: Regular Expressions - search, findall, match & groups (Python Beginner Lesson #26) by Taught by Celeste AI - AI Coding Coach
Python Regular Expressions: re.search, findall, sub, groups
re.search(pattern, text) finds the first match. re.findall finds all. re.sub replaces. Use raw strings (r"...") for patterns. Backslash sequences: \d digit, \w word char, \s whitespace. Quantifiers: ? * + {n,m}.
Regex is a compact language for describing string patterns. Python's re module covers it all.
search vs match vs findall
import re
text = "The quick brown fox jumps over the lazy dog"
# Search anywhere
result = re.search(r"fox", text)
print(result.group()) # 'fox'
print(result.start()) # 16
# Match only at the START
result = re.match(r"fox", text)
print(result) # None — doesn't start with "fox"
# Find all occurrences
phones = re.findall(r"\d{3}-\d{4}", "Call 555-1234 or 555-5678")
print(phones) # ['555-1234', '555-5678']
re.search: finds the first match anywhere; returns a Match object or None.
re.match: like search, but only at the start of the string.
re.findall: returns a list of all matches (or all groups, see below).
re.finditer: same, but yields Match objects (for groups, positions).
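As a small illustration of how the four functions differ, the sketch below runs each on the same string. The sample text and variable names are made up for this example. It also shows the usual guard against None before calling .group().

```python
import re

sample = "Call 555-1234 or 555-5678"  # made-up sample text
pattern = r"\d{3}-\d{4}"

first = re.search(pattern, sample)      # Match object for the first hit
at_start = re.match(pattern, sample)    # None: the string starts with "Call"
all_hits = re.findall(pattern, sample)  # plain list of strings

# finditer yields Match objects, so positions are available too
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, sample)]

# search and match return None on failure, so check before using the result
if first is not None:
    print(first.group())  # 555-1234
print(at_start)           # None
print(all_hits)           # ['555-1234', '555-5678']
print(spans)              # [('555-1234', 5, 13), ('555-5678', 17, 25)]
```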
Use raw strings: r"..."
re.search(r"\d+", text) # YES
re.search("\d+", text) # works, but "\d" is an invalid string escape: DeprecationWarning (SyntaxWarning since Python 3.12)
re.search("\\d+", text) # also works but ugly
Always prefix with r. Otherwise Python tries to interpret backslash sequences before regex sees them.
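One way to see the problem is with \b, which is a valid escape in both worlds: backspace in a normal Python string, word boundary in regex. The sketch below (sample text is made up) compares the two spellings.

```python
import re

# In a normal string "\b" is one character (backspace).
# In a raw string r"\b" is two characters, which regex reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))            # 5 7
print(re.findall(plain, "a cat sat"))  # []  (looks for literal backspaces)
print(re.findall(raw, "a cat sat"))    # ['cat']
```

No warning is raised for the first pattern, which is what makes this mistake easy to miss.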
Metacharacters
| Pattern | Meaning |
|---|---|
| . | Any single char (except newline) |
| \d | Digit [0-9] |
| \D | Not digit |
| \w | Word char [a-zA-Z0-9_] |
| \W | Not word char |
| \s | Whitespace |
| \S | Not whitespace |
| \b | Word boundary |
| ^ | Start of string |
| $ | End of string |
text = "Order 42 shipped on 2024-01-15"
print(re.findall(r"\d+", text)) # ['42', '2024', '01', '15']
print(re.split(r"\s+", "hello world python"))
# ['hello', 'world', 'python']
Quantifiers
| Pattern | Meaning |
|---|---|
| ? | 0 or 1 |
| * | 0 or more |
| + | 1 or more |
| {n} | exactly n |
| {n,m} | between n and m |
text = "color colour colouur"
re.findall(r"colou?r", text) # ['color', 'colour'] — 0 or 1 'u'
re.findall(r"colou+r", text) # ['colour', 'colouur'] — 1 or more
re.findall(r"colou*r", text) # ['color', 'colour', 'colouur']
re.findall(r"\b\d{3}\b", "1 22 333 4444") # ['333'] — exactly 3
re.findall(r"\b\d{2,4}\b", "1 22 333 4444") # ['22', '333', '4444']
\b is a word boundary — matches between a word char and a non-word char. Without it, \d{3} would match the first 3 digits of 4444.
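A quick side-by-side makes the effect of \b visible. This is a minimal sketch reusing the same input string as above.

```python
import re

numbers = "1 22 333 4444"

with_boundary = re.findall(r"\b\d{3}\b", numbers)
without_boundary = re.findall(r"\d{3}", numbers)

print(with_boundary)     # ['333']
print(without_boundary)  # ['333', '444']  (the '444' comes from inside 4444)
```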
Character classes
re.findall(r"[a-z]+", "Hello World") # ['ello', 'orld']
re.findall(r"[A-Z][a-z]+", "Hello World") # ['Hello', 'World']
# Negation: ^
re.findall(r"[^0-9]+", "abc 123 xyz") # ['abc ', ' xyz']
# Or alternation
re.findall(r"cat|dog", "I have a cat and a dog") # ['cat', 'dog']
[abc] matches one of a, b, or c. [^abc] matches anything except a, b, or c. Inside [], most metacharacters lose their special meaning — [.+] is literal dot and plus.
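To see that dot and plus really are literal inside brackets, here is a short sketch on a made-up expression string.

```python
import re

expr = "3.5+2*x"  # made-up sample

literal_hits = re.findall(r"[.+]", expr)   # dot and plus as plain characters
non_numeric = re.findall(r"[^0-9.]", expr) # anything that is not a digit or dot
numbers = re.findall(r"[0-9.]+", expr)     # runs of digits and dots

print(literal_hits)  # ['.', '+']
print(non_numeric)   # ['+', '*', 'x']
print(numbers)       # ['3.5', '2']
```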
Groups: capturing parts of the match
text = "John is 30, Jane is 25"
# Use parentheses to capture
matches = re.findall(r"(\w+) is (\d+)", text)
print(matches)
# [('John', '30'), ('Jane', '25')]
# Or with search:
m = re.search(r"(\w+) is (\d+)", text)
print(m.group()) # 'John is 30' — full match
print(m.group(1)) # 'John' — first group
print(m.group(2)) # '30' — second group
print(m.groups()) # ('John', '30') — all groups
(...) captures. findall returns a list of tuples when the pattern has two or more groups, and a list of that group's strings when it has exactly one. search returns a Match; use .group(i) for the i-th capture and .group() or .group(0) for the whole match.
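What findall hands back depends on how many groups the pattern has. The sketch below runs three variants of the same pattern on the text from above, then turns the tuples into a dict (the ages name is just for illustration).

```python
import re

text = "John is 30, Jane is 25"

no_groups = re.findall(r"\w+ is \d+", text)       # whole matches
one_group = re.findall(r"\w+ is (\d+)", text)     # only the captured part
two_groups = re.findall(r"(\w+) is (\d+)", text)  # one tuple per match

print(no_groups)   # ['John is 30', 'Jane is 25']
print(one_group)   # ['30', '25']
print(two_groups)  # [('John', '30'), ('Jane', '25')]

# Captures are always strings; convert them yourself
ages = {name: int(age) for name, age in two_groups}
print(ages)  # {'John': 30, 'Jane': 25}
```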
Named groups
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(pattern, "Today is 2024-03-15")
print(m.group("year")) # '2024'
print(m.group("month")) # '03'
print(m.group("day")) # '15'
print(m.groupdict()) # {'year': '2024', 'month': '03', 'day': '15'}
(?P<name>...) names a group. More readable for complex patterns; result is a dict via groupdict.
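Because groupdict returns plain strings keyed by group name, it plugs nicely into other code. As a sketch, the hypothetical helper parse_date below converts the captures to integers and builds a datetime.date; the helper name and the None-on-failure behavior are choices made for this example.

```python
import re
from datetime import date

DATE_RE = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

def parse_date(text):
    """Hypothetical helper: return the first YYYY-MM-DD in text as a date, or None."""
    m = re.search(DATE_RE, text)
    if m is None:
        return None
    # groupdict() gives {'year': '2024', ...}; convert the values to int
    parts = {name: int(value) for name, value in m.groupdict().items()}
    return date(**parts)

print(parse_date("Today is 2024-03-15"))  # 2024-03-15
print(parse_date("no date here"))         # None
```

The group names match the keyword arguments of date on purpose, which is what lets date(**parts) work.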
Substitution: re.sub
text = "Call 555-1234 or 555-5678"
cleaned = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(cleaned) # 'Call XXX-XXXX or XXX-XXXX'
re.sub(pattern, replacement, text) replaces all matches.
The replacement can reference groups:
# Reorder: "First Last" → "Last, First"
result = re.sub(r"(\w+) (\w+)", r"\2, \1", "John Smith")
print(result) # 'Smith, John'
\1, \2 reference captured groups. Or use a function:
result = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "1 2 3")
print(result) # '2 4 6'
The function receives the Match object and returns the replacement string.
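For anything longer than a one-liner, a named function reads better than a lambda. The sketch below uses a hypothetical mask_digits function to hide all but the last four digits of a long number, and also shows the count argument of re.sub. The sample strings are made up.

```python
import re

def mask_digits(m):
    """Hypothetical replacement function: keep the last four digits, star the rest."""
    digits = m.group()
    return "*" * (len(digits) - 4) + digits[-4:]

text = "Card 1234567812345678 on file"
masked = re.sub(r"\d{8,}", mask_digits, text)
print(masked)  # Card ************5678 on file

# count limits how many matches get replaced
limited = re.sub(r"\d+", "N", "1 2 3", count=2)
print(limited)  # N N 3
```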
Compile for reuse
pattern = re.compile(r"\b[A-Z][a-z]+\b")
names = pattern.findall("Alice met Bob and Charlie") # ['Alice', 'Bob', 'Charlie']
result = pattern.sub("NAME", "Alice met Bob") # 'NAME met NAME'
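re.compile(pattern) returns a Pattern object with the same methods (search, findall, sub, and so on). The re module already caches recently used patterns, so the speedup is usually modest, but compiling once skips the cache lookup on every call and gives the pattern a name. Use it when you'll match the same pattern many times, especially in loops.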
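A typical use is compiling once at module level and calling the pattern's methods inside a loop. This is a minimal sketch; NAME_RE and the sample lines are made up.

```python
import re

# Compile once, outside the loop
NAME_RE = re.compile(r"\b[A-Z][a-z]+\b")

lines = ["Alice met Bob", "nobody here", "Charlie waved at Dana"]

found = [NAME_RE.findall(line) for line in lines]
for names in found:
    print(names)
# ['Alice', 'Bob']
# []
# ['Charlie', 'Dana']
```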
Anchors and lookarounds
# Start/end
re.findall(r"^\w+", "Hello world") # ['Hello']
re.findall(r"\w+$", "Hello world") # ['world']
# Lookahead: (?=...) — match if followed by, but don't consume
re.findall(r"\w+(?= dollars)", "100 dollars 200 euros")
# ['100']
# Negative lookahead: (?!...)
re.findall(r"\w+(?! dollars)", "100 dollars 200 euros")
# ['10', 'dollars', '200', 'euros']: \w+ backtracks to '10', so be careful with these
Lookarounds are zero-width assertions — they check but don't consume characters. Use sparingly; they make patterns hard to read.
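Negative lookahead is the one that surprises people, because the quantifier in front of it is allowed to backtrack. The sketch below compares the loose pattern with one possible tightening that uses \b so a number cannot be cut short.

```python
import re

text = "100 dollars 200 euros"

# \w+ can give back a character, so '10' (followed by '0', not ' dollars') passes
loose = re.findall(r"\w+(?! dollars)", text)

# \b\d+\b only accepts whole numbers, so '100' is rejected outright
strict = re.findall(r"\b\d+\b(?! dollars)", text)

print(loose)   # ['10', 'dollars', '200', 'euros']
print(strict)  # ['200']
```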
Flags
re.findall(r"hello", "Hello WORLD hello", re.IGNORECASE)
# ['Hello', 'hello']
re.findall(r"^line", text, re.MULTILINE) # ^ matches start of any line
re.search(r"""foo # comment
.*bar""", text, re.VERBOSE) # whitespace and comments in the pattern are ignored; a comment runs to a real line break
Common flags:
re.IGNORECASE: case insensitive.
re.MULTILINE: ^ and $ match at line boundaries.
re.DOTALL: . matches newlines too.
re.VERBOSE: lets the pattern contain whitespace and # comments for readability.
Combine with |: re.IGNORECASE | re.MULTILINE.
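Here is a small sketch that exercises each flag on a made-up three-line string, including a combined pair and a VERBOSE pattern written across several lines.

```python
import re

log = "Error: disk full\nerror: retry\nok"  # made-up sample with two newlines

# IGNORECASE | MULTILINE: ^ matches at each line start, case ignored
combined = re.findall(r"^error", log, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['Error', 'error']

# Without DOTALL, "." refuses to cross the newline
no_dotall = re.search(r"full.error", log)
dotall_hit = re.search(r"full.error", log, re.DOTALL).group()
print(no_dotall)         # None
print(repr(dotall_hit))  # 'full\nerror'

# VERBOSE: whitespace and # comments inside the pattern are ignored
phone = re.compile(r"""
    \d{3}   # three digits
    -       # separator
    \d{4}   # four digits
""", re.VERBOSE)
print(phone.findall("Call 555-1234"))  # ['555-1234']
```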
Greedy vs lazy
text = "<b>bold</b> and <i>italic</i>"
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>'] — greedy, matched entire string
re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>'] — lazy, minimal match
Quantifiers are greedy by default — match as much as possible. Add ? to make them lazy — match as little as possible.
For HTML-like content, lazy is usually what you want. (Better still, use a real parser like BeautifulSoup — regex doesn't handle nesting.)
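Another common way to stop at the first closing bracket is a negated character class instead of a lazy quantifier. This is an alternative technique, not something the lesson's examples rely on. On this input it gives the same result as the lazy version.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

lazy = re.findall(r"<.+?>", text)       # lazy quantifier
negated = re.findall(r"<[^>]+>", text)  # "anything except >" can never run past a tag

print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```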
A text processor
log_line = "2024-03-15 14:32:01 [ERROR] Failed to connect: timeout"
m = re.search(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<msg>.+)",
    log_line
)
print(m.groupdict())
# {'date': '2024-03-15', 'time': '14:32:01', 'level': 'ERROR', 'msg': 'Failed to connect: timeout'}
Named groups + structured pattern = mini parser.
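Extending the idea to several lines might look like the sketch below. The names LOG_RE and parse_lines, the extra sample lines, and the choice to silently skip non-matching lines are all assumptions for illustration.

```python
import re

# Same pattern as above, compiled once and split across two raw strings
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<msg>.+)"
)

lines = [
    "2024-03-15 14:32:01 [ERROR] Failed to connect: timeout",
    "2024-03-15 14:32:05 [INFO] Retrying",
    "garbage line",  # does not fit the format
]

def parse_lines(lines):
    """Hypothetical helper: return one dict per line that matches the log format."""
    records = []
    for line in lines:
        m = LOG_RE.match(line)  # match: the date must be at the start of the line
        if m is None:
            continue            # skip lines that don't fit
        records.append(m.groupdict())
    return records

records = parse_lines(lines)
errors = [r["msg"] for r in records if r["level"] == "ERROR"]
print(len(records))  # 2
print(errors)        # ['Failed to connect: timeout']
```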
Common stumbles
Forgetting r"...". Backslash sequences get garbled. Always raw strings for patterns.
Greedy when you wanted lazy. <.+> over-matches. Use <.+?>.
Using regex for nested structures. HTML, JSON, XML — don't. Use a real parser. Regex can't count balanced parens.
re.match vs re.search. match requires start of string. search matches anywhere. Beginners mix them up.
Anchors apply to the whole string. ^abc$ is tested against the entire string by default; for line-by-line matching, add re.MULTILINE.
Special chars in []. Most metacharacters lose their meaning inside a character class: [.+] is a literal dot or plus. But ], \, ^ (at the start), and - (between two other characters) still need escaping.
Forgetting to escape the dot. r"3.14" matches 3X14 too. Use r"3\.14".
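The unescaped-dot stumble is easy to check directly. The sketch below also uses re.escape, a standard-library helper not covered earlier in this lesson, which escapes a literal string for you when it comes from a variable.

```python
import re

unescaped = bool(re.search(r"3.14", "3X14"))      # "." matches the X
escaped = bool(re.search(r"3\.14", "3X14"))       # literal dot required
real_hit = bool(re.search(r"3\.14", "pi is 3.14"))

print(unescaped, escaped, real_hit)  # True False True

# re.escape builds the escaped pattern from a plain string
literal = "3.14"
safe_pattern = re.escape(literal)
print(safe_pattern)                                # 3\.14
print(re.findall(safe_pattern, "3.14 and 3X14"))   # ['3.14']
```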
What's next
Lesson 27: virtual environments and pip. venv, pip install, requirements.txt, the project setup workflow.
Recap
re.search for first match, re.findall for all, re.sub for replace, re.compile for reuse. Always raw strings (r"..."). Metacharacters: \d \w \s . * + ? {n,m} ^ $ \b. [abc] and [^abc] for character classes. (...) to capture; (?P<name>...) to name. ? after + or * for lazy. Flags via re.IGNORECASE, re.MULTILINE, re.DOTALL, re.VERBOSE. Don't use regex for nested formats.