Regular Expressions - search, findall, match & groups (Python Beginner Lesson #26)

Sandy Lane

•April 18, 2026

Video: Regular Expressions - search, findall, match & groups (Python Beginner Lesson #26) by Taught by Celeste AI - AI Coding Coach

Python Regular Expressions: re.search, findall, sub, groups

re.search(pattern, text) finds the first match. re.findall finds all of them. re.sub replaces them. Use raw strings (r"...") for patterns. Backslash sequences: \d digit, \w word char, \s whitespace. Quantifiers: ? * + {n,m}.

Regex is a compact language for describing string patterns. Python's re module covers it all.

search vs match vs findall

import re

text = "The quick brown fox jumps over the lazy dog"

# Search anywhere
result = re.search(r"fox", text)
print(result.group())     # 'fox'
print(result.start())     # 16

# Match only at the START
result = re.match(r"fox", text)
print(result)             # None — doesn't start with "fox"

# Find all occurrences
phones = re.findall(r"\d{3}-\d{4}", "Call 555-1234 or 555-5678")
print(phones)             # ['555-1234', '555-5678']

re.search — find first match anywhere; returns Match object or None.
re.match — like search but only at the start of the string.
re.findall — list of all matches (or all groups, see below).
re.finditer — same but yields Match objects (for groups, positions).
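A minimal sketch of the two points above that trip people up most: finditer gives positions as well as text, and search returns None when nothing matches. The variable names (hits, found) are only for illustration.

```python
import re

text = "Call 555-1234 or 555-5678"

# finditer yields one Match object per hit, so start/end positions are available.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d{3}-\d{4}", text)]
print(hits)     # [('555-1234', 5, 13), ('555-5678', 17, 25)]

# search returns None when nothing matches, so check before calling .group().
m = re.search(r"\d{3}-\d{4}", "no numbers here")
found = m.group() if m else None
print(found)    # None
```

Calling .group() directly on a failed search raises AttributeError, which is why the check matters.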

Use raw strings: r"..."

re.search(r"\d+", text)     # YES
re.search("\d+", text)      # works, but emits DeprecationWarning (SyntaxWarning since Python 3.12)
re.search("\\d+", text)     # also works but ugly

Always prefix with r. Otherwise Python tries to interpret backslash sequences before regex sees them.
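To see what "interpreted before regex sees them" means in practice, here is a small sketch using \b, which Python reads as a backspace character in a normal string. The names plain and raw are illustrative.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
# In a raw string, r"\b" is a backslash plus "b", which regex reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))              # 5 7

plain_hits = re.findall(plain, "a cat sat")
raw_hits = re.findall(raw, "a cat sat")
print(plain_hits)    # [] (it looked for backspace characters around "cat")
print(raw_hits)      # ['cat']
```

No warning is raised for the plain version, because "\b" is a valid Python escape. The pattern just silently means something else.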

Metacharacters

Pattern	Meaning
`.`	Any single char (except newline)
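`\d`	Digit (`[0-9]`; for str patterns also other Unicode digits unless `re.ASCII` is set)
`\D`	Not digit
`\w`	Word char (`[a-zA-Z0-9_]`; for str patterns also Unicode letters and digits unless `re.ASCII` is set)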
`\W`	Not word char
`\s`	Whitespace
`\S`	Not whitespace
`\b`	Word boundary
`^`	Start of string
`$`	End of string

text = "Order 42 shipped on 2024-01-15"
print(re.findall(r"\d+", text))    # ['42', '2024', '01', '15']

print(re.split(r"\s+", "hello   world   python"))
# ['hello', 'world', 'python']

Quantifiers

Pattern	Meaning
`?`	0 or 1
`*`	0 or more
`+`	1 or more
`{n}`	exactly n
`{n,m}`	between n and m

text = "color colour colouur"

re.findall(r"colou?r", text)    # ['color', 'colour'] — 0 or 1 'u'
re.findall(r"colou+r", text)    # ['colour', 'colouur'] — 1 or more
re.findall(r"colou*r", text)    # ['color', 'colour', 'colouur']

re.findall(r"\b\d{3}\b", "1 22 333 4444")    # ['333'] — exactly 3
re.findall(r"\b\d{2,4}\b", "1 22 333 4444")  # ['22', '333', '4444']

\b is a word boundary — matches between a word char and a non-word char. Without it, \d{3} would match the first 3 digits of 4444.
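The difference is easy to check side by side. A quick sketch, with illustrative variable names:

```python
import re

numbers = "1 22 333 4444"

# Without \b, the pattern is free to match part of a longer run of digits.
without_boundary = re.findall(r"\d{3}", numbers)
# With \b on both sides, only standalone 3-digit runs qualify.
with_boundary = re.findall(r"\b\d{3}\b", numbers)

print(without_boundary)   # ['333', '444']
print(with_boundary)      # ['333']
```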

Character classes

re.findall(r"[a-z]+", "Hello World")      # ['ello', 'orld']
re.findall(r"[A-Z][a-z]+", "Hello World")  # ['Hello', 'World']

# Negation: ^
re.findall(r"[^0-9]+", "abc 123 xyz")      # ['abc ', ' xyz']

# Alternation: | matches either side
re.findall(r"cat|dog", "I have a cat and a dog")  # ['cat', 'dog']

[abc] matches one of a, b, or c. [^abc] matches anything except a, b, or c. Inside [], most metacharacters lose their special meaning — [.+] is literal dot and plus.
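A short sketch of how characters behave inside []. The hyphen is the one to watch: between two characters it forms a range, at the end of the class it is literal. Variable names are illustrative.

```python
import re

# Inside [], "." and "+" are literal characters.
literal_chars = re.findall(r"[.+]", "1.5 + 2.5")
print(literal_chars)     # ['.', '+', '.']

# A hyphen between two characters forms a range (a through c).
range_chars = re.findall(r"[a-c]", "a-b-c-d")
print(range_chars)       # ['a', 'b', 'c']

# A hyphen placed last in the class is a literal hyphen.
hyphen_literal = re.findall(r"[ac-]", "a-b-c-d")
print(hyphen_literal)    # ['a', '-', '-', 'c', '-']
```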

Groups: capturing parts of the match

text = "John is 30, Jane is 25"

# Use parentheses to capture
matches = re.findall(r"(\w+) is (\d+)", text)
print(matches)
# [('John', '30'), ('Jane', '25')]

# Or with search:
m = re.search(r"(\w+) is (\d+)", text)
print(m.group())     # 'John is 30' — full match
print(m.group(1))    # 'John' — first group
print(m.group(2))    # '30' — second group
print(m.groups())    # ('John', '30') — all groups

(...) captures. findall returns a list of tuples when the pattern has two or more groups; with exactly one group it returns a list of that group's strings. search returns a Match; .group(i) for the i-th capture, .group() or .group(0) for the whole match.
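What findall returns depends on how many groups the pattern has. A sketch comparing zero, one, and two groups on the same text (variable names are illustrative):

```python
import re

text = "John is 30, Jane is 25"

# No groups: a list of whole matches.
no_groups = re.findall(r"\w+ is \d+", text)
print(no_groups)     # ['John is 30', 'Jane is 25']

# One group: a list of just that group's text.
one_group = re.findall(r"\w+ is (\d+)", text)
print(one_group)     # ['30', '25']

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+) is (\d+)", text)
print(two_groups)    # [('John', '30'), ('Jane', '25')]
```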

Named groups

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
m = re.search(pattern, "Today is 2024-03-15")

print(m.group("year"))    # '2024'
print(m.group("month"))   # '03'
print(m.group("day"))     # '15'
print(m.groupdict())      # {'year': '2024', 'month': '03', 'day': '15'}

(?P<name>...) names a group. More readable for complex patterns; result is a dict via groupdict.
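Named groups pair well with finditer when a text holds several matches. A sketch, assuming you want each date as a dict of ints; the sample text and the dates variable are made up for illustration.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
text = "Ordered 2024-03-15, shipped 2024-03-18"

# groupdict() gives {'year': '2024', ...}; convert each value to int.
dates = [
    {name: int(value) for name, value in m.groupdict().items()}
    for m in re.finditer(pattern, text)
]
print(dates)
# [{'year': 2024, 'month': 3, 'day': 15}, {'year': 2024, 'month': 3, 'day': 18}]
```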

Substitution: re.sub

text = "Call 555-1234 or 555-5678"
cleaned = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", text)
print(cleaned)    # 'Call XXX-XXXX or XXX-XXXX'

re.sub(pattern, replacement, text) replaces all matches.

The replacement can reference groups:

# Reorder: "First Last" → "Last, First"
result = re.sub(r"(\w+) (\w+)", r"\2, \1", "John Smith")
print(result)    # 'Smith, John'

\1, \2 reference captured groups. Or use a function:

result = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "1 2 3")
print(result)    # '2 4 6'

The function receives the Match object and returns the replacement string.
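Two more re.sub features that fit here, shown as a sketch: \g<name> refers to a named group in the replacement, and the count argument limits how many matches are replaced. The sample strings are illustrative.

```python
import re

# \g<name> refers to a named group inside the replacement template.
iso = "2024-03-15"
us = re.sub(
    r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})",
    r"\g<m>/\g<d>/\g<y>",
    iso,
)
print(us)            # '03/15/2024'

# count=1 replaces only the first match and leaves the rest alone.
first_only = re.sub(r"\d{3}-\d{4}", "XXX-XXXX", "Call 555-1234 or 555-5678", count=1)
print(first_only)    # 'Call XXX-XXXX or 555-5678'
```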

Compile for reuse

pattern = re.compile(r"\b[A-Z][a-z]+\b")

names = pattern.findall("Alice met Bob and Charlie")
result = pattern.sub("NAME", "Alice met Bob")

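re.compile(pattern) returns a Pattern object with the same methods (search, findall, sub, and so on). The re module already caches recently used patterns, so the speed gain is usually small; the bigger wins are a named, reusable pattern and no repeated cache lookups. Use it when you'll match the same pattern many times, especially in loops.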
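A sketch of the loop case: compile once outside the loop, then call methods on the Pattern object. The lines list is sample data made up for illustration.

```python
import re

# Compile once, outside the loop.
name_pattern = re.compile(r"\b[A-Z][a-z]+\b")

lines = ["Alice met Bob and Charlie", "nobody here", "Dana called"]

# The Pattern object has the same methods as the re module functions.
names_per_line = [name_pattern.findall(line) for line in lines]
print(names_per_line)    # [['Alice', 'Bob', 'Charlie'], [], ['Dana']]

masked = name_pattern.sub("NAME", "Alice met Bob")
print(masked)            # 'NAME met NAME'
```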

Anchors and lookarounds

# Start/end
re.findall(r"^\w+", "Hello world")        # ['Hello']
re.findall(r"\w+$", "Hello world")        # ['world']

# Lookahead: (?=...) — match if followed by, but don't consume
re.findall(r"\w+(?= dollars)", "100 dollars 200 euros")
# ['100']

# Negative lookahead: (?!...)
re.findall(r"\w+(?! dollars)", "100 dollars 200 euros")
# ['10', 'dollars', '200', 'euros'] (backtracking lets '10' match; careful with these)

Lookarounds are zero-width assertions — they check but don't consume characters. Use sparingly; they make patterns hard to read.
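Negative lookahead surprises people because the engine backtracks until the assertion passes. A sketch showing the naive pattern next to one possible fix, where \b stops \w+ or \d+ from giving back characters (variable names are illustrative):

```python
import re

text = "100 dollars 200 euros"

# Naive: \w+ shrinks from "100" to "10", and "10" is not followed by " dollars".
naive = re.findall(r"\w+(?! dollars)", text)
print(naive)       # ['10', 'dollars', '200', 'euros']

# Anchored: \b forces whole numbers, so "100" is rejected outright.
anchored = re.findall(r"\b\d+\b(?! dollars)", text)
print(anchored)    # ['200']
```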

Flags

re.findall(r"hello", "Hello WORLD hello", re.IGNORECASE)
# ['Hello', 'hello']

re.findall(r"^line", text, re.MULTILINE)    # ^ matches start of any line

re.search(r"""
    foo     # comments run to the end of the line
    .*bar   # so each piece sits on its own line
    """, text, re.VERBOSE)    # whitespace and comments in the pattern are ignored

Common flags:

re.IGNORECASE — case insensitive.
re.MULTILINE — ^ and $ match at line boundaries.
re.DOTALL — . matches newlines too.
re.VERBOSE — let the pattern have whitespace and # comments for readability.

Combine with |: re.IGNORECASE | re.MULTILINE.
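A sketch of flags in action on a small multi-line string (the sample text is made up): two flags combined with |, then re.DOTALL on its own.

```python
import re

text = "Line one\nline two\nLINE three"

# IGNORECASE | MULTILINE: ^ matches at each line start, in any case.
starts = re.findall(r"^line", text, re.IGNORECASE | re.MULTILINE)
print(starts)                 # ['Line', 'line', 'LINE']

# Without DOTALL, "." stops at the newline, so nothing spans two lines.
no_dotall = re.search(r"one.*two", text)
print(no_dotall)              # None

# With DOTALL, "." also matches the newline.
with_dotall = re.search(r"one.*two", text, re.DOTALL)
print(with_dotall.group())    # 'one\nline two'
```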

Greedy vs lazy

text = "<b>bold</b> and <i>italic</i>"
re.findall(r"<.+>", text)
# ['<b>bold</b> and <i>italic</i>']  — greedy, matched entire string

re.findall(r"<.+?>", text)
# ['<b>', '</b>', '<i>', '</i>']  — lazy, minimal match

Quantifiers are greedy by default — match as much as possible. Add ? to make them lazy — match as little as possible.

For HTML-like content, lazy is usually what you want. (Better still, use a real parser like BeautifulSoup — regex doesn't handle nesting.)
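Another common way to get the same result is a negated character class instead of a lazy quantifier. This is a sketch of the alternative, not a recommendation over a real parser:

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Lazy: as few characters as possible up to the next ">".
lazy = re.findall(r"<.+?>", text)

# Negated class: one or more characters that are not ">", then ">".
negated = re.findall(r"<[^>]+>", text)

print(lazy)       # ['<b>', '</b>', '<i>', '</i>']
print(negated)    # ['<b>', '</b>', '<i>', '</i>']
```

The negated class states the intent directly (stop at the first >) and, unlike . without re.DOTALL, it also matches across newlines inside a tag.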

A text processor

log_line = "2024-03-15 14:32:01 [ERROR] Failed to connect: timeout"

m = re.search(
  r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<msg>.+)",
  log_line
)
print(m.groupdict())
# {'date': '2024-03-15', 'time': '14:32:01', 'level': 'ERROR', 'msg': 'Failed to connect: timeout'}

Named groups + structured pattern = mini parser.
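Real log files contain lines that don't fit the pattern, so a parser needs to handle None. A sketch that extends the example above; LOG_PATTERN, parse_line, and the sample lines are hypothetical names and data for illustration.

```python
import re

LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>\w+)\] (?P<msg>.+)"
)

lines = [
    "2024-03-15 14:32:01 [ERROR] Failed to connect: timeout",
    "not a log line",
    "2024-03-15 14:32:05 [INFO] Retrying",
]

def parse_line(line):
    """Return the fields as a dict, or None if the line doesn't fit."""
    m = LOG_PATTERN.match(line)   # match: the date must be at the start
    return m.groupdict() if m else None

# Keep only the lines that parsed.
records = [r for r in map(parse_line, lines) if r is not None]
levels = [r["level"] for r in records]
print(levels)    # ['ERROR', 'INFO']
```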

Common stumbles

Forgetting r"...". Backslash sequences get garbled. Always raw strings for patterns.

Greedy when you wanted lazy. <.+> over-matches. Use <.+?>.

Using regex for nested structures. HTML, JSON, XML — don't. Use a real parser. Regex can't count balanced parens.

re.match vs re.search. match requires start of string. search matches anywhere. Beginners mix them up.

Anchors apply to the whole string by default. ^abc$ anchors to the start and end of the entire string; for line-by-line matching, add re.MULTILINE.

Special chars in []. Most metacharacters lose their meaning inside a character class, so [.+] is literal. But ], \, ^ (at the start), and - (between two characters) still need escaping.

Forgetting to escape the dot. r"3.14" matches 3X14 too. Use r"3\.14".
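The dot problem from the last item, as a quick sketch. It also shows re.escape, a standard-library helper that escapes a literal string for you; whether you escape by hand or with re.escape is a matter of taste.

```python
import re

sample = "3.14 3X14"

unescaped = re.findall(r"3.14", sample)
print(unescaped)    # ['3.14', '3X14'] because "." matches any character

escaped = re.findall(r"3\.14", sample)
print(escaped)      # ['3.14']

# re.escape builds the escaped pattern from a plain string.
literal = "3.14"
auto_escaped = re.findall(re.escape(literal), sample)
print(auto_escaped) # ['3.14']
```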

What's next

Lesson 27: virtual environments and pip. venv, pip install, requirements.txt, the project setup workflow.

Recap

re.search for first match, re.findall for all, re.sub for replace, re.compile for reuse. Always raw strings (r"..."). Metacharacters: \d \w \s . * + ? {n,m} ^ $ \b. [abc] and [^abc] for character classes. (...) to capture; (?P<name>...) to name. ? after +/* for lazy. Flags via re.IGNORECASE, re.MULTILINE, re.VERBOSE. Don't use regex for nested formats.
