Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33
Video: Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33, taught by Celeste AI - AI Coding Coach
Python Web Scraping with BeautifulSoup
pip install beautifulsoup4 requests. BeautifulSoup(html, "html.parser") parses HTML. .find(), .find_all() for tag lookup; .select(), .select_one() for CSS selectors. Always check the site's robots.txt and terms: scraping has legal and ethical limits.
When a site doesn't have an API, scraping pulls structured data out of its HTML. BeautifulSoup makes the parsing easy.
Setup
pip install requests beautifulsoup4 lxml
requests fetches the page. beautifulsoup4 parses it. lxml is a fast parser backend (recommended).
Fetch and parse
import requests
from bs4 import BeautifulSoup
url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text) # "Quotes to Scrape"
response.content (bytes) is preferred over .text — BeautifulSoup handles encoding from there.
"html.parser" is the built-in. For better speed and HTML5 support, use "lxml" or "html5lib".
find() and find_all()
# First match
quote = soup.find("div", class_="quote")
# All matches
quotes = soup.find_all("div", class_="quote")
print(f"Found {len(quotes)} quotes")
# By id
header = soup.find(id="main-header")
# By multiple attributes
links = soup.find_all("a", attrs={"class": "external", "data-track": "yes"})
Note class_ (with trailing underscore) — class is a Python keyword.
find returns one element or None. find_all returns a list.
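Because find can return None, chaining .text straight onto it is risky. A small guard helper is one way to handle that; the sketch below runs on an inline sample snippet, and the name text_or_default is an assumption, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup with no <small class="author"> element on purpose.
html = """
<div class="quote">
  <span class="text">Sample quote.</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
quote = soup.find("div", class_="quote")


def text_or_default(parent, name, class_name, default=""):
    """Return the stripped text of the first match, or default if none."""
    tag = parent.find(name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


text = text_or_default(quote, "span", "text")
author = text_or_default(quote, "small", "author", default="unknown")
print(text, "-", author)
```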
Extracting data from a tag
quote = soup.find("div", class_="quote")
text = quote.find("span", class_="text").text
author = quote.find("small", class_="author").text
print(f"{text} - {author}")
# All child links
for a in quote.find_all("a"):
    print(a.get("href"))
- .text: visible text inside the tag (recursively).
- .get("attr") or tag["attr"]: attribute value.
- .string: text only if exactly one child text node (otherwise None).
- .contents: list of immediate children.
- .find_parent, .find_next_sibling, etc.: DOM navigation.
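To see how these accessors differ, it can help to run them side by side on a tiny document. The markup below is invented for the demonstration.

```python
from bs4 import BeautifulSoup

html = '<p id="intro">Hello <a href="/about" class="nav">about <b>us</b></a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
a = soup.find("a")

print(p.text)             # all nested text joined together
print(a["href"])          # attribute value; raises KeyError if missing
print(a.get("title"))     # .get() returns None instead of raising
print(p.string)           # None, because <p> has more than one child
print(a.b.string)         # <b> has exactly one text child
print(len(p.contents))    # immediate children only: a text node and <a>
print(a.find_parent("p")["id"])
```

Note the difference between tag["attr"] and .get("attr"): the first fails loudly on a missing attribute, the second lets you handle the gap yourself.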
CSS selectors: select() and select_one()
books = soup.select("article.product_pod") # all matches
first = soup.select_one("article.product_pod") # first match
for book in books[:5]:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating_cls = book.select_one(".star-rating")["class"][1]
CSS selectors are usually more concise than chained find calls.
| Selector | Meaning |
|---|---|
| tag | element by name |
| .class | element with class |
| #id | element with id |
| tag.class | both |
| parent > child | direct child |
| parent child | descendant |
| [attr=value] | by attribute |
| tag:nth-of-type(2) | second of type |
For complex selections, CSS selectors are usually more readable.
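The selector forms in the table can be tried out on a small inline document. This is an illustrative sketch: the markup, the data-stock attribute, and the titles helper are all invented for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="catalog">
  <article class="product_pod" data-stock="yes"><h3><a title="Book A">A</a></h3></article>
  <article class="product_pod" data-stock="no"><h3><a title="Book B">B</a></h3></article>
  <section><article class="featured"><h3><a title="Book C">C</a></h3></article></section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def titles(selector):
    """Collect the title attribute of every <a> under the matched elements."""
    return [a["title"] for el in soup.select(selector) for a in el.select("a")]


print(titles("article"))                 # tag
print(titles(".product_pod"))            # class
print(titles("#catalog > article"))      # direct children only
print(titles("#catalog article"))        # any descendant
print(titles('[data-stock="yes"]'))      # attribute value
print(titles("article:nth-of-type(2)"))  # second <article> among its siblings
```

The difference between the child and descendant forms shows up with Book C, which sits inside a section and is therefore matched only by the descendant selector.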
A complete book scraper
import requests
from bs4 import BeautifulSoup
response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")
books = soup.select("article.product_pod")
print(f"Found {len(books)} books")
for book in books:
    title = book.select_one("h3 a")["title"]
    price = book.select_one(".price_color").text
    rating = book.select_one(".star-rating")["class"][1]
    print(f"{rating:5s} | {price:7s} | {title}")
# Aggregate
prices = [
    float(b.select_one(".price_color").text[1:])
    for b in books
]
print(f"Average price: £{sum(prices) / len(prices):.2f}")
The whole scraper in under 20 lines. Real scrapers handle pagination, errors, and rate limiting.
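For the error-handling part, one common approach is a requests Session with automatic retries, a timeout, and a status check. The sketch below is one possible setup, not the lesson's own code; the retry count, backoff, and status list are assumed values, and make_session and fetch are hypothetical helper names.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent="myapp/1.0"):
    """Build a Session that retries transient failures with backoff."""
    retry = Retry(
        total=3,                                     # assumed retry budget
        backoff_factor=1,                            # wait longer after each failure
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these statuses
    )
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def fetch(session, url):
    """Return the page bytes, raising requests.HTTPError on 4xx/5xx."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.content


session = make_session()
# html = fetch(session, "https://books.toscrape.com/")  # uncomment to hit the network
```

Without a timeout, requests can wait indefinitely on a stalled server, so passing one on every call is a sensible default.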
Handling pagination
import requests
from bs4 import BeautifulSoup
base = "https://quotes.toscrape.com"
url = base + "/"
all_quotes = []
while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    for q in soup.find_all("div", class_="quote"):
        all_quotes.append({
            "text": q.find("span", class_="text").text,
            "author": q.find("small", class_="author").text,
        })
    next_link = soup.find("li", class_="next")
    url = base + next_link.find("a")["href"] if next_link else None
print(f"Scraped {len(all_quotes)} quotes")
Loop until no "next" link. Append to a master list. Standard pattern.
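The same loop can be sketched with two extra safeguards: urljoin to resolve the next link whatever form the href takes, and a page cap plus a visited set so a looping "next" link cannot run forever. The function scrape_all, its max_pages parameter, and the injected fetch callable are assumptions for this example; the fake pages let it run without a network.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def scrape_all(start_url, fetch, max_pages=100):
    """Follow "next" links from start_url; fetch(url) must return HTML."""
    results, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        for q in soup.find_all("div", class_="quote"):
            results.append(q.find("span", class_="text").text)
        next_link = soup.select_one("li.next a")
        # urljoin resolves relative and absolute hrefs against the current page
        url = urljoin(url, next_link["href"]) if next_link else None
    return results


# Two fake pages stand in for the network so the sketch runs offline.
pages = {
    "https://example.com/": (
        '<div class="quote"><span class="text">one</span></div>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "https://example.com/page/2/": (
        '<div class="quote"><span class="text">two</span></div>'
    ),
}
print(scrape_all("https://example.com/", pages.__getitem__))
```

In real use, fetch would wrap requests.get and return response.content.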
Be polite
import time
for url in urls:
    response = requests.get(url, headers={"User-Agent": "myapp/1.0"})
    time.sleep(1) # be nice: don't hammer the server
Rules of polite scraping:
- Set a User-Agent. Identify your scraper. Don't pretend to be a browser.
- Throttle. time.sleep(1) between requests. Faster scraping wastes their bandwidth.
- Cache locally during development. Don't re-fetch the same page 50 times while debugging.
- Respect robots.txt. If Disallow: /scrape, don't scrape that path.
- Check the ToS. Many sites prohibit scraping. Some allow it but require attribution.
- Use the API if there is one. Structured data, rate limits, lower load.
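The robots.txt rule can be checked in code with the standard library's urllib.robotparser. The sketch below parses sample rules inline so it runs offline; the rules and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, inlined so the sketch runs offline. Against a real site,
# call rp.set_url("https://example.com/robots.txt") and rp.read() instead.
robots_txt = """\
User-agent: *
Disallow: /scrape
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

blocked = rp.can_fetch("myapp/1.0", "https://example.com/scrape/page1")
allowed = rp.can_fetch("myapp/1.0", "https://example.com/quotes")
delay = rp.crawl_delay("myapp/1.0")
print(blocked, allowed, delay)
```

If the site declares a crawl delay, using it as the time.sleep value is a reasonable way to set your throttle.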
Robust selectors
Sites change their HTML. Selectors that break easily:
# Fragile
soup.select_one("body > div:nth-child(3) > div > article > div > h2")
# Better
soup.select_one(".product-card .product-name")
# Best
soup.select_one('[data-test="product-name"]')
Use semantic class names or data-* attributes when available. Avoid relying on positional selectors.
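When you cannot be sure which identifier a page will carry, one option is to try selectors from most stable to least stable. This is a sketch under that assumption; select_first is a hypothetical helper and the markup is a made-up sample.

```python
from bs4 import BeautifulSoup


def select_first(soup, selectors):
    """Return the first element matched by any selector, tried in order."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element
    return None


html = '<div class="product-card"><h2 class="product-name">Widget</h2></div>'
soup = BeautifulSoup(html, "html.parser")

name = select_first(soup, [
    '[data-test="product-name"]',   # most stable, absent in this sample
    ".product-card .product-name",  # semantic class names
])
print(name.text if name else "not found")
```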
Headers and cookies
session = requests.Session()
session.headers.update({
"User-Agent": "myapp/1.0",
"Accept-Language": "en-US,en;q=0.9",
})
# Login (if needed)
session.post("https://example.com/login", data={"user": "...", "pass": "..."})
# Now scrape with the logged-in session
response = session.get("https://example.com/profile")
A session keeps cookies across requests — like a logged-in browser tab.
JavaScript-rendered pages
BeautifulSoup parses HTML — it doesn't run JavaScript. If the data is loaded dynamically by JS:
# requests + soup gets you the empty shell:
soup.find("div", id="results") # empty in source
# Real options:
# 1. Find the underlying API the JS is calling (often easier).
# 2. Use Selenium or Playwright — real browser automation.
# 3. Use requests-html or similar JS-aware scrapers.
Many modern sites render their content with JavaScript. Open the Network tab in DevTools: often the JS just calls a JSON API you can hit directly.
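Once you have found such an endpoint, no HTML parsing is needed at all. The sketch below works on a hypothetical response body (the field names results, name, price, and next_page are invented); with requests you would get the same dictionary from response.json().

```python
import json

# Hypothetical response body, as you might copy it from the Network tab.
payload = '{"results": [{"name": "Widget", "price": "12.99"}], "next_page": 2}'

data = json.loads(payload)
rows = [(item["name"], float(item["price"])) for item in data["results"]]
next_page = data.get("next_page")  # None would mean there are no more pages
print(rows, next_page)
```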
Cleaning extracted data
import re
text = quote.text.strip() # trim leading/trailing whitespace
text = " ".join(text.split()) # collapse internal whitespace
text = text.replace("“", '"').replace("”", '"') # smart quotes → ASCII
price_text = "$12.99 each"
price = re.search(r"\d+\.\d+", price_text).group() # "12.99"
Scraped text is messy. Strip whitespace. Normalize Unicode. Use regex to extract numbers from "$12.99 each".
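These cleanup steps can be bundled into two small functions that also cope with missing numbers. This is a minimal sketch; clean_text and parse_price are hypothetical names, and stripping commas assumes they are thousands separators.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize Unicode, straighten curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    for curly, straight in {"“": '"', "”": '"', "‘": "'", "’": "'"}.items():
        text = text.replace(curly, straight)
    return " ".join(text.split())


def parse_price(raw):
    """Return the first decimal number in raw as a float, or None."""
    # Assumes commas are thousands separators, as in "1,299.00".
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    return float(match.group()) if match else None


print(clean_text("  “Hello,\n   world”\u00a0 "))
print(parse_price("$12.99 each"))
print(parse_price("sold out"))
```

Returning None for text with no number avoids the AttributeError you would get from calling .group() on a failed re.search.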
Saving the results
import csv
import json
# all_quotes is the list of {"text", "author"} dicts built in the pagination loop
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "text"])
    writer.writeheader()
    writer.writerows(all_quotes)
# Or JSON
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(all_quotes, f, indent=2, ensure_ascii=False)
CSV for spreadsheet tools, JSON for further programmatic use, SQLite for serious work.
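For the SQLite option, the standard library's sqlite3 module is enough. The sketch below uses an in-memory database and placeholder rows; the table layout is an assumption made for this example.

```python
import sqlite3

# Placeholder rows in the same shape as the scraped quote dicts.
quotes = [
    {"author": "Author A", "text": "First sample quote."},
    {"author": "Author B", "text": "Second sample quote."},
]

# ":memory:" keeps the demo self-contained; pass a filename to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS quotes (author TEXT NOT NULL, text TEXT NOT NULL)"
)
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO quotes (author, text) VALUES (:author, :text)",
        quotes,
    )

count = conn.execute("SELECT COUNT(*) FROM quotes").fetchone()[0]
authors = [row[0] for row in conn.execute("SELECT author FROM quotes ORDER BY author")]
print(count, authors)
conn.close()
```

Named placeholders (:author, :text) let you pass the scraped dicts directly and keep values out of the SQL string.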
Common stumbles
class keyword. Use class_="..." in find_all. CSS selectors don't have this issue.
.text on None. When .find() returns None (no match), .text raises AttributeError. Check first or wrap in try.
Site requires JS. BeautifulSoup sees only the initial HTML. Use Playwright/Selenium or find the underlying API.
Aggressive scraping. Hammering a site → IP ban or worse. Throttle. Identify yourself.
Brittle selectors. Position-based or auto-generated class names break. Prefer stable identifiers.
Encoding mistakes. Use response.content (bytes) and let BeautifulSoup detect encoding. Or set response.encoding = "utf-8" if needed.
Ignoring robots.txt. Legally a gray area; ethically, clearly wrong if the path is disallowed. Check before scraping.
Storing without dedup. Multi-page scraping can pick up the same item twice. Track IDs.
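Tracking IDs for deduplication can be as simple as a set of keys already seen. This is a sketch with made-up records; the dedupe helper and the id field are assumptions, so use whatever uniquely identifies an item on your target site.

```python
def dedupe(items, key):
    """Keep the first item for each key(item), preserving order."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


scraped = [
    {"id": "q1", "text": "sample one"},
    {"id": "q2", "text": "sample two"},
    {"id": "q1", "text": "sample one"},  # same item seen on a later page
]
unique = dedupe(scraped, key=lambda item: item["id"])
print([item["id"] for item in unique])
```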
What's next
Lesson 34: SQLite Part 1. Embedded SQL database — sqlite3 module, connect, execute, fetchall.
Recap
BeautifulSoup(html, "html.parser") to parse. find() / find_all() for tag-based lookup; select() / select_one() for CSS selectors. .text for visible text, ["attr"] for attributes. Be polite: User-Agent, throttle, respect robots.txt. For JS-rendered sites, find the API or use a browser automation tool. Always handle missing elements (.find can return None).