Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33

Sandy Lane

Video: Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33, taught by Celeste AI - AI Coding Coach

Python Web Scraping with BeautifulSoup

pip install beautifulsoup4 requests. BeautifulSoup(html, "html.parser") parses HTML. .find(), .find_all() for tag lookup; .select(), .select_one() for CSS selectors. Always check the site's robots.txt and terms — scraping has legal and ethical limits.

When a site doesn't have an API, scraping pulls structured data out of its HTML. BeautifulSoup makes the parsing easy.

Setup

pip install requests beautifulsoup4 lxml

requests fetches the page. beautifulsoup4 parses it. lxml is a fast parser backend (recommended).

Fetch and parse

import requests
from bs4 import BeautifulSoup

url = "https://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

print(soup.title.text)   # "Quotes to Scrape"

response.content (bytes) is preferred over .text — BeautifulSoup handles encoding from there.
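
To make the bytes-versus-text point concrete, here is a minimal sketch that parses a hand-written byte string instead of a live response. The sample markup is made up for illustration; it only stands in for what response.content would hold.

```python
from bs4 import BeautifulSoup

# Made-up page, encoded as UTF-8 bytes the way response.content would deliver it.
raw = '<html><head><meta charset="utf-8"><title>Café menu</title></head></html>'.encode("utf-8")

# Given bytes, BeautifulSoup works out the encoding itself (here from the meta tag).
soup = BeautifulSoup(raw, "html.parser")
title = soup.title.text
print(title)   # Café menu
```

If the same page were decoded with the wrong encoding before parsing, the accented character would come out garbled, which is the failure that passing bytes avoids.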
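
"html.parser" is the built-in parser and needs no extra install. For speed, use "lxml"; for the most browser-like handling of messy HTML5, use "html5lib", which is slower. Both must be installed with pip.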

find() and find_all()

# First match
quote = soup.find("div", class_="quote")

# All matches
quotes = soup.find_all("div", class_="quote")
print(f"Found {len(quotes)} quotes")

# By id
header = soup.find(id="main-header")

# By multiple attributes
links = soup.find_all("a", attrs={"class": "external", "data-track": "yes"})

Note class_ (with trailing underscore) — class is a Python keyword.

find returns one element or None. find_all returns a list.
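
The difference in return values matters as soon as an element is missing. The sketch below uses a small made-up HTML string, and text_or_default is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<div class="quote"><span class="text">First</span></div>
<div class="quote"><span class="text">Second</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

missing = soup.find("div", class_="author-bio")          # no such element -> None
none_found = soup.find_all("div", class_="author-bio")   # no matches -> empty list
quotes = soup.find_all("div", class_="quote")

def text_or_default(tag, default=""):
    """Return the tag's stripped text, or a default when find() gave None."""
    return tag.get_text(strip=True) if tag is not None else default

print(missing is None)                    # True
print(len(none_found))                    # 0
print(len(quotes))                        # 2
print(text_or_default(missing, "n/a"))    # n/a
```

Looping over an empty find_all result is harmless, while calling .text on a None from find raises AttributeError, so the guard is only needed for find.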

Extracting data from a tag

quote = soup.find("div", class_="quote")

text = quote.find("span", class_="text").text
author = quote.find("small", class_="author").text
print(f"{text} - {author}")

# All child links
for a in quote.find_all("a"):
  print(a.get("href"))

  • .text: visible text inside the tag (recursively).
  • .get("attr") or tag["attr"]: attribute value.
  • .string: text only if exactly one child text node (otherwise None).
  • .contents: list of immediate children.
  • .find_parent, .find_next_sibling, etc.: DOM navigation.
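
A quick way to get a feel for these accessors is to run them against one small tag. The HTML below is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<p id="intro">Hello <b>world</b> <a href="/about" class="nav main">About</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
a = p.find("a")

print(p.text)                     # Hello world About
print(p.string)                   # None, because <p> has several children
print(p.b.string)                 # world
print(a["href"])                  # /about
print(a.get("title"))             # None; a["title"] would raise KeyError
print(a["class"])                 # ['nav', 'main'], class is multi-valued
print(len(p.contents))            # 4
print(a.find_parent("p")["id"])   # intro
```

Use .get("attr") when the attribute may be absent and tag["attr"] when a missing attribute should be treated as an error.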

CSS selectors: select() and select_one()

books = soup.select("article.product_pod")        # all matches
first = soup.select_one("article.product_pod")    # first match

for book in books[:5]:
  title = book.select_one("h3 a")["title"]
  price = book.select_one(".price_color").text
  rating_cls = book.select_one(".star-rating")["class"][1]

CSS selectors are usually more concise than chained find calls.

Selector             Meaning
tag                  element by name
.class               element with class
#id                  element with id
tag.class            element with both that name and that class
parent > child       direct child
parent child         descendant
[attr=value]         element whose attribute equals value
tag:nth-of-type(2)   second element of that type among its siblings

For complex selections, CSS selectors are usually more readable.
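
Each selector form from the table can be tried against a small document. The catalog markup here is invented for the example and is not taken from a real site.

```python
from bs4 import BeautifulSoup

html = """
<div id="catalog">
  <article class="book"><h3><a href="/b/1" title="Alpha">Alpha</a></h3></article>
  <article class="book sale"><h3><a href="/b/2" title="Beta">Beta</a></h3></article>
  <p class="book">Not an article</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("article")))                        # 2     tag
print(len(soup.select(".book")))                          # 3     .class
print(soup.select_one("#catalog").name)                   # div   #id
print(len(soup.select("article.book")))                   # 2     tag.class
print(len(soup.select("#catalog > a")))                   # 0     links are not direct children
print(len(soup.select("#catalog a")))                     # 2     descendant
print(soup.select_one('[title="Beta"]')["href"])          # /b/2  [attr=value]
print(soup.select_one("article:nth-of-type(2) a").text)   # Beta  second article
```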

A complete book scraper

import requests
from bs4 import BeautifulSoup

response = requests.get("https://books.toscrape.com/")
soup = BeautifulSoup(response.content, "html.parser")

books = soup.select("article.product_pod")
print(f"Found {len(books)} books")

for book in books:
  title = book.select_one("h3 a")["title"]
  price = book.select_one(".price_color").text
  rating = book.select_one(".star-rating")["class"][1]
  print(f"{rating:5s} | {price:7s} | {title}")

# Aggregate
prices = [
  float(b.select_one(".price_color").text[1:])
  for b in books
]
print(f"Average price: £{sum(prices) / len(prices):.2f}")

The whole scraper fits in under 20 lines. Real scrapers also handle pagination, errors, and rate limiting.
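
One possible shape for the error-handling part is a small retry wrapper around requests.get. This is a sketch under assumptions: fetch is a hypothetical helper, and the retry count, delay, and timeout are illustrative values rather than recommendations.

```python
import time
import requests

def fetch(url, get=requests.get, attempts=3, delay=1.0):
    """Return a successful response, retrying failed requests a few times."""
    for attempt in range(1, attempts + 1):
        try:
            response = get(url, timeout=10)
            response.raise_for_status()      # turn 4xx/5xx into an exception
            return response
        except requests.RequestException:
            if attempt == attempts:
                raise                        # out of retries: let the caller see it
            time.sleep(delay)                # pause before the next try
```

In the scraper above, response = fetch("https://books.toscrape.com/") would replace the bare requests.get call. The get parameter exists only so the function can be exercised without touching the network.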

Handling pagination

import requests
from bs4 import BeautifulSoup

base = "https://quotes.toscrape.com"
url = base + "/"
all_quotes = []

while url:
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html.parser")

  for q in soup.find_all("div", class_="quote"):
    all_quotes.append({
      "text": q.find("span", class_="text").text,
      "author": q.find("small", class_="author").text,
    })

  next_link = soup.find("li", class_="next")
  url = base + next_link.find("a")["href"] if next_link else None

print(f"Scraped {len(all_quotes)} quotes")

Loop until no "next" link. Append to a master list. Standard pattern.
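
Concatenating base and href works here because this site uses root-relative links. A slightly more general sketch resolves the link with urljoin from the standard library; next_page_url is a hypothetical helper, and the tiny HTML fragments stand in for real pages.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    link = soup.select_one("li.next a")
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])

page = BeautifulSoup('<ul><li class="next"><a href="/page/2/">Next</a></li></ul>', "html.parser")
last = BeautifulSoup("<ul></ul>", "html.parser")

print(next_page_url(page, "https://quotes.toscrape.com/"))
# https://quotes.toscrape.com/page/2/
print(urljoin("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"))
# https://books.toscrape.com/catalogue/page-3.html
print(next_page_url(last, "https://quotes.toscrape.com/page/10/"))
# None
```

The second call shows the case plain concatenation gets wrong: a link relative to the current directory rather than to the site root.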

Be polite

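import time
import requests

urls = ["https://quotes.toscrape.com/page/1/", "https://quotes.toscrape.com/page/2/"]

for url in urls:
  response = requests.get(url, headers={"User-Agent": "myapp/1.0"}, timeout=10)
  time.sleep(1)    # be nice: don't hammer the server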

Rules of polite scraping:

  • Set a User-Agent. Identify your scraper. Don't pretend to be a browser.
  • Throttle. Call time.sleep(1) between requests. Unthrottled scraping eats the site's bandwidth and can get you blocked.
  • Cache locally during development. Don't re-fetch the same page 50 times while debugging.
  • Respect robots.txt. If Disallow: /scrape, don't scrape that path.
  • Check the ToS. Many sites prohibit scraping. Some allow it but require attribution.
  • Use the API if there is one. You get structured data and documented rate limits, and you put less load on the site.
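
The robots.txt rule can be checked in code with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt string so it runs offline; example.com and the paths are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real scraper would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /scrape
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

blocked = parser.can_fetch("myapp/1.0", "https://example.com/scrape/page1")
allowed = parser.can_fetch("myapp/1.0", "https://example.com/quotes")
print(blocked)   # False
print(allowed)   # True
```

Against a live site, parser.set_url("https://example.com/robots.txt") followed by parser.read() loads the real file instead of the inline string.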

Robust selectors

Sites change their HTML, and some selectors break far more easily than others:

# Fragile
soup.select_one("body > div:nth-child(3) > div > article > div > h2")

# Better
soup.select_one(".product-card .product-name")

# Best
soup.select_one('[data-test="product-name"]')

Use semantic class names or data-* attributes when available. Avoid relying on positional selectors.
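
To see the difference, the sketch below runs a positional selector and a data-* selector against two versions of the same made-up page. The only change in the second version is a new banner at the top.

```python
from bs4 import BeautifulSoup

product = ('<div><article><div>'
           '<h2 class="product-name" data-test="product-name">Widget</h2>'
           '</div></article></div>')
before = f"<body><div>nav</div><div>ads</div><div>{product}</div></body>"
# Redesign: a cookie banner is added first, shifting every position by one.
after = f"<body><div>cookies</div><div>nav</div><div>ads</div><div>{product}</div></body>"

positional = "body > div:nth-child(3) > div > article > div > h2"
stable = '[data-test="product-name"]'

results = {}
for label, html in (("before", before), ("after", after)):
    soup = BeautifulSoup(html, "html.parser")
    results[label] = (soup.select_one(positional) is not None,
                      soup.select_one(stable) is not None)
    print(label, results[label])
# before (True, True)
# after (False, True)
```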

Headers and cookies

session = requests.Session()
session.headers.update({
  "User-Agent": "myapp/1.0",
  "Accept-Language": "en-US,en;q=0.9",
})

# Login (if needed)
session.post("https://example.com/login", data={"user": "...", "pass": "..."})

# Now scrape with the logged-in session
response = session.get("https://example.com/profile")

A session keeps cookies across requests — like a logged-in browser tab.
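
What the session carries along can be inspected without sending anything. In this sketch the cookie is set by hand as a stand-in for what a real login response would provide; the cookie name and value are made up.

```python
import requests

session = requests.Session()
session.headers.update({"User-Agent": "myapp/1.0"})
# Stand-in for the cookie a real login response would set.
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# prepare_request builds the outgoing request without sending it.
request = requests.Request("GET", "https://example.com/profile")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])   # myapp/1.0
print(prepared.headers.get("Cookie"))   # sessionid=abc123
```

Every session.get or session.post call merges the session's headers and matching cookies into the request in the same way.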

JavaScript-rendered pages

BeautifulSoup parses HTML — it doesn't run JavaScript. If the data is loaded dynamically by JS:

# requests + soup gets you the empty shell:
soup.find("div", id="results")    # empty in source

# Real options:
# 1. Find the underlying API the JS is calling (often easier).
# 2. Use Selenium or Playwright — real browser automation.
# 3. Use requests-html or similar JS-aware scrapers.

Many modern sites render their content with JavaScript. Open the Network tab in DevTools: often the JS just calls a JSON API you can hit directly.
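
The contrast between the empty shell and the underlying JSON can be sketched offline. Both strings below are invented: shell_html stands for what requests would download, and api_body stands for the response of a hypothetical endpoint spotted in the Network tab.

```python
import json
from bs4 import BeautifulSoup

# What requests would download: the container exists but has no rows yet.
shell_html = '<div id="results"></div><script src="/static/app.js"></script>'
soup = BeautifulSoup(shell_html, "html.parser")
results = soup.find("div", id="results")
print(results.get_text(strip=True) == "")   # True, nothing to scrape here

# Hypothetical body returned by the JSON endpoint the page's JS calls.
api_body = '{"items": [{"name": "Widget", "price": 12.99}], "next_page": null}'
payload = json.loads(api_body)
names = [item["name"] for item in payload["items"]]
print(names)   # ['Widget']
```

Against a live endpoint, requests.get(api_url, timeout=10).json() returns the same kind of dictionary, with api_url being whatever URL the Network tab shows.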

Cleaning extracted data

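import re

text = quote.text.strip()                          # trim the ends
text = " ".join(text.split())                      # collapse whitespace
text = text.replace("“", '"').replace("”", '"')    # smart quotes → ASCII

price_text = "$12.99 each"
match = re.search(r"\d+\.\d+", price_text)
price = float(match.group()) if match else None    # None when no number is found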

Scraped text is messy. Strip whitespace. Normalize Unicode. Use regex to extract numbers from "$12.99 each".
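
These cleaning steps can be bundled into two small helpers. clean_text and parse_price are hypothetical names, and NFKC is one reasonable choice of normalization form rather than the only one.

```python
import re
import unicodedata

def clean_text(raw):
    """Normalize Unicode (NFKC) and collapse runs of whitespace."""
    normalized = unicodedata.normalize("NFKC", raw)
    return " ".join(normalized.split())

def parse_price(raw):
    """Return the first decimal number in the string as a float, or None."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None

print(clean_text("  ﬁne\u00a0print \n"))   # fine print
print(parse_price("$12.99 each"))          # 12.99
print(parse_price("Sold out"))             # None
```

NFKC replaces the "ﬁ" ligature with plain "fi" and the non-breaking space with an ordinary space, two characters that often sneak into scraped text.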

Saving the results

import csv

# all_quotes is the list of dicts built in the pagination loop
with open("quotes.csv", "w", newline="", encoding="utf-8") as f:
  writer = csv.DictWriter(f, fieldnames=["author", "text"])
  writer.writeheader()
  writer.writerows(all_quotes)

# Or JSON
import json
with open("quotes.json", "w", encoding="utf-8") as f:
  json.dump(all_quotes, f, indent=2, ensure_ascii=False)

CSV for spreadsheet tools, JSON for further programmatic use, SQLite for serious work.

Common stumbles

class keyword. Use class_="..." in find_all. CSS selectors don't have this issue.

.text on None. When .find() returns None (no match), .text raises AttributeError. Check first or wrap in try.

Site requires JS. BeautifulSoup sees only the initial HTML. Use Playwright/Selenium or find the underlying API.

Aggressive scraping. Hammering a site → IP ban or worse. Throttle. Identify yourself.

Brittle selectors. Position-based or auto-generated class names break. Prefer stable identifiers.

Encoding mistakes. Pass response.content (bytes) and let BeautifulSoup detect the encoding. If you pass response.text instead, set response.encoding = "utf-8" first when requests guesses the encoding wrong.

Ignoring robots.txt. It's a legal gray area, and ethically it's clearly wrong to scrape a disallowed path. Check before scraping.

Storing without dedup. Multi-page scraping can pick up the same item twice. Track IDs.
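
Tracking IDs can be as simple as a set of keys already seen. In this sketch dedupe is a hypothetical helper, and the (author, text) pair is assumed to identify a quote because the sample records have no real ID field.

```python
def dedupe(items, key):
    """Keep the first item for each key value, preserving order."""
    seen = set()
    unique = []
    for item in items:
        identifier = key(item)
        if identifier in seen:
            continue
        seen.add(identifier)
        unique.append(item)
    return unique

scraped = [
    {"text": "Quote A", "author": "Ann"},
    {"text": "Quote B", "author": "Bob"},
    {"text": "Quote A", "author": "Ann"},   # same item seen again on a later page
]
unique = dedupe(scraped, key=lambda q: (q["author"], q["text"]))
print(len(unique))   # 2
```

When the site exposes a stable identifier such as a product URL or a data-id attribute, use that as the key instead.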

What's next

Lesson 34: SQLite Part 1. Embedded SQL database — sqlite3 module, connect, execute, fetchall.

Recap

BeautifulSoup(html, "html.parser") to parse. find() / find_all() for tag-based lookup; select() / select_one() for CSS selectors. .text for visible text, ["attr"] for attributes. Be polite: User-Agent, throttle, respect robots.txt. For JS-rendered sites, find the API or use a browser automation tool. Always handle missing elements (.find can return None).

