
Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33

Sandy Lane

Video: Web Scraping (BeautifulSoup, Selectors, Pagination) - Python Tutorial for Beginners #33, taught by Celeste AI - AI Coding Coach


Web Scraping with BeautifulSoup: Fetching, Parsing, and Pagination in Python

Discover how to scrape web pages using Python by fetching HTML content with requests and parsing it with BeautifulSoup. Learn to extract data using tag searches and CSS selectors, and handle multi-page scraping through pagination links to collect comprehensive datasets.

Code

import requests
from bs4 import BeautifulSoup
from collections import Counter

# Function to extract quotes from an already-parsed page
def scrape_quotes(soup):
  quotes = []
  # Find all quote blocks by their class
  for quote_div in soup.find_all('div', class_='quote'):
    text = quote_div.find('span', class_='text').get_text()
    author = quote_div.find('small', class_='author').get_text()
    tags = [tag.get_text() for tag in quote_div.select('.tags a.tag')]
    quotes.append({'text': text, 'author': author, 'tags': tags})
  return quotes

# Function to get the next page URL from pagination
def get_next_page(soup):
  next_button = soup.select_one('li.next a')
  if next_button:
    return 'http://quotes.toscrape.com' + next_button['href']
  return None

# Main scraping loop to collect quotes from multiple pages
def scrape_all_quotes(start_url):
  all_quotes = []
  url = start_url

  while url:
    print(f'Scraping {url}')
    # Fetch each page once and reuse the parsed soup for both
    # quote extraction and finding the next-page link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    all_quotes.extend(scrape_quotes(soup))
    url = get_next_page(soup)

  return all_quotes

if __name__ == '__main__':
  start_url = 'http://quotes.toscrape.com'
  quotes = scrape_all_quotes(start_url)

  # Count the most common tags across all quotes
  all_tags = [tag for quote in quotes for tag in quote['tags']]
  tag_counts = Counter(all_tags)

  print(f'Total quotes scraped: {len(quotes)}')
  print('Most common tags:', tag_counts.most_common(5))
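
The script above assumes every request succeeds. For real scraping runs it helps to set a timeout, fail fast on HTTP errors, and resolve pagination links with urljoin instead of hardcoding the site prefix. A minimal sketch; the fetch_soup helper and its defaults are illustrative additions, not part of the original tutorial:

```python
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def fetch_soup(url, timeout=10):
    # Fetch one page; raise on 4xx/5xx instead of silently parsing an error page
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

# urljoin resolves a relative (or root-relative) href against the current
# page URL, so pagination keeps working without a hardcoded base string
next_url = urljoin('http://quotes.toscrape.com/page/2/', '/page/3/')
print(next_url)  # http://quotes.toscrape.com/page/3/
```

Using raise_for_status() turns a 404 or 500 into an exception you can catch, rather than a page of error HTML that parses to zero quotes.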

Key Points

  • Use requests.get() to fetch HTML content from web pages.
  • Parse HTML with BeautifulSoup to navigate and extract elements using find() and find_all(), or CSS selectors via select() and select_one().
  • Extract text and attributes cleanly to build structured data such as dictionaries.
  • Handle pagination by locating "next" page links and iterating through multiple pages to scrape comprehensive datasets.
  • Analyze scraped data with Python tools like collections.Counter to summarize and gain insights.
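
The difference between tag searches and CSS selectors can be seen without hitting the network by running the same extraction on an inline HTML snippet. The sample HTML below is made up for illustration:

```python
from bs4 import BeautifulSoup
from collections import Counter

html = """
<div class="quote"><span class="text">Quote A</span>
  <small class="author">Ada</small>
  <div class="tags"><a class="tag">logic</a><a class="tag">math</a></div></div>
<div class="quote"><span class="text">Quote B</span>
  <small class="author">Alan</small>
  <div class="tags"><a class="tag">logic</a></div></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all: search by tag name and class attribute
authors = [s.get_text() for s in soup.find_all('small', class_='author')]

# select: CSS selector syntax -- "any a.tag inside a .tags container"
tags = [a.get_text() for a in soup.select('.tags a.tag')]

print(authors)                       # ['Ada', 'Alan']
print(Counter(tags).most_common(1))  # [('logic', 2)]
```

Both approaches return the same elements here; select() is handy when the match depends on nesting, while find_all() reads more plainly for a single tag-plus-class lookup.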