Web Scraping Using Python Tutorial: Website Scraping with Python (Requests, BeautifulSoup, Selenium, Scrapy)
ChatGPT & Benji Asperheim | Fri Aug 15th, 2025

Web Scraping Using Python Tutorial: Website Scraping with Python

Here’s a practical, no-nonsense guide to Python scraping that covers Requests/BS4, async, Scrapy, and headless browsers (Selenium/Playwright). I’ll be explicit about what’s safe and ethical and what’s a bad idea. I won’t give evasion recipes for anti-bot systems (that crosses a line); instead, I’ll show robust, production-grade patterns.



Core approaches & when to use them

| Approach | When it’s best | Pros | Cons |
|---|---|---|---|
| requests + BeautifulSoup (or lxml) | Static HTML pages, small jobs | Simple, fast, low deps | No JS execution |
| httpx or aiohttp (async) | Many pages, I/O bound | Throughput, concurrency | More moving parts |
| Scrapy | Crawlers with rules, pipelines, dedupe, scheduling | Batteries-included, fast | Learning curve |
| Selenium | Heavily JS-rendered pages; click/scroll flows | Real browser automation | Slow, resource-heavy |
| Playwright (often better than Selenium) | JS-heavy, needs reliability | Faster, robust waits | Headless browser overhead |

Rule of thumb: use the simplest tool that works. Prefer an official API or JSON endpoint; fall back to requests + BeautifulSoup for static HTML; reach for Scrapy when the crawl is large and rule-driven; pay the headless-browser cost (Playwright or Selenium) only when content truly requires rendering.


Requests + BeautifulSoup (baseline)

pip install requests beautifulsoup4 lxml
from __future__ import annotations
import time
from typing import List, Dict
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def fetch_html(url: str, timeout: int = 15) -> str:
    r = requests.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()
    return r.text

def parse_items(html: str) -> List[Dict[str, str]]:
    soup = BeautifulSoup(html, "lxml")
    items = []
    for card in soup.select(".post-card"):
        items.append({
            "title": card.select_one(".post-title").get_text(strip=True),
            "url":   card.select_one("a")["href"],
            # `or {}` lets .get() fall back cleanly when a card has no <time> tag
            "date":  (card.select_one("time") or {}).get("datetime", "")
        })
    return items

def crawl_listing(url: str) -> List[Dict[str, str]]:
    html = fetch_html(url)
    data = parse_items(html)
    # Follow a simple "next page" link if one exists
    soup = BeautifulSoup(html, "lxml")
    next_url = soup.select_one("a.next")
    if next_url and next_url.get("href"):
        time.sleep(1.0)  # be polite
        # urljoin resolves relative hrefs like "/blog/page/2"
        data += crawl_listing(urljoin(url, next_url["href"]))
    return data

if __name__ == "__main__":
    items = crawl_listing("https://example.com/blog")
    print(len(items), "items")

Notes

Send a descriptive User-Agent with a contact URL, always pass a timeout, and call raise_for_status() so HTTP errors surface immediately. Keep a short sleep between paginated requests to stay polite, and resolve relative links with urljoin before following them.


Async Web Scraper in Python (aiohttp or httpx)

pip install aiohttp lxml selectolax
import asyncio
from typing import List
import aiohttp
from selectolax.parser import HTMLParser

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # use an explicit ClientTimeout (total seconds) rather than a bare number
    timeout = aiohttp.ClientTimeout(total=30)
    async with session.get(url, headers=HEADERS, timeout=timeout) as r:
        r.raise_for_status()
        return await r.text()

def parse_titles(html: str) -> List[str]:
    tree = HTMLParser(html)
    return [n.text(strip=True) for n in tree.css(".post-title")]

async def main(urls: List[str]) -> None:
    connector = aiohttp.TCPConnector(limit=10)  # cap concurrency
    async with aiohttp.ClientSession(connector=connector) as session:
        for i in range(0, len(urls), 10):
            batch = urls[i:i+10]
            htmls = await asyncio.gather(*(fetch(session, u) for u in batch))
            for html in htmls:
                print(parse_titles(html))

if __name__ == "__main__":
    asyncio.run(main([f"https://example.com/page/{i}" for i in range(1, 51)]))
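
If you’d rather not manage fixed-size batches yourself, an asyncio.Semaphore keeps a steady cap on in-flight requests. Here’s a minimal sketch reusing the fetch() and parse_titles() helpers above (fetch_limited and main_semaphore are hypothetical names):

SEM = asyncio.Semaphore(10)  # never more than 10 requests in flight

async def fetch_limited(session: aiohttp.ClientSession, url: str) -> str:
    async with SEM:
        return await fetch(session, url)

async def main_semaphore(urls: List[str]) -> None:
    async with aiohttp.ClientSession() as session:
        htmls = await asyncio.gather(*(fetch_limited(session, u) for u in urls))
        for html in htmls:
            print(parse_titles(html))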

Notes

Cap concurrency deliberately (here via TCPConnector(limit=10) plus fixed-size batches), keep a timeout on every request, and use a fast parser such as selectolax when you’re parsing many pages.


Scrapy Web Scraper in Python (for real crawlers)

pip install scrapy
scrapy startproject blogcrawler

Here’s an example spider, saved as blogcrawler/spiders/posts.py, that uses Scrapy:

import scrapy

class PostsSpider(scrapy.Spider):
    name = "posts"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog"]

    custom_settings = {
        "ROBOTSTXT_OBEY": True,
        "DOWNLOAD_DELAY": 0.5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        "USER_AGENT": "UserBot/1.0 (+contact-url)",
        "FEEDS": {"items.json": {"format": "jsonlines"}},
    }

    def parse(self, response):
        for card in response.css(".post-card"):
            yield {
                "title": card.css(".post-title::text").get(default="").strip(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
                "date": card.css("time::attr(datetime)").get(default="")
            }
        next_href = response.css("a.next::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)
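
Run the spider from inside the generated project; the FEEDS setting above writes the scraped items to items.json as JSON Lines:

cd blogcrawler
scrapy crawl posts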

Why Scrapy

Scheduling, throttling, retries, duplicate filtering, item pipelines, and feed exports all come built in, so large rule-driven crawls don’t require hand-rolled plumbing. The trade-off is the learning curve noted in the table above.


Prefer JSON over HTML when possible

Most JS-heavy sites load their data from JSON endpoints. Grab those directly (install requests first with pip install requests):

import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
r = requests.get("https://example.com/api/articles?limit=50", headers=HEADERS, timeout=20)
r.raise_for_status()
data = r.json()
for a in data["items"]:
    print(a["title"], a["permalink"])

Requests is a third-party HTTP library that offers a simpler, more human-friendly API than Python’s built-in modules (like urllib). It isn’t part of the standard library because including it by default would bloat Python with features not everyone needs. Also, keeping it separate lets it evolve more quickly—bug fixes and new features aren’t tied to Python’s release cycle.
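
For comparison, here’s the same GET done with only the standard library’s urllib, hitting the same placeholder endpoint; it works, but you handle headers, decoding, and JSON parsing yourself:

# The same request with only the standard library, for comparison
import json
import urllib.request

req = urllib.request.Request(
    "https://example.com/api/articles?limit=50",
    headers={"User-Agent": "UserBot/1.0 (+contact-url)"},
)
with urllib.request.urlopen(req, timeout=20) as resp:
    data = json.loads(resp.read().decode("utf-8"))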

How to find: Open DevTools → Network → XHR/Fetch → copy request as cURL → convert to Python. Respect CORS/auth; don’t lift tokens you’re not allowed to use.


Web Scraper in Python Using Selenium

When the page must be rendered by a real browser, use Selenium.

Install:

pip install selenium webdriver-manager

Basic pattern (using a Chrome webdriver):

from typing import List, Dict
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_driver(headless: bool = True) -> webdriver.Chrome:
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1366,768")
    options.add_argument("user-agent=UserBot/1.0 (+contact-url)")
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)

def scrape_dynamic(url: str) -> List[Dict[str, str]]:
    driver = get_driver()
    try:
        driver.get(url)
        WebDriverWait(driver, 20).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".post-card"))
        )
        cards = driver.find_elements(By.CSS_SELECTOR, ".post-card")
        items = []
        for c in cards:
            title = c.find_element(By.CSS_SELECTOR, ".post-title").text.strip()
            href  = c.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
            date  = c.find_element(By.CSS_SELECTOR, "time").get_attribute("datetime")
            items.append({"title": title, "url": href, "date": date})
        return items
    finally:
        driver.quit()

if __name__ == "__main__":
    print(scrape_dynamic("https://example.com/blog"))

Infinite scroll / “Load more” (safe pattern)

You can use a loop to load more HTML elements like so:

from selenium.webdriver.common.keys import Keys
import time

def scroll_to_load(driver, max_scrolls=10, pause=1.0):
    # Press END repeatedly so lazily loaded content has time to render
    body = driver.find_element(By.TAG_NAME, "body")
    for _ in range(max_scrolls):
        body.send_keys(Keys.END)
        time.sleep(pause)
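
If you’d rather stop as soon as nothing new loads, you can compare document.body.scrollHeight between passes. A sketch using the same driver (scroll_until_stable is a hypothetical helper):

def scroll_until_stable(driver, max_scrolls=20, pause=1.0):
    # Stop early once the page height stops growing (nothing new was loaded)
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height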

Selenium caveats

It’s slow and resource-heavy compared with plain HTTP requests, the driver and browser versions have to stay in sync, and click/scroll flows are brittle. Reserve it for pages that genuinely require rendering or interaction.


Playwright Website Scraping with Python

Playwright is often the better headless choice when it comes to web scraping in Python.

Install the playwright pip package, then download its browser binaries:

pip install playwright
playwright install

The following example launches a headless browser session and scrapes the listing:

from typing import List, Dict
from playwright.sync_api import sync_playwright

def scrape_playwright(url: str) -> List[Dict[str, str]]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page(user_agent="UserBot/1.0 (+contact-url)")
        page.goto(url, wait_until="domcontentloaded", timeout=30000)
        page.wait_for_selector(".post-card", timeout=20000)
        cards = page.query_selector_all(".post-card")
        items = []
        for c in cards:
            time_el = c.query_selector("time")  # <time> may be missing on some cards
            items.append({
                "title": c.query_selector(".post-title").inner_text().strip(),
                "url": c.query_selector("a").get_attribute("href"),
                "date": time_el.get_attribute("datetime") if time_el else "",
            })
        browser.close()
        return items

if __name__ == "__main__":
    print(scrape_playwright("https://example.com/blog"))
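
Playwright’s locator API auto-waits and is what its docs now recommend over query_selector. Here’s a sketch of just the extraction loop, assuming the same page object and selectors (extract_with_locators is a hypothetical helper):

def extract_with_locators(page) -> List[Dict[str, str]]:
    cards = page.locator(".post-card")
    items = []
    for i in range(cards.count()):
        card = cards.nth(i)
        time_loc = card.locator("time")
        items.append({
            "title": card.locator(".post-title").inner_text().strip(),
            "url": card.locator("a").first.get_attribute("href"),
            "date": time_loc.get_attribute("datetime") if time_loc.count() else "",
        })
    return items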

Why Playwright

Faster startup, more reliable auto-waiting, and first-class support for Chromium, Firefox, and WebKit make it the sturdier choice for JS-heavy pages, at the cost of the usual headless-browser overhead.


Handling Data Blocks While Scraping

Here are some key ideas to consider, without playing “cat-and-mouse” games, when a site blocks or throttles your scraper:

  - Slow down: add delays, back off exponentially on errors, and cache pages you’ve already fetched.
  - Identify yourself honestly (a User-Agent with a contact URL) and respect robots.txt and the site’s terms.
  - Prefer an official API or data export over re-scraping HTML.
  - If the site clearly doesn’t want automated access, stop rather than trying to defeat its protections.

Google example (do this instead of scraping SERPs):

pip install google-api-python-client
from googleapiclient.discovery import build

API_KEY = "YOUR_KEY"
CX = "YOUR_CSE_ID"

service = build("customsearch", "v1", developerKey=API_KEY)
res = service.cse().list(q="site:angular.dev signal inputs", cx=CX, num=10).execute()
for item in res.get("items", []):
    print(item["title"], item["link"])
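
The API pages with a 1-based start parameter (up to 10 results per request), so fetching the next page is just another call; a sketch:

# fetch results 11-20 (start is 1-based)
res2 = service.cse().list(q="site:angular.dev signal inputs", cx=CX, num=10, start=11).execute()
for item in res2.get("items", []):
    print(item["title"], item["link"])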

Python Data Modeling and Storage

Here are some data modeling, validation, and storage examples using pydantic.

Install this with pip as well:

pip install pydantic

Define typed models (pydantic/dataclasses) so you don’t write junk:

from pydantic import BaseModel, HttpUrl
from typing import Optional

class Post(BaseModel):
    title: str
    url: HttpUrl
    date: Optional[str] = None
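
Validation then happens at construction time; a bad row raises ValidationError instead of silently landing in your dataset. A quick sketch with a made-up row:

from pydantic import ValidationError

row = {"title": "Hello", "url": "https://example.com/hello", "date": "2025-08-15"}
try:
    post = Post(**row)
except ValidationError as exc:
    print("rejected row:", exc)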

Store to CSV/JSON for small jobs; SQLite/Postgres for bigger ones:

import sqlite3

def save_sqlite(rows):
    conn = sqlite3.connect("items.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS posts(title TEXT, url TEXT, date TEXT)")
    cur.executemany("INSERT INTO posts(title, url, date) VALUES (?, ?, ?)",
                    [(r["title"], r["url"], r.get("date","")) for r in rows])
    conn.commit()
    conn.close()
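
For the CSV option mentioned above, csv.DictWriter does the job for small runs; a sketch (save_csv and the filename are arbitrary):

import csv

def save_csv(rows, path="items.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url", "date"])
        writer.writeheader()
        writer.writerows(rows)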

Robust Data Patterns in Python

pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

@retry(
    reraise=True,
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=0.5, min=1, max=30),
    retry=retry_if_exception_type((requests.HTTPError, requests.ConnectionError, requests.Timeout)),
)
def get_json(url: str) -> dict:
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    return r.json()
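
Usage looks like an ordinary call; the decorator retries with exponential backoff behind the scenes (same placeholder endpoint as earlier):

if __name__ == "__main__":
    data = get_json("https://example.com/api/articles?limit=50")
    print(len(data.get("items", [])), "records")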

Common Scraping Patterns

Pagination
Follow “next” links (as in the Requests and Scrapy examples above) or step through page/offset query parameters until no new items come back.

Detail pages
Collect URLs from the listing pass first, then fetch each detail page in a second, throttled pass and merge the results.

File downloads
Stream large files to disk instead of reading them into memory (see the sketch after this list).

Localization/AB variants
Pin the language or variant explicitly (Accept-Language header, cookies, or URL parameters) so your selectors always see the same markup.
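
For file downloads specifically, here’s a minimal streaming sketch with requests; the helper name and destination path are assumptions, and stream=True plus iter_content keeps memory flat:

import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def download_file(url: str, dest: str) -> None:
    # stream=True avoids loading the whole file into memory
    with requests.get(url, headers=HEADERS, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)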


Web Scraping Tests & CI
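
One low-maintenance approach is to run your parsers against saved HTML fixtures in CI so selector drift shows up as a failing test. A minimal pytest sketch, assuming parse_items from the Requests/BS4 example lives in a module named scraper and a fixture was saved at tests/fixtures/listing.html (both names are hypothetical):

from pathlib import Path

from scraper import parse_items  # hypothetical module holding the BS4 parser

def test_parse_items_extracts_posts():
    html = Path("tests/fixtures/listing.html").read_text(encoding="utf-8")
    items = parse_items(html)
    assert items, "expected at least one .post-card in the fixture"
    assert all({"title", "url", "date"} <= set(item) for item in items)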


Putting All the Python Web Scraping Together

# 1) Discover API via DevTools. If JSON exists, use that. Else:
html = fetch_html("https://example.com/blog")
listing = parse_items(html)
# 2) Batch detail fetches politely (requests or aiohttp)
# 3) Validate with pydantic, dedupe, persist
# 4) Schedule with cron/GitHub Actions; log metrics

Quick Python Web Scraping Decision Tree

  1. Is there an official API? Use it.
  2. Is content static or backed by JSON endpoints? Use Requests, hitting the JSON endpoint directly when one exists.
  3. Is content only visible post-render + interactions? Use Playwright (or Selenium).
  4. Is it a large, rule-driven crawl? Use Scrapy.
  5. Is it protected or does it present challenges (CAPTCHA, JS checks, terms that forbid scraping)? Stop and re-route.

When Google Blocks BS4/Requests

If you attempt to scrape Google Search, or similar protected properties, it isn’t worth the trouble on either ethical or practical grounds, so don’t fight it. Use the Google Custom Search JSON API or a compliant provider. Selenium can render the pages, but using it to extract SERPs is still against Google’s terms and risks breakage and account penalties; it’s also slower and costlier than doing the right thing with an API.

Conclusion

Pick the lightest tool that gets the job done: an official API or JSON endpoint first, requests + BeautifulSoup for static HTML, Scrapy for large rule-driven crawls, and Playwright or Selenium only when a page truly has to render. Be polite (identify yourself, throttle, obey robots.txt), validate what you store with typed models, and when a site actively blocks you, take the hint and switch to a sanctioned API instead of fighting it.