Web Scraping Using Python: A Practical Tutorial
Here’s a practical, no-nonsense guide to Python scraping that covers Requests/BS4, async, Scrapy, and headless browsers (Selenium/Playwright). I’ll be explicit about what’s safe/ethical and what’s a bad idea. I will not give evasion recipes for anti-bot systems (that crosses a line). I will show robust, production-grade patterns.
Legal, ethical, and practical guardrails (read this)
- Check robots.txt and the site's Terms. If disallowed, don't scrape. If data is behind auth or a paywall, don't bypass. (A quick programmatic robots.txt check is sketched after this list.)
- Prefer official APIs when available (news/search/cloud platforms often have one). They're faster, cheaper, and safer.
- Throttle: small concurrent limits, randomized polite delays, proper retries; identify yourself in a User-Agent.
- Don't scrape Google SERPs. Use a sanctioned API (e.g., Google Custom Search JSON API) or a third-party search API that handles compliance.
- Avoid CAPTCHAs: treat them as a stop sign, not a challenge to beat.
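Before you crawl anything, you can check robots.txt programmatically with the standard library. A minimal sketch (example.com and the UserBot agent string are placeholders):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# Returns False if the path is disallowed for this user agent
print(rp.can_fetch("UserBot/1.0", "https://example.com/blog"))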
Core approaches & when to use them
| Approach | When it's best | Pros | Cons |
| --- | --- | --- | --- |
| requests + BeautifulSoup (or lxml) | Static HTML pages, small jobs | Simple, fast, low deps | No JS execution |
| httpx or aiohttp (async) | Many pages, I/O bound | Throughput, concurrency | More moving parts |
| Scrapy | Crawlers with rules, pipelines, dedupe, scheduling | Batteries-included, fast | Learning curve |
| Selenium | Heavily JS-rendered pages; click/scroll flows | Real browser automation | Slow, resource-heavy |
| Playwright (often better than Selenium) | JS-heavy, needs reliability | Faster, robust waits | Headless browser overhead |
Rule of thumb:
- First try Requests/BS4.
- If the site is JS-rendered, try to grab the same JSON the page fetches (devtools → Network) with Requests.
- If you truly need a browser, pick Playwright; use Selenium if your team already standardizes on it or your infra requires it.
- Use Scrapy when the task is a crawler with lots of pagination, item pipelines, and dedupe.
Requests + BeautifulSoup (baseline)
pip install requests beautifulsoup4 lxml
from __future__ import annotations
import time
from typing import List, Dict
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
def fetch_html(url: str, timeout: int = 15) -> str:
r = requests.get(url, headers=HEADERS, timeout=timeout)
r.raise_for_status()
return r.text
def parse_items(html: str) -> List[Dict[str, str]]:
soup = BeautifulSoup(html, "lxml")
items = []
for card in soup.select(".post-card"):
items.append({
"title": card.select_one(".post-title").get_text(strip=True),
"url": card.select_one("a")["href"],
"date": (card.select_one("time") or {}).get("datetime", "")
})
return items
def crawl_listing(url: str) -> List[Dict[str, str]]:
html = fetch_html(url)
data = parse_items(html)
    # Follow a simple "next page" link if present
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one("a.next")
    if next_link and next_link.get("href"):
        time.sleep(1.0)  # be polite
        data += crawl_listing(urljoin(url, next_link["href"]))  # resolve relative hrefs
return data
if __name__ == "__main__":
items = crawl_listing("https://example.com/blog")
print(len(items), "items")
Notes
- Use soup.select() with CSS selectors; switch to lxml.html with XPath when needed (see the sketch after these notes).
- Use r.raise_for_status() to fail fast on 4xx/5xx.
- Don't hammer the site. Sleep between pages.
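If you prefer XPath, lxml.html extracts the same data. A minimal sketch against the same hypothetical .post-card markup:
from typing import Dict, List
from lxml import html as lxml_html

def parse_items_xpath(page_html: str) -> List[Dict[str, str]]:
    tree = lxml_html.fromstring(page_html)
    items = []
    for card in tree.xpath('//*[contains(@class, "post-card")]'):
        title = card.xpath('string(.//*[contains(@class, "post-title")])').strip()
        hrefs = card.xpath('.//a/@href')
        items.append({"title": title, "url": hrefs[0] if hrefs else ""})
    return items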
Async Web Scraper in Python (aiohttp or httpx)
pip install aiohttp lxml selectolax
import asyncio
from typing import List
import aiohttp
from selectolax.parser import HTMLParser
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
async def fetch(session: aiohttp.ClientSession, url: str) -> str:
async with session.get(url, headers=HEADERS, timeout=30) as r:
r.raise_for_status()
return await r.text()
def parse_titles(html: str) -> List[str]:
tree = HTMLParser(html)
return [n.text(strip=True) for n in tree.css(".post-title")]
async def main(urls: List[str]) -> None:
connector = aiohttp.TCPConnector(limit=10) # cap concurrency
async with aiohttp.ClientSession(connector=connector) as session:
for i in range(0, len(urls), 10):
batch = urls[i:i+10]
htmls = await asyncio.gather(*(fetch(session, u) for u in batch))
for html in htmls:
print(parse_titles(html))
if __name__ == "__main__":
asyncio.run(main([f"https://example.com/page/{i}" for i in range(1, 51)]))
Notes
- Keep concurrency low (5-20) unless the site explicitly allows higher.
- Add retries with backoff and caching (e.g., requests-cache) if you're iterating locally.
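Since the heading also mentions httpx, here is a roughly equivalent sketch with httpx.AsyncClient (same placeholder URLs and selectors; install with pip install httpx):
import asyncio
from typing import List

import httpx
from selectolax.parser import HTMLParser

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

async def fetch_titles(urls: List[str]) -> None:
    limits = httpx.Limits(max_connections=10)  # cap concurrency
    async with httpx.AsyncClient(headers=HEADERS, limits=limits, timeout=30) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    for r in responses:
        r.raise_for_status()
        print([n.text(strip=True) for n in HTMLParser(r.text).css(".post-title")])

if __name__ == "__main__":
    asyncio.run(fetch_titles([f"https://example.com/page/{i}" for i in range(1, 6)]))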
Scrapy Web Scraper in Python (for real crawlers)
pip install scrapy
scrapy startproject blogcrawler
Here's an example blogcrawler/spiders/posts.py spider that uses Scrapy:
import scrapy
class PostsSpider(scrapy.Spider):
name = "posts"
allowed_domains = ["example.com"]
start_urls = ["https://example.com/blog"]
custom_settings = {
"ROBOTSTXT_OBEY": True,
"DOWNLOAD_DELAY": 0.5,
"CONCURRENT_REQUESTS_PER_DOMAIN": 8,
"USER_AGENT": "UserBot/1.0 (+contact-url)",
"FEEDS": {"items.json": {"format": "jsonlines"}},
}
def parse(self, response):
for card in response.css(".post-card"):
yield {
"title": card.css(".post-title::text").get(default="").strip(),
"url": response.urljoin(card.css("a::attr(href)").get()),
"date": card.css("time::attr(datetime)").get(default="")
}
next_href = response.css("a.next::attr(href)").get()
if next_href:
yield response.follow(next_href, callback=self.parse)
Why Scrapy
- Built-in throttling, caching, retries, pipelines, and export formats.
- Easy to scale, schedule, and dedupe.
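For the dedupe piece, a minimal item pipeline might look like the following sketch (UrlDedupePipeline is a made-up name, and it assumes the spider yields a "url" field):
# blogcrawler/pipelines.py
from scrapy.exceptions import DropItem

class UrlDedupePipeline:
    """Drop items whose URL was already seen in this crawl."""
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen:
            raise DropItem(f"duplicate: {url}")
        self.seen.add(url)
        return item
Enable it via settings, e.g. "ITEM_PIPELINES": {"blogcrawler.pipelines.UrlDedupePipeline": 300}, then run the spider with scrapy crawl posts.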
Prefer JSON over HTML when possible
Most JS sites call JSON endpoints. Grab those directly (install requests first with pip install requests):
import requests
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
r = requests.get("https://example.com/api/articles?limit=50", headers=HEADERS, timeout=20)
r.raise_for_status()
data = r.json()
for a in data["items"]:
print(a["title"], a["permalink"])
Requests is a third-party HTTP library that offers a simpler, more human-friendly API than Python’s built-in modules (like urllib). It isn’t part of the standard library because including it by default would bloat Python with features not everyone needs. Also, keeping it separate lets it evolve more quickly—bug fixes and new features aren’t tied to Python’s release cycle.
How to find: Open DevTools → Network → XHR/Fetch → copy request as cURL → convert to Python. Respect CORS/auth; don’t lift tokens you’re not allowed to use.
Web Scraper in Python Using Selenium
When the page must render in a real browser before the data exists, use Selenium.
Install:
pip install selenium webdriver-manager
Basic pattern (using a Chrome webdriver):
from typing import List, Dict
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def get_driver(headless: bool = True) -> webdriver.Chrome:
options = webdriver.ChromeOptions()
if headless:
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1366,768")
options.add_argument("user-agent=UserBot/1.0 (+contact-url)")
service = Service(ChromeDriverManager().install())
return webdriver.Chrome(service=service, options=options)
def scrape_dynamic(url: str) -> List[Dict[str, str]]:
driver = get_driver()
try:
driver.get(url)
WebDriverWait(driver, 20).until(
EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".post-card"))
)
cards = driver.find_elements(By.CSS_SELECTOR, ".post-card")
items = []
for c in cards:
title = c.find_element(By.CSS_SELECTOR, ".post-title").text.strip()
href = c.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
date = c.find_element(By.CSS_SELECTOR, "time").get_attribute("datetime")
items.append({"title": title, "url": href, "date": date})
return items
finally:
driver.quit()
if __name__ == "__main__":
print(scrape_dynamic("https://example.com/blog"))
Infinite scroll / “Load more” (safe pattern)
You can use a loop to load more HTML elements like so:
from selenium.webdriver.common.keys import Keys
import time
def scroll_to_load(driver, max_scrolls=10, pause=1.0):
body = driver.find_element(By.TAG_NAME, "body")
for _ in range(max_scrolls):
body.send_keys(Keys.END)
time.sleep(pause)
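If the page keeps appending content, a common variant is to scroll and stop once the document height stops growing. A sketch using execute_script (reuses the driver and the time import above):
def scroll_until_stable(driver, max_scrolls=20, pause=1.0):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; stop scrolling
        last_height = new_height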
Selenium caveats
- It’s slow and heavy. Use it only when you must click/scroll or when the content only exists post-render.
- If a site actively blocks automation or throws challenges, stop and find an official/API route. Don’t try to defeat protections.
Playwright Website Scraping with Python
Playwright is often the better headless choice when it comes to web scraping in Python.
Install the playwright package and its browser binaries:
pip install playwright
playwright install
The following is some example Python code to initiate a headless browser session to scrape the web:
from typing import List, Dict
from playwright.sync_api import sync_playwright
def scrape_playwright(url: str) -> List[Dict[str, str]]:
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page(user_agent="UserBot/1.0 (+contact-url)")
page.goto(url, wait_until="domcontentloaded", timeout=30000)
page.wait_for_selector(".post-card", timeout=20000)
cards = page.query_selector_all(".post-card")
items = []
        for c in cards:
            title_el = c.query_selector(".post-title")
            link_el = c.query_selector("a")
            time_el = c.query_selector("time")
            items.append({
                "title": title_el.inner_text().strip() if title_el else "",
                "url": link_el.get_attribute("href") if link_el else "",
                "date": time_el.get_attribute("datetime") if time_el else "",
            })
browser.close()
return items
if __name__ == "__main__":
print(scrape_playwright("https://example.com/blog"))
Why Playwright
- Reliable waits (wait_until="networkidle"), context isolation, tracing (sketched below).
- Generally faster and less flaky than Selenium.
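Context isolation and tracing look roughly like this (a sketch; the trace.zip path is arbitrary):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent="UserBot/1.0 (+contact-url)")  # isolated cookies/storage
    context.tracing.start(screenshots=True, snapshots=True)
    page = context.new_page()
    page.goto("https://example.com/blog", wait_until="networkidle")
    # ... scrape here ...
    context.tracing.stop(path="trace.zip")  # inspect later with: playwright show-trace trace.zip
    browser.close()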
Handling Data Blocks While Scraping
Here are some key ideas to keep in mind while scraping websites, without playing “cat-and-mouse” games:
- Back off on 429/503. Respect rate limits and honor the Retry-After header when present (see the sketch after this list).
- Reduce concurrency, increase delays, fetch in off-peak hours.
- Identify yourself with a clear UA and contact; some sites whitelist or throttle fairly.
- Ask for permission or request an API key. Many owners will help if you explain the use case.
- If the target is Google Search or other protected/monetized endpoints, don’t scrape. Use an API.
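A polite backoff loop might look like this sketch: it honors Retry-After when the server sends one and otherwise falls back to exponential delays.
import time
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def polite_get(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        r = requests.get(url, headers=HEADERS, timeout=20)
        if r.status_code not in (429, 503):
            r.raise_for_status()
            return r
        retry_after = r.headers.get("Retry-After")
        # Honor Retry-After seconds if given, otherwise back off exponentially
        delay = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        time.sleep(delay)
    r.raise_for_status()
    return r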
Google example (do this instead of scraping SERPs):
pip install google-api-python-client
from googleapiclient.discovery import build
API_KEY = "YOUR_KEY"
CX = "YOUR_CSE_ID"
service = build("customsearch", "v1", developerKey=API_KEY)
res = service.cse().list(q="site:angular.dev signal inputs", cx=CX, num=10).execute()
for item in res.get("items", []):
print(item["title"], item["link"])
Python Data Modeling and Storage
Here are some data modeling, validation, and storage examples using pydantic. Install it with pip as well:
pip install pydantic
Define typed models (pydantic/dataclasses) so you don’t write junk:
from pydantic import BaseModel, HttpUrl
from typing import Optional
class Post(BaseModel):
title: str
url: HttpUrl
date: Optional[str] = None
Store to CSV/JSON for small jobs; SQLite/Postgres for bigger ones:
import sqlite3
def save_sqlite(rows):
conn = sqlite3.connect("items.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS posts(title TEXT, url TEXT, date TEXT)")
cur.executemany("INSERT INTO posts(title, url, date) VALUES (?, ?, ?)",
[(r["title"], r["url"], r.get("date","")) for r in rows])
conn.commit()
conn.close()
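Tying the two together, you can validate each scraped dict with the Post model above before it hits the database (a sketch that reuses Post and save_sqlite from this section; it works with pydantic v1 and v2 since it only uses the constructor):
from pydantic import ValidationError

raw_rows = [
    {"title": "Hello", "url": "https://example.com/hello", "date": "2024-01-01"},
    {"title": "Bad row", "url": "not-a-url"},
]

valid = []
for row in raw_rows:
    try:
        post = Post(**row)  # raises ValidationError on junk
        valid.append({"title": post.title, "url": str(post.url), "date": post.date or ""})
    except ValidationError:
        continue  # log and skip invalid rows

save_sqlite(valid)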
Robust Data Patterns in Python
- Retry with jitter/backoff (e.g., tenacity):
pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
@retry(
reraise=True,
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=0.5, min=1, max=30),
retry=retry_if_exception_type((requests.HTTPError, requests.ConnectionError, requests.Timeout)),
)
def get_json(url: str) -> dict:
r = requests.get(url, headers=HEADERS, timeout=15)
r.raise_for_status()
return r.json()
- Deduplicate by URL hash; keep pipelines idempotent (a sketch follows this list).
- Caching (requests-cache) during development to avoid hammering targets.
- Structured logs and metrics (success/skip/fail counts).
- Selectors resilient to markup drift: prefer IDs/data-attributes, not nth-child chains.
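Deduplication by URL hash can be as simple as a seen-set keyed on a normalized URL digest (a minimal in-memory sketch; swap the set for a database table in a real pipeline):
import hashlib

seen = set()

def url_key(url: str) -> str:
    # Normalize lightly (strip fragment and trailing slash) before hashing
    base = url.split("#", 1)[0].rstrip("/")
    return hashlib.sha256(base.encode("utf-8")).hexdigest()

def is_new(url: str) -> bool:
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True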
Common Scraping Patterns
Pagination
- “Next” link → follow until absent.
- Offset/limit query params → increase until empty page.
- Cursor tokens (GraphQL/REST) → follow nextCursor (a loop is sketched below).
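A cursor loop looks roughly like this (a sketch; the /api/articles endpoint and the items/nextCursor fields are hypothetical):
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def fetch_all_pages(base_url: str) -> list:
    items, cursor = [], None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor
        r = requests.get(base_url, headers=HEADERS, params=params, timeout=20)
        r.raise_for_status()
        data = r.json()
        items.extend(data.get("items", []))
        cursor = data.get("nextCursor")
        if not cursor:
            break
    return items

items = fetch_all_pages("https://example.com/api/articles")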
Detail pages
- First collect listing URLs; then fetch detail pages in small batches.
File downloads
- Stream with requests.get(..., stream=True), write chunks, verify size/hash (a sketch follows).
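Streaming keeps memory flat for large files. A sketch (the expected_sha256 argument assumes you have a known-good checksum to verify against):
import hashlib
from typing import Optional

import requests

def download_file(url: str, dest: str, expected_sha256: Optional[str] = None) -> str:
    sha = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    sha.update(chunk)
    digest = sha.hexdigest()
    if expected_sha256 and digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}")
    return digest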
Localization/AB variants
- Sites may vary markup by geo/AB test. Build fallbacks and capture unknowns for review.
Web Scraping Tests & CI
- Unit-test parsers with saved fixtures (HTML snapshots) to detect breakage (example after this list).
- Run small CI checks nightly with no network (use fixtures) and a weekly small live probe against a whitelisted sandbox.
- Smoke test: fail if 0 items parsed or if the DOM contract changes.
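A fixture test can be as small as this sketch (using pytest; tests/fixtures/listing.html is a saved snapshot and myscraper is a hypothetical module holding the parse_items parser from earlier):
from pathlib import Path

from myscraper import parse_items  # hypothetical module containing the parser

def test_parse_items_from_fixture():
    html = Path("tests/fixtures/listing.html").read_text(encoding="utf-8")
    items = parse_items(html)
    assert len(items) > 0  # smoke test: the DOM contract still holds
    assert all(item["title"] and item["url"] for item in items)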
Putting All the Python Web Scraping Together
# 1) Discover API via DevTools. If JSON exists, use that. Else:
html = fetch_html("https://example.com/blog")
listing = parse_items(html)
# 2) Batch detail fetches politely (requests or aiohttp)
# 3) Validate with pydantic, dedupe, persist
# 4) Schedule with cron/GitHub Actions; log metrics
Quick Python Web Scraping Decision Tree
- Is there an official API? Use it.
- Is content static or backed by JSON endpoints? Use Requests to the JSON.
- Is content only visible post-render + interactions? Use Playwright (or Selenium).
- Is it a large, rule-driven crawl? Use Scrapy.
- Is it protected / presents challenges (CAPTCHA, JS checks, terms forbid)? Stop. Re-route.
When Google Blocks BS4/Requests
If you attempt to scrape Google Search or similar protected properties, it is not worth the trouble on ethical or practical grounds; don't fight it. Use the Google Custom Search JSON API or a compliant provider. Selenium can render the pages, but using it to extract SERPs still violates Google's terms and risks breakage and account penalties, and it is slower and costlier than doing the right thing with an API.
Conclusion
- Start simple: Requests + BeautifulSoup (or lxml) for static pages; prefer JSON endpoints you find in DevTools over parsing HTML.
- Scale responsibly: for many pages use httpx/aiohttp; for rule-driven crawls and pipelines use Scrapy.
- Bring a browser only when necessary: Playwright (often best) or Selenium for JS-rendered flows, waits, and interactions.
- Be ethical and durable: respect robots.txt/ToS, throttle, add retries with exponential backoff, cache during development, and log/validate results with typed models.
- Productionize: write resilient selectors, deduplicate by URL, store to SQLite/Postgres, and test parsers with fixtures so markup changes don’t silently break your jobs.
- When a site is protected (e.g., Google Search), don’t play cat-and-mouse—use official APIs or obtain permission.