Web Scraping Using Python: A Practical Tutorial
Here’s a practical, no-nonsense guide to Python scraping that covers Requests/BS4, async, Scrapy, and headless browsers (Selenium/Playwright). I’ll be explicit about what’s safe/ethical and what’s a bad idea. I will not give evasion recipes for anti-bot systems (that crosses a line). I will show robust, production-grade patterns.
Legal, ethical, and practical guardrails (read this)
- Check robots.txt and the site's Terms. If disallowed, don't scrape. If data is behind auth or a paywall, don't bypass. (A quick programmatic robots.txt check is sketched after this list.)
- Prefer official APIs when available (news/search/cloud platforms often have one). They're faster, cheaper, and safer.
- Throttle: small concurrent limits, randomized polite delays, proper retries; identify yourself in a User-Agent.
- Don't scrape Google SERPs. Use a sanctioned API (e.g., Google Custom Search JSON API) or a third-party search API that handles compliance.
- Avoid CAPTCHAs: treat them as a stop sign, not a challenge to beat.
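Before you crawl anything, you can check robots.txt programmatically with the standard library. A minimal sketch (example.com and the UserBot agent string are placeholders):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# Returns False if the path is disallowed for this user agent
print(rp.can_fetch("UserBot/1.0", "https://example.com/blog"))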
Core approaches & when to use them
| Approach | When it's best | Pros | Cons |
| --- | --- | --- | --- |
| requests + BeautifulSoup (or lxml) | Static HTML pages, small jobs | Simple, fast, low deps | No JS execution |
| httpx or aiohttp (async) | Many pages, I/O bound | Throughput, concurrency | More moving parts |
| Scrapy | Crawlers with rules, pipelines, dedupe, scheduling | Batteries-included, fast | Learning curve |
| Selenium | Heavily JS-rendered pages; click/scroll flows | Real browser automation | Slow, resource-heavy |
| Playwright (often better than Selenium) | JS-heavy, needs reliability | Faster, robust waits | Headless browser overhead |
Rule of thumb:
- First try Requests/BS4.
- If the site is JS-rendered, try to grab the same JSON the page fetches (devtools → Network) with Requests.
- If you truly need a browser, pick Playwright; use Selenium if your team already standardizes on it or your infra requires it.
- Use Scrapy when the task is a crawler with lots of pagination, item pipelines, and dedupe.
Requests + BeautifulSoup (baseline)
pip install requests beautifulsoup4 lxml
from __future__ import annotations
import time
from typing import List, Dict
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
def fetch_html(url: str, timeout: int = 15) -> str:
r = requests.get(url, headers=HEADERS, timeout=timeout)
r.raise_for_status()
return r.text
def parse_items(html: str) -> List[Dict[str, str]]:
soup = BeautifulSoup(html, "lxml")
items = []
for card in soup.select(".post-card"):
items.append({
"title": card.select_one(".post-title").get_text(strip=True),
"url": card.select_one("a")["href"],
"date": (card.select_one("time") or {}).get("datetime", "")
})
return items
def crawl_listing(url: str) -> List[Dict[str, str]]:
html = fetch_html(url)
data = parse_items(html)
    # Follow a simple "next page" link if present
    soup = BeautifulSoup(html, "lxml")
    next_link = soup.select_one("a.next")
    if next_link and next_link.get("href"):
        time.sleep(1.0)  # be polite
        data += crawl_listing(urljoin(url, next_link["href"]))  # resolve relative hrefs
return data
if __name__ == "__main__":
items = crawl_listing("https://example.com/blog")
print(len(items), "items")
Notes
- Use soup.select() with CSS selectors; switch to lxml.html with XPath when needed (see the sketch after these notes).
- Use r.raise_for_status() to fail fast on 4xx/5xx.
- Don't hammer the site. Sleep between pages.
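If you prefer XPath, lxml.html extracts the same data. A minimal sketch against the same hypothetical .post-card markup:
from typing import Dict, List
from lxml import html as lxml_html

def parse_items_xpath(page_html: str) -> List[Dict[str, str]]:
    tree = lxml_html.fromstring(page_html)
    items = []
    for card in tree.xpath('//*[contains(@class, "post-card")]'):
        title = card.xpath('string(.//*[contains(@class, "post-title")])').strip()
        hrefs = card.xpath('.//a/@href')
        items.append({"title": title, "url": hrefs[0] if hrefs else ""})
    return items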
Async Web Scraper in Python (aiohttp or httpx)
pip install aiohttp lxml selectolax
import asyncio
from typing import List
import aiohttp
from selectolax.parser import HTMLParser
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
async def fetch(session: aiohttp.ClientSession, url: str) -> str:
async with session.get(url, headers=HEADERS, timeout=30) as r:
r.raise_for_status()
return await r.text()
def parse_titles(html: str) -> List[str]:
tree = HTMLParser(html)
return [n.text(strip=True) for n in tree.css(".post-title")]
async def main(urls: List[str]) -> None:
connector = aiohttp.TCPConnector(limit=10) # cap concurrency
async with aiohttp.ClientSession(connector=connector) as session:
for i in range(0, len(urls), 10):
batch = urls[i:i+10]
htmls = await asyncio.gather(*(fetch(session, u) for u in batch))
for html in htmls:
print(parse_titles(html))
if __name__ == "__main__":
asyncio.run(main([f"https://example.com/page/{i}" for i in range(1, 51)]))
Notes
- Keep concurrency low (5-20) unless the site explicitly allows higher.
- Add retries with backoff and caching (e.g., requests-cache) if you're iterating locally.
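Since the heading also mentions httpx, here is a roughly equivalent sketch with httpx.AsyncClient (same placeholder URLs and selectors; install with pip install httpx):
import asyncio
from typing import List

import httpx
from selectolax.parser import HTMLParser

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

async def fetch_titles(urls: List[str]) -> None:
    limits = httpx.Limits(max_connections=10)  # cap concurrency
    async with httpx.AsyncClient(headers=HEADERS, limits=limits, timeout=30) as client:
        responses = await asyncio.gather(*(client.get(u) for u in urls))
    for r in responses:
        r.raise_for_status()
        print([n.text(strip=True) for n in HTMLParser(r.text).css(".post-title")])

if __name__ == "__main__":
    asyncio.run(fetch_titles([f"https://example.com/page/{i}" for i in range(1, 6)]))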
Scrapy Web Scraper in Python (for real crawlers)
pip install scrapy
scrapy startproject blogcrawler
Here's an example blogcrawler/spiders/posts.py spider that uses Scrapy:
import scrapy
class PostsSpider(scrapy.Spider):
name = "posts"
allowed_domains = ["example.com"]
start_urls = ["https://example.com/blog"]
custom_settings = {
"ROBOTSTXT_OBEY": True,
"DOWNLOAD_DELAY": 0.5,
"CONCURRENT_REQUESTS_PER_DOMAIN": 8,
"USER_AGENT": "UserBot/1.0 (+contact-url)",
"FEEDS": {"items.json": {"format": "jsonlines"}},
}
def parse(self, response):
for card in response.css(".post-card"):
yield {
"title": card.css(".post-title::text").get(default="").strip(),
"url": response.urljoin(card.css("a::attr(href)").get()),
"date": card.css("time::attr(datetime)").get(default="")
}
next_href = response.css("a.next::attr(href)").get()
if next_href:
yield response.follow(next_href, callback=self.parse)
Why Scrapy
- Built-in throttling, caching, retries, pipelines, and export formats.
- Easy to scale, schedule, and dedupe.
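For the dedupe piece, a minimal item pipeline might look like the following sketch (UrlDedupePipeline is a made-up name, and it assumes the spider yields a "url" field):
# blogcrawler/pipelines.py
from scrapy.exceptions import DropItem

class UrlDedupePipeline:
    """Drop items whose URL was already seen in this crawl."""
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen:
            raise DropItem(f"duplicate: {url}")
        self.seen.add(url)
        return item
Enable it via settings, e.g. "ITEM_PIPELINES": {"blogcrawler.pipelines.UrlDedupePipeline": 300}, then run the spider with scrapy crawl posts.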
Prefer JSON over HTML when possible
Most JS sites call JSON endpoints. Grab those directly (install requests first with pip install requests):
import requests
HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
r = requests.get("https://example.com/api/articles?limit=50", headers=HEADERS, timeout=20)
r.raise_for_status()
data = r.json()
for a in data["items"]:
print(a["title"], a["permalink"])
Requests is a third-party HTTP library that offers a simpler, more human-friendly API than Python’s built-in modules (like urllib). It isn’t part of the standard library because including it by default would bloat Python with features not everyone needs. Also, keeping it separate lets it evolve more quickly—bug fixes and new features aren’t tied to Python’s release cycle.
How to find: Open DevTools → Network → XHR/Fetch → copy request as cURL → convert to Python. Respect CORS/auth; don’t lift tokens you’re not allowed to use.
Web Scraper in Python Using Selenium
When the page must render in a real browser before the data exists, use Selenium.
Install:
pip install selenium webdriver-manager
Basic pattern (using a Chrome webdriver):
from typing import List, Dict
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def get_driver(headless: bool = True) -> webdriver.Chrome:
options = webdriver.ChromeOptions()
if headless:
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1366,768")
options.add_argument("user-agent=UserBot/1.0 (+contact-url)")
service = Service(ChromeDriverManager().install())
return webdriver.Chrome(service=service, options=options)
def scrape_dynamic(url: str) -> List[Dict[str, str]]:
driver = get_driver()
try:
driver.get(url)
WebDriverWait(driver, 20).until(
EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".post-card"))
)
cards = driver.find_elements(By.CSS_SELECTOR, ".post-card")
items = []
for c in cards:
title = c.find_element(By.CSS_SELECTOR, ".post-title").text.strip()
href = c.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
date = c.find_element(By.CSS_SELECTOR, "time").get_attribute("datetime")
items.append({"title": title, "url": href, "date": date})
return items
finally:
driver.quit()
if __name__ == "__main__":
print(scrape_dynamic("https://example.com/blog"))
Infinite scroll / “Load more” (safe pattern)
You can use a loop to load more HTML elements like so:
from selenium.webdriver.common.keys import Keys
import time
def scroll_to_load(driver, max_scrolls=10, pause=1.0):
body = driver.find_element(By.TAG_NAME, "body")
for _ in range(max_scrolls):
body.send_keys(Keys.END)
time.sleep(pause)
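If the page keeps appending content, a common variant is to scroll and stop once the document height stops growing. A sketch using execute_script (reuses the driver and the time import above):
def scroll_until_stable(driver, max_scrolls=20, pause=1.0):
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded; stop scrolling
        last_height = new_height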
Selenium caveats
- It’s slow and heavy. Use it only when you must click/scroll or when the content only exists post-render.
- If a site actively blocks automation or throws challenges, stop and find an official/API route. Don’t try to defeat protections.
Playwright Website Scraping with Python
Playwright is often the better headless choice when it comes to web scraping in Python.
Install the playwright package and its browser binaries:
pip install playwright
playwright install
The following is some example Python code to initiate a headless browser session to scrape the web:
from typing import List, Dict
from playwright.sync_api import sync_playwright
def scrape_playwright(url: str) -> List[Dict[str, str]]:
with sync_playwright() as p:
browser = p.chromium.launch(headless=True)
page = browser.new_page(user_agent="UserBot/1.0 (+contact-url)")
page.goto(url, wait_until="domcontentloaded", timeout=30000)
page.wait_for_selector(".post-card", timeout=20000)
cards = page.query_selector_all(".post-card")
items = []
        for c in cards:
            title_el = c.query_selector(".post-title")
            link_el = c.query_selector("a")
            time_el = c.query_selector("time")
            items.append({
                "title": title_el.inner_text().strip() if title_el else "",
                "url": link_el.get_attribute("href") if link_el else "",
                "date": time_el.get_attribute("datetime") if time_el else "",
            })
browser.close()
return items
if __name__ == "__main__":
print(scrape_playwright("https://example.com/blog"))
Why Playwright
- Reliable waits (wait_until="networkidle"), context isolation, tracing (sketched below).
- Generally faster and less flaky than Selenium.
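Context isolation and tracing look roughly like this (a sketch; the trace.zip path is arbitrary):
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent="UserBot/1.0 (+contact-url)")  # isolated cookies/storage
    context.tracing.start(screenshots=True, snapshots=True)
    page = context.new_page()
    page.goto("https://example.com/blog", wait_until="networkidle")
    # ... scrape here ...
    context.tracing.stop(path="trace.zip")  # inspect later with: playwright show-trace trace.zip
    browser.close()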
Handling Data Blocks While Scraping
Here are some key ideas to keep in mind while scraping websites, without playing “cat-and-mouse” games:
- Back off on 429/503. Respect rate limits and honor the Retry-After header when present (see the sketch after this list).
- Reduce concurrency, increase delays, fetch in off-peak hours.
- Identify yourself with a clear UA and contact; some sites whitelist or throttle fairly.
- Ask for permission or request an API key. Many owners will help if you explain the use case.
- If the target is Google Search or other protected/monetized endpoints, don’t scrape. Use an API.
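A polite backoff loop might look like this sketch: it honors Retry-After when the server sends one and otherwise falls back to exponential delays.
import time
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def polite_get(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        r = requests.get(url, headers=HEADERS, timeout=20)
        if r.status_code not in (429, 503):
            r.raise_for_status()
            return r
        retry_after = r.headers.get("Retry-After")
        # Honor Retry-After seconds if given, otherwise back off exponentially
        delay = float(retry_after) if retry_after and retry_after.isdigit() else 2 ** attempt
        time.sleep(delay)
    r.raise_for_status()
    return r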
Google example (do this instead of scraping SERPs):
pip install google-api-python-client
from googleapiclient.discovery import build
API_KEY = "YOUR_KEY"
CX = "YOUR_CSE_ID"
service = build("customsearch", "v1", developerKey=API_KEY)
res = service.cse().list(q="site:angular.dev signal inputs", cx=CX, num=10).execute()
for item in res.get("items", []):
print(item["title"], item["link"])
Python Data Modeling and Storage
Here are some data modeling, validation, and storage examples using pydantic. Install it with pip as well:
pip install pydantic
Define typed models (pydantic/dataclasses) so you don’t write junk:
from pydantic import BaseModel, HttpUrl
from typing import Optional
class Post(BaseModel):
title: str
url: HttpUrl
date: Optional[str] = None
Store to CSV/JSON for small jobs; SQLite/Postgres for bigger ones:
import sqlite3
def save_sqlite(rows):
conn = sqlite3.connect("items.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS posts(title TEXT, url TEXT, date TEXT)")
cur.executemany("INSERT INTO posts(title, url, date) VALUES (?, ?, ?)",
[(r["title"], r["url"], r.get("date","")) for r in rows])
conn.commit()
conn.close()
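Tying the two together, you can validate each scraped dict with the Post model above before it hits the database (a sketch that reuses Post and save_sqlite from this section; it works with pydantic v1 and v2 since it only uses the constructor):
from pydantic import ValidationError

raw_rows = [
    {"title": "Hello", "url": "https://example.com/hello", "date": "2024-01-01"},
    {"title": "Bad row", "url": "not-a-url"},
]

valid = []
for row in raw_rows:
    try:
        post = Post(**row)  # raises ValidationError on junk
        valid.append({"title": post.title, "url": str(post.url), "date": post.date or ""})
    except ValidationError:
        continue  # log and skip invalid rows

save_sqlite(valid)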
Robust Data Patterns in Python
- Retry with jitter/backoff (e.g., tenacity):
pip install tenacity
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}
@retry(
reraise=True,
stop=stop_after_attempt(5),
wait=wait_exponential(multiplier=0.5, min=1, max=30),
retry=retry_if_exception_type((requests.HTTPError, requests.ConnectionError, requests.Timeout)),
)
def get_json(url: str) -> dict:
r = requests.get(url, headers=HEADERS, timeout=15)
r.raise_for_status()
return r.json()
- Deduplicate by URL hash; keep pipelines idempotent (a sketch follows this list).
- Caching (requests-cache) during development to avoid hammering targets.
- Structured logs and metrics (success/skip/fail counts).
- Selectors resilient to markup drift: prefer IDs/data-attributes, not nth-child chains.
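Deduplication by URL hash can be as simple as a seen-set keyed on a normalized URL digest (a minimal in-memory sketch; swap the set for a database table in a real pipeline):
import hashlib

seen = set()

def url_key(url: str) -> str:
    # Normalize lightly (strip fragment and trailing slash) before hashing
    base = url.split("#", 1)[0].rstrip("/")
    return hashlib.sha256(base.encode("utf-8")).hexdigest()

def is_new(url: str) -> bool:
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True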
Common Scraping Patterns
Pagination
- “Next” link → follow until absent.
- Offset/limit query params → increase until empty page.
- Cursor tokens (GraphQL/REST) → follow nextCursor (a loop is sketched below).
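A cursor loop looks roughly like this (a sketch; the /api/articles endpoint and the items/nextCursor fields are hypothetical):
import requests

HEADERS = {"User-Agent": "UserBot/1.0 (+contact-url)"}

def fetch_all_pages(base_url: str) -> list:
    items, cursor = [], None
    while True:
        params = {"limit": 50}
        if cursor:
            params["cursor"] = cursor
        r = requests.get(base_url, headers=HEADERS, params=params, timeout=20)
        r.raise_for_status()
        data = r.json()
        items.extend(data.get("items", []))
        cursor = data.get("nextCursor")
        if not cursor:
            break
    return items

items = fetch_all_pages("https://example.com/api/articles")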
Detail pages
- First collect listing URLs; then fetch detail pages in small batches.
File downloads
- Stream with requests.get(..., stream=True), write chunks, verify size/hash (a sketch follows).
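Streaming keeps memory flat for large files. A sketch (the expected_sha256 argument assumes you have a known-good checksum to verify against):
import hashlib
from typing import Optional

import requests

def download_file(url: str, dest: str, expected_sha256: Optional[str] = None) -> str:
    sha = hashlib.sha256()
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                if chunk:
                    f.write(chunk)
                    sha.update(chunk)
    digest = sha.hexdigest()
    if expected_sha256 and digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}")
    return digest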
Localization/AB variants
- Sites may vary markup by geo/AB test. Build fallbacks and capture unknowns for review.
Web Scraping Tests & CI
- Unit-test parsers with saved fixtures (HTML snapshots) to detect breakage (example after this list).
- Run small CI checks nightly with no network (use fixtures) and a weekly small live probe against a whitelisted sandbox.
- Smoke test: fail if 0 items parsed or if the DOM contract changes.
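A fixture test can be as small as this sketch (using pytest; tests/fixtures/listing.html is a saved snapshot and myscraper is a hypothetical module holding the parse_items parser from earlier):
from pathlib import Path

from myscraper import parse_items  # hypothetical module containing the parser

def test_parse_items_from_fixture():
    html = Path("tests/fixtures/listing.html").read_text(encoding="utf-8")
    items = parse_items(html)
    assert len(items) > 0  # smoke test: the DOM contract still holds
    assert all(item["title"] and item["url"] for item in items)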
Putting All the Python Web Scraping Together
# 1) Discover API via DevTools. If JSON exists, use that. Else:
html = fetch_html("https://example.com/blog")
listing = parse_items(html)
# 2) Batch detail fetches politely (requests or aiohttp)
# 3) Validate with pydantic, dedupe, persist
# 4) Schedule with cron/GitHub Actions; log metrics
Quick Python Web Scraping Decision Tree
- Is there an official API? Use it.
- Is content static or backed by JSON endpoints? Use Requests to the JSON.
- Is content only visible post-render + interactions? Use Playwright (or Selenium).
- Is it a large, rule-driven crawl? Use Scrapy.
- Is it protected / presents challenges (CAPTCHA, JS checks, terms forbid)? Stop. Re-route.
When Google Blocks BS4/Requests
If you attempt to scrape Google Search or similar protected properties, it is not worth the trouble on ethical or practical grounds; don't fight it. Use the Google Custom Search JSON API or a compliant provider. Selenium can render the pages, but using it to extract SERPs still violates Google's terms and risks breakage and account penalties, and it is slower and costlier than doing the right thing with an API.
Conclusion
- Start simple: Requests + BeautifulSoup (or lxml) for static pages; prefer JSON endpoints you find in DevTools over parsing HTML.
- Scale responsibly: for many pages use httpx/aiohttp; for rule-driven crawls and pipelines use Scrapy.
- Bring a browser only when necessary: Playwright (often best) or Selenium for JS-rendered flows, waits, and interactions.
- Be ethical and durable: respect robots.txt/ToS, throttle, add retries with exponential backoff, cache during development, and log/validate results with typed models.
- Productionize: write resilient selectors, deduplicate by URL, store to SQLite/Postgres, and test parsers with fixtures so markup changes don’t silently break your jobs.
- When a site is protected (e.g., Google Search), don’t play cat-and-mouse—use official APIs or obtain permission.