Web Scraper Pipeline
Pipeline Overview
Problem
Public stats sites rate-limit requests and change their markup without notice. I needed a resilient scraper with schema-first exports so that my downstream analysis doesn't break when a page changes.
Approach
- Headless browser: Selenium for JS-rendered content; Requests for simple pages.
- Anti-fragility: user-agent rotation, polite delays, and selectors guarded by try/except.
- Schema: column validation with dtype coercion; null handling; idempotent writes.
- Storage: CSV + SQLite snapshot each run; dedupe on primary keys.
- Automation: CLI entrypoint; can run via GitHub Actions or cron.
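The anti-fragility bullet above can be sketched with three small helpers. This is a minimal illustration, not the project's actual code: the `USER_AGENTS` pool and the names `next_headers`, `polite_delay`, and `guarded` are hypothetical, and the delay values are placeholders.

```python
import itertools
import random
import time

# Hypothetical user-agent pool; real values would come from a config file.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleScraper/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleScraper/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0",
]
_ua_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers using the next user agent in the rotation."""
    return {"User-Agent": next(_ua_cycle)}


def polite_delay(base=1.0, jitter=0.5, sleep=time.sleep):
    """Sleep for base seconds plus random jitter; return the delay used."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


def guarded(extract, default=None):
    """Run a selector lookup and fall back to a default if it raises."""
    try:
        return extract()
    except Exception:  # in practice, narrow this (e.g. NoSuchElementException)
        return default
```

With a Selenium driver `d`, a guarded lookup might read `guarded(lambda: d.find_element("css selector", "h1.title").text, default="")`, where the selector is only an example. The headers from `next_headers()` would be passed to Requests via its `headers=` argument.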
Example Code
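import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def fetch_table(url):
    opts = Options()
    opts.add_argument("--headless=new")
    # The context manager quits the browser even if scraping raises.
    with webdriver.Chrome(options=opts) as d:
        d.get(url)
        time.sleep(1.2)  # fixed pause so JS-rendered rows can load
        # Read cell by cell: splitting row text on whitespace would break
        # values that contain spaces, such as player names.
        rows = [[c.text for c in r.find_elements(By.CSS_SELECTOR, "td")]
                for r in d.find_elements(By.CSS_SELECTOR, "table tbody tr")]
    return pd.DataFrame(rows)  # raw strings; schema coercion is a later step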
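The scraped frame holds raw strings, so the schema step has to name the columns and coerce their dtypes. A minimal sketch of that step follows, assuming a hypothetical `SCHEMA` config and a hypothetical `apply_schema` helper; the column names are placeholders, not the project's real schema.

```python
import pandas as pd

# Hypothetical schema config: column name -> target pandas dtype.
SCHEMA = {"player": "string", "team": "string", "games": "Int64", "yards": "Float64"}


def apply_schema(df, schema=SCHEMA):
    """Name the columns, then coerce each one to its configured dtype."""
    if df.shape[1] != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {df.shape[1]}")
    out = df.copy()
    out.columns = list(schema)
    for col, dtype in schema.items():
        if dtype in ("Int64", "Float64"):
            # Unparseable cells (e.g. "-" or "") become <NA> instead of raising.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype(dtype)
        else:
            out[col] = out[col].astype("string").str.strip()
    return out
```

Failing fast on a column-count mismatch is one way to notice a markup change before bad rows reach storage. The nullable `Int64` and `Float64` dtypes keep missing values as `<NA>` without turning integer columns into floats.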
The full repository link will be added here once it is public.
Result
The pipeline produces repeatable extracts with consistent schemas that feed my lacrosse and NFL analyses. Swapping targets is easy: change the selectors and the schema config.
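Repeatability depends on writes that can be rerun without creating duplicates. One way to get that with SQLite is to upsert on the primary key, sketched below. The table name, columns, and key are hypothetical, since the write-up does not name them, and `INSERT OR REPLACE` is one possible technique rather than the confirmed one.

```python
import sqlite3

# Hypothetical table and composite key, used only for illustration.
DDL = """
CREATE TABLE IF NOT EXISTS player_stats (
    player TEXT NOT NULL,
    season INTEGER NOT NULL,
    yards REAL,
    PRIMARY KEY (player, season)
)
"""


def upsert_rows(conn, rows):
    """Insert rows, replacing any existing row with the same primary key."""
    with conn:  # commits on success, rolls back on error
        conn.execute(DDL)
        conn.executemany(
            "INSERT OR REPLACE INTO player_stats (player, season, yards) "
            "VALUES (?, ?, ?)",
            rows,
        )
    # Return the row count so callers can confirm no duplicates were added.
    return conn.execute("SELECT COUNT(*) FROM player_stats").fetchone()[0]
```

Running the same batch twice leaves the row count unchanged, and a corrected value for an existing key overwrites the old row instead of adding a second one.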