Web Scraper Pipeline

Pipeline Overview

Problem

Public stats sites rate-limit requests and change their markup. I needed a resilient scraper with schema-first exports so that my downstream analysis doesn't break when a site changes.

Approach

  • Fetching: headless Selenium for JS-rendered content; Requests for simple static pages.
  • Anti-fragility: user-agent rotation, polite delays, and selectors guarded by try/except.
  • Schema: column validation with dtype coercion; null handling; idempotent writes.
  • Storage: CSV + SQLite snapshot each run; dedupe on primary keys.
  • Automation: CLI entrypoint; can run via GitHub Actions or cron.
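
The schema and storage steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `SCHEMA` mapping, the column names, the `stats` table, and the helpers `coerce_schema` and `upsert` are all hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical schema and key; the real column names are not given in this write-up.
SCHEMA = {"player_id": "Int64", "player": "string", "goals": "Int64"}
PRIMARY_KEY = "player_id"


def coerce_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Check required columns, coerce dtypes, and dedupe on the primary key."""
    missing = sorted(set(SCHEMA) - set(df.columns))
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = df[list(SCHEMA)].copy()  # extra columns are dropped
    for col, dtype in SCHEMA.items():
        if dtype == "Int64":
            # Values that cannot be parsed as numbers become <NA>.
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
        else:
            out[col] = out[col].astype(dtype)
    # A row without a key cannot be deduplicated, so drop it.
    out = out.dropna(subset=[PRIMARY_KEY])
    return out.drop_duplicates(subset=[PRIMARY_KEY], keep="last")


def upsert(conn: sqlite3.Connection, df: pd.DataFrame) -> None:
    """Write rows so that re-running with the same data leaves the table unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stats ("
        "player_id INTEGER PRIMARY KEY, player TEXT, goals INTEGER)"
    )
    # Convert pandas/NumPy scalars to plain Python values that sqlite3 can bind.
    rows = [
        tuple(None if pd.isna(v) else (v.item() if hasattr(v, "item") else v) for v in row)
        for row in df.itertuples(index=False, name=None)
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT OR REPLACE INTO stats VALUES (?, ?, ?)", rows)
```

Because the table declares a primary key and the insert uses `INSERT OR REPLACE`, running the same extract twice produces the same table, which is one way to get the idempotent writes described above.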

Example Code


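import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def fetch_table(url, render_wait=1.2):
    """Load a page in headless Chrome and return its table body rows as a raw DataFrame."""
    opts = Options()
    opts.add_argument("--headless=new")
    with webdriver.Chrome(options=opts) as d:  # the browser quits when the block exits
        d.get(url)
        time.sleep(render_wait)  # give client-side rendering time to finish
        # Read cell by cell; splitting row text on whitespace would break multi-word values.
        rows = [
            [cell.text.strip() for cell in row.find_elements(By.CSS_SELECTOR, "th, td")]
            for row in d.find_elements(By.CSS_SELECTOR, "table tbody tr")
        ]
    return pd.DataFrame(rows)  # clean + coerce schema downstream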
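
The anti-fragility measures listed under Approach (user-agent rotation, polite delays, guarded selectors) might look something like the sketch below. It uses only the standard library and is independent of any particular driver. The user-agent strings are placeholders, and `pick_user_agent`, `polite_delay`, and `first_match` are hypothetical helper names, not taken from the project.

```python
import random
import time
from typing import Callable, Optional, Sequence

# Placeholder user-agent strings; substitute real, current ones.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
]


def pick_user_agent(rng: Optional[random.Random] = None) -> str:
    """Return a randomly chosen user-agent string for the next session."""
    return (rng or random).choice(USER_AGENTS)


def polite_delay(base: float = 1.0, jitter: float = 0.5,
                 sleep: Callable[[float], object] = time.sleep) -> float:
    """Sleep for `base` seconds plus random jitter; return the seconds waited."""
    seconds = base + random.uniform(0, jitter)
    sleep(seconds)
    return seconds


def first_match(find: Callable[[str], list], selectors: Sequence[str]) -> list:
    """Try selectors in order; one that raises or matches nothing falls through."""
    for selector in selectors:
        try:
            found = find(selector)
        except Exception:  # markup changed or the driver raised; try the next fallback
            continue
        if found:
            return found
    return []
```

With Selenium, `find` could be something like `lambda s: d.find_elements(By.CSS_SELECTOR, s)`, with a site-specific selector listed first and a generic `"table tbody tr"` as the fallback. The chosen user agent could be passed to Chrome with `opts.add_argument(f"--user-agent={pick_user_agent()}")`.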

Full repo link to be added here once the repository is public.

Result

Repeatable extracts with consistent schemas now feed my lacrosse and NFL analyses. Targets are easy to swap by changing the selectors and the schema config.
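
One way the target swap and CLI entrypoint could be wired together is sketched below, assuming each target is described by a small config record. The `Target` class, the `TARGETS` table, the example.com URLs, the selectors, and the column names are all placeholders for illustration, not the project's real configuration.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Everything that differs between sites: URL, row selector, expected columns."""
    url: str
    row_selector: str
    columns: tuple


# Hypothetical targets; URLs, selectors, and columns are placeholders.
TARGETS = {
    "lacrosse": Target(
        "https://example.com/lacrosse/stats",
        "table#stats tbody tr",
        ("player", "goals", "assists"),
    ),
    "nfl": Target(
        "https://example.com/nfl/stats",
        "table.stats tbody tr",
        ("player", "yards", "touchdowns"),
    ),
}


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; `argv=None` means read from sys.argv."""
    parser = argparse.ArgumentParser(description="Run one scrape target.")
    parser.add_argument("target", choices=sorted(TARGETS))
    parser.add_argument("--out", default="out.csv", help="CSV output path")
    return parser.parse_args(argv)


def main(argv=None) -> Target:
    """Look up the requested target; the real fetch and export would go here."""
    args = parse_args(argv)
    target = TARGETS[args.target]
    print(f"would scrape {target.url} -> {args.out}")
    return target
```

Calling `main()` under an `if __name__ == "__main__":` guard would turn this into a script that cron or a GitHub Actions step can invoke with a target name. Adding a new site then means adding one `Target` entry rather than touching the scraping code.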