Scraping LLM Training Data at Scale
Training and fine-tuning LLMs requires massive text corpora. Common Crawl covers the public web but has quality issues. Custom scraping fills the gaps — here is how to build a pipeline that actually produces training-grade data.
Every frontier LLM is trained on a heterogeneous mix: Common Crawl snapshots, Wikipedia dumps, code repositories, books, and licensed corpora. A single Common Crawl snapshot contains roughly 3 billion pages — but it is not a drop-in training set. Teams at OpenAI, Anthropic, Meta AI, and Hugging Face spend significant engineering effort filtering, deduplicating, and supplementing it. This guide shows the concrete steps, with working Python and the tools the open-source community has standardized on.
1. Common Crawl: Strengths and Weaknesses
Common Crawl is a non-profit that publishes monthly crawls of the public web in WARC, WAT, and WET formats. Datasets like C4 (Google), RefinedWeb (TII), and FineWeb (Hugging Face) all start from it. Knowing what Common Crawl gives you — and what it doesn't — determines what you must scrape yourself.
Strengths
- ~3B pages per crawl snapshot
- Free, S3-hosted, permissively licensed
- Broad multilingual coverage (>100 languages)
- WARC preserves full HTTP response bodies
- WET extracts deliver plaintext directly
Weaknesses
- 3–6 month lag behind the live web
- Boilerplate-heavy (headers, nav, ads)
- Massive duplicate content across domains
- JavaScript-rendered sites often empty
- robots.txt exclusions reduce coverage
- Paywalled content mostly inaccessible
The Hugging Face FineWeb paper reports that only ~5–10% of Common Crawl bytes survive a production-grade quality filter. That number tells you how much work happens downstream.
2. What to Build Yourself
Custom scraping fills the gaps Common Crawl leaves behind. The four common categories:
- Domain-specific corpora. Legal opinions (CourtListener), medical literature (PubMed Central OA), technical docs (ReadTheDocs, MDN), financial filings (SEC EDGAR). Higher density, less boilerplate.
- Fresh content. News, forums, social. Anything less than 3 months old will not be in Common Crawl yet. Critical for retrieval-augmented and continual-learning setups.
- JavaScript-heavy sites. Common Crawl uses a non-rendering fetcher. SPAs (React, Vue, Next.js client pages) appear almost empty. You need a headless browser (Playwright, Puppeteer) to collect these.
- Publicly accessible but paywall-adjacent content. Many sites return different HTML to bots vs. real browsers. Mobile IPs + real User-Agents close that gap on content that is legally public.
3. Text Quality Filters
Raw HTML is not training data. Four filter stages separate usable text from noise.
| Stage | Tool | What it removes |
|---|---|---|
| Boilerplate | trafilatura, newspaper3k, readability-lxml | Nav, footer, ads, comments, cookie banners |
| Language ID | fastText (lid.176.bin), langdetect, CLD3 | Wrong-language pages, mojibake, gibberish |
| Perplexity | KenLM trained on Wikipedia/books | Low-quality machine text, keyword stuffing |
| Dedup | MinHash LSH (datasketch), SimHash, exact-substring | Near-duplicates, scraped mirrors, repeated blocks |
trafilatura consistently wins boilerplate-removal benchmarks (see the CleanEval and trafilatura comparison papers). For deduplication, the standard approach today is MinHash LSH at 5-gram or 13-gram shingle granularity — the same technique used by FineWeb and RefinedWeb.
4. Python Pipeline with Mobile Proxy
A minimal single-URL fetch-and-clean function. Extend this with async workers, a task queue (Celery, Arq, Dramatiq), and Parquet output for production use.
import requests
import trafilatura
from hashlib import sha256
PROXY = "http://USER:PASS@hostname:http_port"
proxies = {"http": PROXY, "https": PROXY}
def scrape_and_clean(url, proxies):
r = requests.get(url, proxies=proxies, timeout=30)
# Extract main content, strip boilerplate
text = trafilatura.extract(
r.text,
include_comments=False,
include_tables=False,
)
if text and len(text) > 500:
return {
"url": url,
"content": text,
"hash": sha256(text.encode()).hexdigest(),
"length": len(text),
}
return None
# Usage
doc = scrape_and_clean("https://example.com/article", proxies)
if doc:
print(f"Collected {doc['length']} chars from {doc['url']}")The 500-character minimum filters out stub pages and redirects. The SHA-256 gives you a cheap exact-match dedup key before spending CPU on MinHash comparisons.
5. Scale Considerations
- Throughput. 10–100 requests per second requires distributed workers — the Python GIL is not your friend. Use asyncio + aiohttp for I/O-bound fetching, or spread across many processes with a shared Redis-backed queue.
- Mobile proxy rotation. Rotate IP per worker, or per batch of N requests. Carrier-grade NAT already shares one IP across thousands of real users, so you blend into real traffic patterns. See IP rotation best practices for intervals.
- Storage format. Parquet with Zstd compression is standard for training pipelines. Hugging Face `datasets` loads Parquet natively. Columnar beats JSONL for any serious corpus size.
- Retry strategy. Exponential backoff with full jitter, respect Retry-After headers, budget per-domain. Details in our backoff guide.
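The throughput and retry points combine naturally in one async fetch helper. A sketch under stated assumptions: `session` is an `aiohttp.ClientSession` created by the caller, and the backoff helper itself is pure stdlib.

```python
import asyncio
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

async def fetch_with_retry(session, url, retries=4):
    """Fetch one URL with retries; `session` is an aiohttp.ClientSession."""
    for attempt in range(retries):
        try:
            async with session.get(url) as resp:
                # Respect the server's own pacing hint when present
                if resp.status == 429 and "Retry-After" in resp.headers:
                    await asyncio.sleep(float(resp.headers["Retry-After"]))
                    continue
                resp.raise_for_status()
                return await resp.text()
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(backoff_delay(attempt))
    return None
```

Full jitter (random delay in the whole window, not a fixed doubling) is what prevents a fleet of workers from retrying in lockstep and hammering the same domain simultaneously.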
6. Ethical and Legal Considerations
Scraping for LLM training sits in a live legal debate. The practical baseline:
- robots.txt. Most major labs (OpenAI via GPTBot, Anthropic via ClaudeBot, Google via Google-Extended) now publish dedicated user-agents and honor robots.txt directives. For your own crawler, do the same — it is the documented norm in RFC 9309.
- Copyright. Authors Guild v. OpenAI, The New York Times v. OpenAI/Microsoft, and several class actions are pending. Courts have not yet resolved whether training on copyrighted text is fair use. See our legal overview.
- Personal data. GDPR Article 4 defines personal data broadly; the EU AI Act treats training data governance explicitly. If your corpus contains names, emails, or identifying context about EU residents, you are a data controller.
- Terms of Service. Separate from copyright. A site's ToS may prohibit scraping even when the content is lawful to access.
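The robots.txt check is easy to automate with the stdlib `urllib.robotparser`. A sketch; the `rules` string is a hypothetical robots.txt mirroring the per-agent opt-outs described above.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Evaluate an already-fetched robots.txt body for one agent and URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt: block one AI crawler entirely, fence off one path for everyone
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""
```

With these rules, `allowed(rules, "GPTBot", "https://example.com/post")` is `False`, while a generic crawler is barred only from `/private/`. Fetch each domain's robots.txt once, cache it, and run this check before every request.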