Scraping LLM Training Data at Scale
Training and fine-tuning LLMs requires massive text corpora. Common Crawl covers the public web but has quality issues. Custom scraping fills the gaps — here is how to build a pipeline that actually produces training-grade data.
Every frontier LLM is trained on a heterogeneous mix: Common Crawl snapshots, Wikipedia dumps, code repositories, books, and licensed corpora. A single Common Crawl snapshot contains roughly 3 billion pages — but it is not a drop-in training set. Teams at OpenAI, Anthropic, Meta AI, and Hugging Face spend significant engineering effort filtering, deduplicating, and supplementing it. This guide shows the concrete steps, with working Python and the tools the open-source community has standardized on.
1. Common Crawl: Strengths and Weaknesses
Common Crawl is a non-profit that publishes monthly crawls of the public web in WARC, WAT, and WET formats. Datasets like C4 (Google), RefinedWeb (TII), and FineWeb (Hugging Face) all start from it. Knowing what Common Crawl gives you — and what it doesn't — determines what you must scrape yourself.
Strengths
- ~3B pages per crawl snapshot
- Free, S3-hosted, permissively licensed
- Broad multilingual coverage (>100 languages)
- WARC preserves full HTTP response bodies
- WET extracts deliver plaintext directly
Weaknesses
- 3–6 month lag behind the live web
- Boilerplate-heavy (headers, nav, ads)
- Massive duplicate content across domains
- JavaScript-rendered sites often empty
- robots.txt exclusions reduce coverage
- Paywalled content mostly inaccessible
The Hugging Face FineWeb paper reports that only ~5–10% of Common Crawl bytes survive a production-grade quality filter. That number tells you how much work happens downstream.
2. What to Build Yourself
Custom scraping fills the gaps Common Crawl leaves behind. The four common categories:
- Domain-specific corpora. Legal opinions (CourtListener), medical literature (PubMed Central OA), technical docs (ReadTheDocs, MDN), financial filings (SEC EDGAR). Higher density, less boilerplate.
- Fresh content. News, forums, social. Anything less than 3 months old will not be in Common Crawl yet. Critical for retrieval-augmented and continual-learning setups.
- JavaScript-heavy sites. Common Crawl uses a non-rendering fetcher. SPAs (React, Vue, Next.js client pages) appear almost empty. You need a headless browser (Playwright, Puppeteer) to collect these.
- Publicly accessible but paywall-adjacent content. Many sites return different HTML to bots vs. real browsers. Mobile IPs + real User-Agents close that gap on content that is legally public.
3. Text Quality Filters
Raw HTML is not training data. Four filter stages separate usable text from noise.
| Stage | Tool | What it removes |
|---|---|---|
| Boilerplate | trafilatura, newspaper3k, readability-lxml | Nav, footer, ads, comments, cookie banners |
| Language ID | fastText (lid.176.bin), langdetect, CLD3 | Wrong-language pages, mojibake, gibberish |
| Perplexity | KenLM trained on Wikipedia/books | Low-quality machine text, keyword stuffing |
| Dedup | MinHash LSH (datasketch), SimHash, exact-substring | Near-duplicates, scraped mirrors, repeated blocks |
trafilatura consistently wins boilerplate-removal benchmarks (see the CleanEval and trafilatura comparison papers). For deduplication, the standard approach today is MinHash LSH at 5-gram or 13-gram shingle granularity — the same technique used by FineWeb and RefinedWeb.
4. Python Pipeline with Mobile Proxy
A minimal single-URL fetch-and-clean function. Extend this with async workers, a task queue (Celery, Arq, Dramatiq), and Parquet output for production use.
import requests
import trafilatura
from hashlib import sha256
PROXY = "http://USER:PASS@hostname:http_port"
proxies = {"http": PROXY, "https": PROXY}
def scrape_and_clean(url, proxies):
r = requests.get(url, proxies=proxies, timeout=30)
# Extract main content, strip boilerplate
text = trafilatura.extract(
r.text,
include_comments=False,
include_tables=False,
)
if text and len(text) > 500:
return {
"url": url,
"content": text,
"hash": sha256(text.encode()).hexdigest(),
"length": len(text),
}
return None
# Usage
doc = scrape_and_clean("https://example.com/article", proxies)
if doc:
print(f"Collected {doc['length']} chars from {doc['url']}")The 500-character minimum filters out stub pages and redirects. The SHA-256 gives you a cheap exact-match dedup key before spending CPU on MinHash comparisons.
5. Scale Considerations
- Throughput. 10–100 requests per second requires distributed workers — the Python GIL is not your friend. Use asyncio + aiohttp for I/O-bound fetching, or spread across many processes with a shared Redis-backed queue.
- Mobile proxy rotation. Rotate IP per worker, or per batch of N requests. Carrier-grade NAT already shares one IP across thousands of real users, so you blend into real traffic patterns. See IP rotation best practices for intervals.
- Storage format. Parquet with Zstd compression is standard for training pipelines. Hugging Face `datasets` loads Parquet natively. Columnar beats JSONL for any serious corpus size.
- Retry strategy. Exponential backoff with full jitter, respect Retry-After headers, budget per-domain. Details in our backoff guide.
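The throughput and retry points combine naturally in one async fetch helper. A sketch under stated assumptions: `session` is an `aiohttp.ClientSession` created by the caller, and the backoff helper itself is pure stdlib.

```python
import asyncio
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

async def fetch_with_retry(session, url, retries=4):
    """Fetch one URL with retries; `session` is an aiohttp.ClientSession."""
    for attempt in range(retries):
        try:
            async with session.get(url) as resp:
                # Respect the server's own pacing hint when present
                if resp.status == 429 and "Retry-After" in resp.headers:
                    await asyncio.sleep(float(resp.headers["Retry-After"]))
                    continue
                resp.raise_for_status()
                return await resp.text()
        except Exception:
            if attempt == retries - 1:
                raise
            await asyncio.sleep(backoff_delay(attempt))
    return None
```

Full jitter (random delay in the whole window, not a fixed doubling) is what prevents a fleet of workers from retrying in lockstep and hammering the same domain simultaneously.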
6. Ethical and Legal Considerations
Scraping for LLM training sits in a live legal debate. The practical baseline:
- robots.txt. Most major labs (OpenAI via GPTBot, Anthropic via ClaudeBot, Google via Google-Extended) now publish dedicated user-agents and honor robots.txt directives. For your own crawler, do the same — it is the documented norm in RFC 9309.
- Copyright. Authors Guild v. OpenAI, The New York Times v. OpenAI/Microsoft, and several class actions are pending. Courts have not yet resolved whether training on copyrighted text is fair use. See our legal overview.
- Personal data. GDPR Article 4 defines personal data broadly; the EU AI Act treats training data governance explicitly. If your corpus contains names, emails, or identifying context about EU residents, you are a data controller.
- Terms of Service. Separate from copyright. A site's ToS may prohibit scraping even when the content is lawful to access.
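The robots.txt check is easy to automate with the stdlib `urllib.robotparser`. A sketch; the `rules` string is a hypothetical robots.txt mirroring the per-agent opt-outs described above.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Evaluate an already-fetched robots.txt body for one agent and URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt: block one AI crawler entirely, fence off one path for everyone
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""
```

With these rules, `allowed(rules, "GPTBot", "https://example.com/post")` is `False`, while a generic crawler is barred only from `/private/`. Fetch each domain's robots.txt once, cache it, and run this check before every request.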