
Collecting LLM Benchmark Data at Scale

Existing benchmarks (MMLU, HumanEval, HELM) age fast. Fresh benchmark data requires ongoing scraping from source communities. Here is how to build an eval dataset pipeline that feeds Hugging Face Datasets — with mobile proxies for the parts where vanilla requests would get blocked.

12 min read·MMLU, HumanEval, HELM, BIG-Bench, Hugging Face Datasets·Last updated: April 2026

Benchmark contamination is the quiet crisis in LLM evaluation. If a benchmark's test items leaked into a model's training data — which is statistically near-certain for anything published before the training cutoff — scores become meaningless. Fresh, post-cutoff evaluation data is the only reliable signal. That means scraping.

1. Known Benchmarks and Their Limits

| Benchmark | Scope | Limit |
|---|---|---|
| MMLU | 57 subjects, multi-task language understanding (Hendrycks et al., 2021) | Largely static since 2021; saturated by frontier models |
| HumanEval | Python code generation (OpenAI, 2021) | 164 hand-curated tasks; saturation and contamination |
| HELM | Stanford holistic eval framework across many scenarios | Framework rather than a single fresh dataset |
| BIG-Bench | 204+ community-contributed tasks | Published to GitHub; in training data by now |
| MMLU-Pro | Harder MMLU refresh (TIGER-Lab) | Newer but still finite and public |

The pattern is consistent: any public benchmark is eventually absorbed into training corpora. Fresh, privately held, or post-cutoff data is the only durable evaluation signal.
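Absorption can be screened for directly. A minimal sketch of an n-gram overlap probe, assuming you can sample shards of the suspected training corpus; the 13-gram window follows the convention the GPT-3 authors used for training-set dedup, and any flagging threshold is yours to tune:

```python
def ngrams(text, n=13):
    """Set of whitespace-token n-grams, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(item, corpus_sample, n=13):
    """Fraction of the eval item's n-grams that also appear in a
    training-corpus sample; near 1.0 suggests the item leaked."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_sample, n)) / len(item_grams)
```

Items with high overlap against any corpus shard should be dropped from the benchmark rather than scored.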

2. Building Fresh Eval Data

Four reliable sources produce useful, continuously updated eval material:

  • Stack Overflow questions. Scrape accepted-answer pairs posted after a target cutoff. Excellent code-eval material — real developer questions with human-accepted solutions.
  • Reddit /r/AskScience and /r/explainlikeimfive. Open-ended Q&A with expert-flaired answers. Good for factuality and explanation-quality evals.
  • ArXiv abstracts. Fresh scientific summarization material. Pull the full paper, use the abstract as ground-truth summary, evaluate model-generated summaries against it.
  • Wikipedia revisions. Use the revision API to find articles edited after the model's training cutoff. Those edits almost certainly encode post-cutoff factual knowledge.
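The Wikipedia route can be sketched with the MediaWiki `action=query&prop=revisions` interface. A minimal sketch, assuming the standard parameter names (`rvend` is the older bound under the default newest-first enumeration) and a placeholder cutoff date:

```python
from datetime import datetime

API = "https://en.wikipedia.org/w/api.php"

def revision_params(title, cutoff_iso):
    """Params for prop=revisions: enumerate newest-first and stop
    at the training cutoff."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|comment",
        "rvlimit": "50",
        "rvend": cutoff_iso,  # don't walk past the cutoff
        "format": "json",
    }

def is_post_cutoff(timestamp_iso, cutoff_iso):
    """True if a revision timestamp falls strictly after the cutoff."""
    parse = lambda s: datetime.fromisoformat(s.replace("Z", "+00:00"))
    return parse(timestamp_iso) > parse(cutoff_iso)

# Usage (network call left commented so the sketch is self-contained):
# import requests
# data = requests.get(API, params=revision_params(
#     "Large language model", "2025-01-01T00:00:00Z"), timeout=30).json()
```

Articles whose newest revision passes `is_post_cutoff` are candidates for fresh factual-QA items.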

3. Python Pipeline: Stack Overflow Q&A Pairs

Scrape the HTML of a question page, pull the body and the accepted answer, and write a row to your eval dataset. Use a mobile proxy — Stack Overflow rate-limits aggressively and their Cloudflare policy flags datacenter IPs quickly.

```python
import requests
from bs4 import BeautifulSoup

PROXY = "http://USER:PASS@hostname:http_port"
proxies = {"http": PROXY, "https": PROXY}

def scrape_so_question(qid, proxies):
    r = requests.get(
        f"https://stackoverflow.com/questions/{qid}",
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    soup = BeautifulSoup(r.text, "html.parser")
    question = soup.select_one(".question .js-post-body")
    accepted = soup.select_one(".accepted-answer .js-post-body")
    return {
        "qid": qid,
        "question": question.get_text(strip=True) if question else None,
        "answer": accepted.get_text(strip=True) if accepted else None,
    }

# Example
row = scrape_so_question(77000000, proxies)
print(row["question"][:200] if row["question"] else "no body")
```

For larger runs, walk the Stack Exchange API (api.stackexchange.com) for question IDs in a date range, then hydrate the HTML for bodies (the API strips some formatting). Respect the API's quota and attribution requirements.
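A minimal sketch of that walk, assuming the documented /2.3/search/advanced endpoint with its `fromdate`, `accepted`, `has_more`, and `backoff` fields; the page cap here is illustrative:

```python
import time
import requests

API = "https://api.stackexchange.com/2.3/search/advanced"

def extract_ids(payload):
    """Pull question IDs out of one API response page."""
    return [q["question_id"] for q in payload.get("items", [])]

def question_ids_since(cutoff_epoch, pagesize=100, max_pages=5):
    """Accepted-answer questions created after cutoff_epoch (Unix time),
    honoring the API's backoff field to stay within quota."""
    ids = []
    for page in range(1, max_pages + 1):
        r = requests.get(API, params={
            "site": "stackoverflow",
            "fromdate": cutoff_epoch,
            "accepted": "True",
            "sort": "creation",
            "order": "asc",
            "pagesize": pagesize,
            "page": page,
        }, timeout=30)
        data = r.json()
        ids += extract_ids(data)
        if not data.get("has_more"):
            break
        time.sleep(data.get("backoff", 0))  # API asks clients to pause
    return ids
```

The returned IDs then feed `scrape_so_question` from the block above to hydrate full bodies.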

4. Hugging Face Datasets Integration

Once rows are collected, convert them to a datasets.Dataset and push to the Hub (private repo recommended for eval material — you do not want your test set contaminating future models).

```python
import os

from datasets import Dataset

rows = [scrape_so_question(qid, proxies) for qid in candidate_ids]
rows = [r for r in rows if r["question"] and r["answer"]]

ds = Dataset.from_list(rows)
ds.push_to_hub(
    "my-org/fresh-code-eval",
    private=True,  # keep eval sets private to avoid training-set contamination
    token=os.environ["HF_TOKEN"],
)
```

The datasets library stores Parquet under the hood, supports streaming, and is directly consumable by every popular eval harness (lm-evaluation-harness, OpenAI evals, Inspect AI).

5. Quality Filters

Raw scraped rows are not eval-ready. Apply at minimum:

  • Min/max length. Drop one-liner questions and multi-thousand-line essays. Aim for 80–2000 tokens in the question, 40–1500 in the answer.
  • Language match. Run fastText's lid.176 language-ID model and keep only rows in your target language.
  • Code-block hygiene. For code evals, make sure the accepted answer actually contains code (<pre><code>) and not just prose.
  • Date filter. Keep only posts after the target model's training cutoff.
  • PII scrub. Regex for emails, phone numbers, API keys. Presidio for named-entity-based PII removal.
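A sketch of those gates combined, using whitespace word counts as a crude stand-in for tokens (the thresholds mirror the ranges above) and illustrative regexes for emails and API-key-shaped strings — examples, not an exhaustive PII pass; use Presidio for the real thing:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KEY_RE = re.compile(r"\b(?:sk-|ghp_|AKIA)[A-Za-z0-9_\-]{10,}")  # example key shapes

def scrub_pii(text):
    """Replace obvious emails and key-shaped tokens with placeholders."""
    return KEY_RE.sub("[KEY]", EMAIL_RE.sub("[EMAIL]", text))

def passes_filters(row, require_code=False):
    """Length and hygiene gates; word count approximates tokens."""
    q = (row.get("question") or "").split()
    a = (row.get("answer") or "").split()
    if not (80 <= len(q) <= 2000):
        return False
    if not (40 <= len(a) <= 1500):
        return False
    # for code evals, demand an actual code block in the raw answer HTML
    if require_code and "<pre><code>" not in row.get("answer_html", ""):
        return False
    return True
```

Rows that survive `passes_filters` still go through `scrub_pii` before being pushed to the Hub.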

6. The Evaluation Loop

End-to-end: prompt → model → compare with ground truth → score.

| Task | Scoring approach |
|---|---|
| Code | Execute model code against unit tests (HumanEval-style pass@k) |
| Factual QA | Exact match, F1 on key entities, LLM-as-judge |
| Summarization | ROUGE-L, BERTScore, LLM-as-judge rubric |
| Explanation quality | LLM-as-judge with fixed rubric; human spot-check 5% |
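The factual-QA row can be made concrete. A sketch of SQuAD-style exact match and token-level F1; normalization here is just lowercasing and punctuation stripping, and you would swap in an entity extractor for "F1 on key entities":

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation, split into tokens."""
    return re.sub(r"[^\w\s]", " ", s.lower()).split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```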

Track scores by collection date. If your aggregate score on newly scraped items drops by 10+ points versus pre-cutoff items, you have likely detected contamination in the older data — which is exactly why fresh scraping matters.
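A sketch of that cohort comparison, assuming each scored row carries a post_cutoff flag; the 10-point threshold is the heuristic from above:

```python
def contamination_gap(rows):
    """Mean score on pre-cutoff items minus mean score on post-cutoff
    items; a large positive gap suggests the older items leaked into
    training. Returns None when either cohort is empty."""
    pre = [r["score"] for r in rows if not r["post_cutoff"]]
    post = [r["score"] for r in rows if r["post_cutoff"]]
    if not pre or not post:
        return None
    return sum(pre) / len(pre) - sum(post) / len(post)

def flag_contamination(rows, threshold=10.0):
    gap = contamination_gap(rows)
    return gap is not None and gap >= threshold
```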

Fresh Eval Data Needs Reliable Scraping

Real mobile IPs that collect Stack Overflow, Reddit, and ArXiv data without blocks. Try it for $5.