Collecting LLM Benchmark Data at Scale
Existing benchmarks (MMLU, HumanEval, HELM) age fast. Fresh benchmark data requires ongoing scraping from source communities. Here is how to build an eval dataset pipeline that feeds Hugging Face Datasets — with mobile proxies for the parts where vanilla requests would get blocked.
Benchmark contamination is the quiet crisis in LLM evaluation. If a benchmark's test items leaked into a model's training data — which is statistically near-certain for anything published before the training cutoff — scores become meaningless. Fresh, post-cutoff evaluation data is the only reliable signal. That means scraping.
1. Known Benchmarks and Their Limits
| Benchmark | Scope | Limit |
|---|---|---|
| MMLU | 57 subjects, multi-task language understanding (Hendrycks et al., 2021) | Largely static since 2021; saturated by frontier models |
| HumanEval | Python code-generation (OpenAI, 2021) | 164 hand-curated tasks; saturation and contamination |
| HELM | Stanford holistic eval framework across many scenarios | Framework rather than a single fresh dataset |
| BIG-Bench | 204+ community-contributed tasks | Published to GitHub; in training data by now |
| MMLU-Pro | Harder MMLU refresh (TIGER-Lab) | Newer but still finite and public |
The pattern: any public benchmark eventually gets absorbed into training corpora. Fresh, privately-held, or post-cutoff data is the only durable eval.
2. Building Fresh Eval Data
Four reliable sources produce useful, continuously updated eval material:
- Stack Overflow questions. Scrape accepted-answer pairs posted after a target cutoff. Excellent code-eval material: real developer questions with human-accepted solutions.
- Reddit /r/AskScience and /r/explainlikeimfive. Open-ended Q&A with expert-flaired answers. Good for factuality and explanation-quality evals.
- ArXiv abstracts. Fresh scientific summarization material. Pull the full paper, use the author-written abstract as the ground-truth summary, and evaluate model-generated summaries against it.
- Wikipedia revisions. Use the revision API to find articles edited after the model's training cutoff. By construction, those edits contain post-cutoff factual material.
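The Wikipedia approach can be sketched as follows. `post_cutoff_revisions` is a hypothetical helper name, and the stub response stands in for an actual call to the MediaWiki revisions API (`action=query&prop=revisions`); only the timestamp filtering is shown.

```python
from datetime import datetime, timezone

# Assumed cutoff for the model under evaluation (illustrative date).
CUTOFF = datetime(2023, 12, 1, tzinfo=timezone.utc)

def post_cutoff_revisions(revisions, cutoff=CUTOFF):
    """Keep revisions whose timestamp is strictly after the cutoff.

    `revisions` is a list of dicts shaped like the MediaWiki API's
    prop=revisions items, i.e. with an ISO-8601 "timestamp" field.
    """
    kept = []
    for rev in revisions:
        ts = datetime.fromisoformat(rev["timestamp"].replace("Z", "+00:00"))
        if ts > cutoff:
            kept.append(rev)
    return kept

# A real pipeline would fetch this from en.wikipedia.org/w/api.php;
# a stub response is used here for clarity.
sample = [
    {"revid": 1, "timestamp": "2023-11-30T12:00:00Z"},
    {"revid": 2, "timestamp": "2024-02-01T09:30:00Z"},
]
print([r["revid"] for r in post_cutoff_revisions(sample)])  # → [2]
```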
3. Python Pipeline: Stack Overflow Q&A Pairs
Scrape the HTML of a question page, pull the body and the accepted answer, and write a row to your eval dataset. Use a mobile proxy — Stack Overflow rate-limits aggressively and their Cloudflare policy flags datacenter IPs quickly.
```python
import requests
from bs4 import BeautifulSoup

PROXY = "http://USER:PASS@hostname:http_port"
proxies = {"http": PROXY, "https": PROXY}

def scrape_so_question(qid, proxies):
    r = requests.get(
        f"https://stackoverflow.com/questions/{qid}",
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    soup = BeautifulSoup(r.text, "html.parser")
    question = soup.select_one(".question .js-post-body")
    accepted = soup.select_one(".accepted-answer .js-post-body")
    return {
        "qid": qid,
        "question": question.get_text(strip=True) if question else None,
        "answer": accepted.get_text(strip=True) if accepted else None,
    }

# Example
row = scrape_so_question(77000000, proxies)
print(row["question"][:200] if row["question"] else "no body")
```

For larger runs, walk the Stack Exchange API (api.stackexchange.com) for question IDs in a date range, then hydrate the HTML for bodies (the API strips some formatting). Respect the API's quota and attribution requirements.
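That API walk can be sketched as below, assuming the v2.3 standard response wrapper (`items`, `has_more`, optional `backoff`) and the `/questions` endpoint's `fromdate`/`todate` parameters; `build_query` is an illustrative helper, not part of any library.

```python
import time

import requests

API = "https://api.stackexchange.com/2.3/questions"

def build_query(from_ts, to_ts, page=1):
    """Query params for questions created in [from_ts, to_ts] (Unix epoch seconds)."""
    return {
        "site": "stackoverflow",
        "fromdate": from_ts,
        "todate": to_ts,
        "sort": "creation",
        "order": "asc",
        "pagesize": 100,
        "page": page,
    }

def question_ids_in_range(from_ts, to_ts, max_pages=5, proxies=None):
    """Page through the API, collecting question IDs until has_more is false."""
    ids = []
    for page in range(1, max_pages + 1):
        data = requests.get(
            API, params=build_query(from_ts, to_ts, page),
            proxies=proxies, timeout=30,
        ).json()
        ids += [q["question_id"] for q in data.get("items", [])]
        if not data.get("has_more"):
            break
        time.sleep(data.get("backoff", 0))  # honor the API's backoff field
    return ids
```

Feed the returned IDs into `scrape_so_question` to hydrate full bodies.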
4. Hugging Face Datasets Integration
Once rows are collected, convert them to a datasets.Dataset and push to the Hub (private repo recommended for eval material — you do not want your test set contaminating future models).
```python
import os

from datasets import Dataset

rows = [scrape_so_question(qid, proxies) for qid in candidate_ids]
rows = [r for r in rows if r["question"] and r["answer"]]

ds = Dataset.from_list(rows)
ds.push_to_hub(
    "my-org/fresh-code-eval",
    private=True,  # keep eval sets private to avoid training-set contamination
    token=os.environ["HF_TOKEN"],
)
```

The datasets library stores Parquet under the hood, supports streaming, and is directly consumable by every popular eval harness (lm-evaluation-harness, OpenAI Evals, Inspect AI).
5. Quality Filters
Raw scraped rows are not eval-ready. Apply at minimum:
- Min/max length. Drop one-liner questions and multi-thousand-line essays. Aim for 80–2000 tokens in the question, 40–1500 in the answer.
- Language match. Run fastText's lid.176 model and keep only your target language.
- Code-block hygiene. For code evals, make sure the accepted answer actually contains code (a `<pre><code>` block) and not just prose.
- Date filter. Keep only posts after the target model's training cutoff.
- PII scrub. Regex for emails, phone numbers, and API keys; Presidio for named-entity-based PII removal.
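A minimal sketch of the length filter and a regex-only PII pass, approximating token counts with whitespace splitting; `passes_filters` and `scrub_pii` are illustrative names, and a real pipeline should use a proper tokenizer and Presidio rather than these two regexes.

```python
import re

# Deliberately simple patterns: emails and phone-like digit runs only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def passes_filters(row, min_q=80, max_q=2000, min_a=40, max_a=1500):
    """Length filter: whitespace tokens as a cheap proxy for real tokens."""
    q = (row.get("question") or "").split()
    a = (row.get("answer") or "").split()
    return min_q <= len(q) <= max_q and min_a <= len(a) <= max_a

def scrub_pii(text):
    """Replace emails and phone numbers with placeholder tags."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

print(scrub_pii("mail me at dev@example.com"))  # → mail me at [EMAIL]
```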
6. The Evaluation Loop
End-to-end: prompt → model → compare with ground truth → score.
| Task | Scoring approach |
|---|---|
| Code | Execute model code against unit tests (HumanEval-style pass@k) |
| Factual QA | Exact match, F1 on key entities, LLM-as-judge |
| Summarization | ROUGE-L, BERTScore, LLM-as-judge rubric |
| Explanation quality | LLM-as-judge with fixed rubric; human spot-check 5% |
Track scores by collection date. If your aggregate score on newly-scraped items drops by 10+ points vs. pre-cutoff items, you've likely detected contamination in the older data — which is exactly why fresh scraping matters.
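For the factual-QA row of the table, exact match and token-level F1 can be sketched as below, using SQuAD-style lowercase/punctuation-stripping normalization; the function names are illustrative, not from any harness.

```python
import re
from collections import Counter

def normalize(s):
    """Lowercase, strip punctuation, split on whitespace."""
    return re.sub(r"[^a-z0-9 ]", " ", s.lower()).split()

def exact_match(pred, gold):
    """1.0 iff normalized token sequences are identical."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall (order-insensitive)."""
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(g.values())
    return 2 * prec * rec / (prec + rec)

print(token_f1("Paris is the capital", "the capital is Paris"))  # → 1.0
print(exact_match("Paris is the capital", "the capital is Paris"))  # → 0.0
```

Averaging these per-item scores, bucketed by collection date, gives the pre- vs. post-cutoff comparison described above.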