Building a SERP Rank Tracker: Architecture Guide
Ahrefs, SEMrush, Sistrix, Serpstat — the rank-tracker industry is worth billions because the problem is harder than it looks. Here's the architecture that makes tracking millions of keywords a day tractable.
A rank tracker's job is simple to describe — for each keyword in a customer's project, query Google from the right location, parse the SERP, record the rank for the customer's domain, and raise an alert if anything moves meaningfully. At 100 keywords it's a cron job. At 10 million it's a distributed system with proxy pools, retry logic, parser versioning, and time-series storage measured in TB. The architecture stays the same; the components scale independently.
1. Core Components
| Component | Typical choice | Responsibility |
|---|---|---|
| Scheduler | Cron / Airflow | Decides which keywords are due for checking |
| Keyword queue | Redis / SQS / RabbitMQ | Work queue for scraper workers |
| Scraper workers | Python / Go, stateless | Execute queries, return raw HTML |
| Proxy pool | Mobile proxies + rotation API | Supplies IPs, rotates on request or failure |
| Parser | Versioned library | Converts HTML → structured ranks + SERP features |
| Database | PostgreSQL / TimescaleDB | Stores rank history as time series |
| Alerting | Event bus + rule engine | Notifies on rank drops, new SERP features |
| CAPTCHA fallback | 2Captcha / CapSolver | Handles challenges that slip past the proxy layer |
Each component scales horizontally on its own axis. Bottlenecks move as you grow — typically parser first (regex too slow), then database (write throughput), then proxy pool (429s under load).
2. Queue & Scheduling Design
Three scheduling dimensions matter: cadence (how often each keyword is re-checked), locality (which geo to query from), and device (desktop vs mobile SERP).
- Cadence tiering: daily for tracked keywords; weekly for long-tail; hourly for news/trending. Commercial tools expose this tiering via plan price.
- Round-robin by geo: each worker picks up jobs for the IP pool it's attached to. A US-mobile-IP worker only takes US queries.
- Spread load across the day: don't run every daily job at 00:00 — bucket jobs across 24 hours based on a hash of keyword_id. This smooths proxy pool load.
- Deduplication: if three customers track the same keyword in the same geo, query once and fan the result out to all three projects.
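The bucketing and deduplication ideas can be sketched in a few lines — a deterministic hour slot derived from a hash of `keyword_id`, and a canonical key so identical (keyword, geo, device) jobs collapse into one fetch. Function names here are illustrative, not from any particular tool:

```python
import hashlib

def schedule_bucket(keyword_id: int, buckets: int = 24) -> int:
    """Deterministic hour-of-day slot: the same keyword always lands in the
    same bucket, and the population spreads evenly across the day."""
    digest = hashlib.sha256(str(keyword_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

def dedup_key(keyword: str, gl: str, device: str) -> str:
    """One SERP fetch per (keyword, geo, device); the rank lookup fans out
    to every project tracking that combination."""
    return f"{keyword.lower().strip()}|{gl}|{device}"

# Ten keywords spread across 24 hourly slots
slots = [schedule_bucket(kid) for kid in range(10)]
```

Using a cryptographic hash (rather than Python's salted `hash()`) keeps bucket assignment stable across process restarts, so a keyword's check time doesn't jump around day to day.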
3. Proxy Rotation Strategy
Google rewards consistency within a session (NID cookie persistence) and punishes repetition across sessions (same IP + same fingerprint = throttle). The compromise: sticky per-query, rotate between queries.
- One query = one IP: per-keyword mobile IP assignment, full request cycle on that IP
- Rotate between queries: hit the rotation endpoint, pick up a fresh IP for the next keyword
- Retry on fresh IP: any 429 / 503 / sorry-redirect triggers immediate IP rotation and a short back-off (exponential, jittered)
- Budget per IP: cap each IP at ~30 queries/hour before forcing rotation, even without failures
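The budget rule reduces to a small per-IP counter that demands rotation either on a blocking status code or when the hourly cap is hit. A minimal sketch (class and method names are hypothetical):

```python
import time

class IpBudget:
    """Force rotation after a per-IP query cap, even without failures."""

    def __init__(self, max_queries_per_hour: int = 30):
        self.cap = max_queries_per_hour
        self.count = 0
        self.window_start = time.monotonic()

    def record_query(self) -> None:
        self.count += 1

    def should_rotate(self, status_code=None) -> bool:
        # Throttle/block signals mean rotate immediately, budget or not.
        if status_code in (429, 503):
            return True
        # Reset the counting window once an hour has elapsed.
        if time.monotonic() - self.window_start >= 3600:
            self.count = 0
            self.window_start = time.monotonic()
        return self.count >= self.cap

    def reset(self) -> None:
        """Call after rotating to a fresh IP."""
        self.count = 0
        self.window_start = time.monotonic()
```

In a real worker, `record_query` runs after every fetch and a `True` from `should_rotate` triggers the provider's rotation endpoint followed by `reset()`.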
Mobileproxies.org exposes rotation via API at https://buy.mobileproxies.org/ — a single HTTP call to the rotation endpoint returns a new carrier IP assigned to your port. Workers call it between queries.
4. Time-Series Storage
Rank history is effectively append-only time-series data. Three storage shapes, in ascending order of scale:
- PostgreSQL with partitioning: native, cheap, great up to ~1B rows. Partition the rank_history table by month — drop old partitions in one statement.
- TimescaleDB: a Postgres extension with hypertables, continuous aggregates, and automatic compression. The standard choice once you pass ~100M new rows/month.
- ClickHouse: if you need sub-second aggregate queries across billions of rows (SEMrush-scale), columnar storage wins.
Store both the rank and the full SERP snapshot (compressed HTML or structured JSON of all result blocks). Customers frequently ask "what did the SERP look like on the day our rank dropped?" — you want that answer.
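Snapshot storage is cheap if you compress the structured form rather than raw HTML. A sketch using stdlib zlib, assuming the parser has already reduced the SERP to a JSON list of result blocks:

```python
import json
import zlib

def pack_snapshot(serp_blocks: list) -> bytes:
    """Compress a structured SERP snapshot for a BYTEA/blob column.
    Compact JSON of repetitive result blocks compresses far better
    than the raw HTML it was parsed from."""
    raw = json.dumps(serp_blocks, separators=(",", ":")).encode()
    return zlib.compress(raw, level=9)

def unpack_snapshot(blob: bytes) -> list:
    """Round-trip back to the structured blocks for 'what did the SERP
    look like that day?' queries."""
    return json.loads(zlib.decompress(blob))

snapshot = [{"pos": i, "url": "https://example.com", "type": "organic"}
            for i in range(1, 51)]
blob = pack_snapshot(snapshot)
```

zstd would compress faster at similar ratios, but zlib keeps the sketch dependency-free.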
5. Alerting on Movement
- Rank drops: threshold alert (e.g., dropped > 5 positions day-over-day)
- Lost snippet: previously owned Featured Snippet now points elsewhere
- New SERP feature: a Local Pack, Shopping row, or Knowledge Graph appeared for a query that didn't have one
- New competitor: a domain entered the top 10 for the first time
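All four rules reduce to comparing yesterday's snapshot with today's for one (keyword, domain) pair. A minimal rule-engine sketch — the snapshot field names are hypothetical, not a fixed schema:

```python
def evaluate_alerts(prev: dict, curr: dict, drop_threshold: int = 5) -> list:
    """Compare two daily snapshots for one (keyword, domain) pair.
    Each snapshot: {"rank": int or None, "owns_snippet": bool,
                    "features": set of str, "top10": set of str}."""
    alerts = []
    # Rank drop: position number grows as rank worsens.
    if prev["rank"] is not None and curr["rank"] is not None:
        if curr["rank"] - prev["rank"] > drop_threshold:
            alerts.append(("rank_drop", prev["rank"], curr["rank"]))
    # Lost Featured Snippet.
    if prev["owns_snippet"] and not curr["owns_snippet"]:
        alerts.append(("lost_snippet",))
    # SERP features that appeared today.
    for feat in sorted(curr["features"] - prev["features"]):
        alerts.append(("new_serp_feature", feat))
    # Domains newly in the top 10.
    for domain in sorted(curr["top10"] - prev["top10"]):
        alerts.append(("new_competitor", domain))
    return alerts
```

In the architecture above, a comparison worker runs this per keyword after the daily check and pushes the resulting tuples onto the event bus.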
6. Cost Model: 100K Keywords / Day
| Line item | Scale |
|---|---|
| Queries | 100K/day × 30 days = 3M/month |
| Proxy traffic | ~300 KB HTML per query → ~900 GB/month mobile bandwidth |
| Workers | ~3-5 concurrent Python workers per 100K/day slice |
| Storage growth | 3M rows/month in rank_history; ~900 GB/month of SERP snapshots before compression |
Third-party SERP APIs charge $1-3 per 1K queries. 3M/month through a SERP API is $3K-9K/month. Equivalent in-house with mobile proxies is usually half to a third of that once bandwidth and worker infra are counted — and you own the parser, which matters when Google ships SERP feature changes you want tracked before the vendor updates.
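The table's arithmetic in one place, so you can plug in your own volumes. The per-query size and API rate are the estimates from this article, not fixed prices:

```python
def monthly_cost_estimate(keywords_per_day: int,
                          kb_per_query: int = 300,
                          api_usd_per_1k: float = 2.0) -> dict:
    """Back-of-envelope monthly numbers: query volume, proxy bandwidth,
    and what the same volume would cost through a third-party SERP API."""
    queries = keywords_per_day * 30
    bandwidth_gb = queries * kb_per_query / 1_000_000  # KB -> GB (decimal)
    api_cost_usd = queries / 1000 * api_usd_per_1k
    return {"queries": queries,
            "bandwidth_gb": round(bandwidth_gb),
            "serp_api_usd": round(api_cost_usd)}

monthly_cost_estimate(100_000)
# → {'queries': 3000000, 'bandwidth_gb': 900, 'serp_api_usd': 6000}
```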
7. Minimal Producer / Consumer Skeleton
The moving parts in one file — Redis queue, mobile proxy rotation, rank lookup, storage. Good enough to run in a single container for a few thousand keywords; structurally the same shape you'd scale up.
```python
import requests, redis, json, time, random
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

ROTATE_URL = "https://buy.mobileproxies.org/api/rotate"  # placeholder — see docs
PROXY = "http://user:pass@proxy.mobileproxies.org:8000"
proxies = {"http": PROXY, "https": PROXY}

HEADERS = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_5 like Mac OS X) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1",
    "Accept-Language": "en-US,en;q=0.9",
}

def rotate_ip():
    # Call your provider's rotation endpoint; mobileproxies exposes one per port.
    requests.post(ROTATE_URL, auth=("user", "pass"), timeout=10)

def fetch_serp(query, gl="us", hl="en"):
    url = f"https://www.google.com/search?q={quote_plus(query)}&num=50&gl={gl}&hl={hl}"
    for attempt in range(3):
        try:
            resp = requests.get(url, headers=HEADERS, proxies=proxies, timeout=20)
            if resp.status_code == 200 and "/sorry/" not in resp.url:
                return resp.text
        except requests.RequestException:
            pass
        # Any failure: fresh IP, then jittered exponential back-off.
        rotate_ip()
        time.sleep((2 ** attempt) + random.random())
    return None

def find_rank(html, target_domain):
    soup = BeautifulSoup(html, "lxml")
    for i, block in enumerate(soup.select("div.g"), start=1):
        link = block.select_one("a[href]")
        if link and target_domain in link.get("href", ""):
            return i
    return None

def worker():
    while True:
        raw = r.blpop("keyword_queue", timeout=0)
        if not raw:
            continue
        job = json.loads(raw[1])  # {keyword, domain, gl, project_id}
        html = fetch_serp(job["keyword"], gl=job["gl"])
        if html is None:
            r.rpush("dead_letter", json.dumps(job))
            continue
        rank = find_rank(html, job["domain"])
        r.rpush("rank_results", json.dumps({
            "project_id": job["project_id"],
            "keyword": job["keyword"],
            "rank": rank,
            "ts": int(time.time()),
        }))
        time.sleep(random.uniform(3, 6))
        rotate_ip()

if __name__ == "__main__":
    worker()
```
From here, a second worker drains rank_results into Postgres/TimescaleDB, and a third compares yesterday's rank with today's to emit alert events. Each worker scales independently.
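The drain worker's shape, sketched with stdlib sqlite3 standing in for Postgres/TimescaleDB so it runs anywhere — in production you would swap in a Postgres driver, the partitioned rank_history table, and a loop over BLPOP on the rank_results list:

```python
import json
import sqlite3

def drain(conn, results) -> int:
    """Batch-insert serialized rank results into rank_history.
    `results` is any iterable of JSON strings; in production it is a
    batch popped from the rank_results Redis list."""
    conn.execute("""CREATE TABLE IF NOT EXISTS rank_history
                    (project_id INTEGER, keyword TEXT, rank INTEGER, ts INTEGER)""")
    rows = [(rec["project_id"], rec["keyword"], rec["rank"], rec["ts"])
            for rec in (json.loads(x) for x in results)]
    conn.executemany("INSERT INTO rank_history VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
batch = [json.dumps({"project_id": 1, "keyword": "pizza", "rank": 4,
                     "ts": 1700000000})]
drain(conn, batch)
```

Batching the inserts (executemany, one commit) is what keeps write throughput from becoming the bottleneck as row volume grows.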