Amazon Product Scraping Guide (2026)
Amazon runs one of the most aggressive anti-bot systems on the web: captchas, behavioral analysis, and strict per-IP blocking. Here's how to scrape ASINs, prices, and reviews at scale without getting blocked.
1. What You Can Extract
Amazon's product detail pages (PDPs) are a mix of server-rendered and JavaScript-hydrated HTML. The fields this guide covers are reliably extractable: ASIN, product title, list price, star rating, review text and counts, and the Buy Box price (the Buy Box and some variant prices require JS rendering, covered in section 4).
2. Amazon's Anti-Bot Layers
Amazon runs detection at multiple levels. Understanding each lets you target the right mitigation:
| Layer | What it does |
|---|---|
| IP reputation | AWS, GCP, and known datacenter ASNs are near-instantly captcha'd or 503'd. |
| Behavioral detection | Mouse curvature, scroll velocity, request-timing entropy tracked via JS. |
| Amazon captcha | Self-hosted image captcha at /errors/validateCaptcha — not reCAPTCHA. |
| Rate limiting | Aggressive per-IP thresholds; 503 + Robot Check page after ~30 fast requests. |
| Session tracking | session-id, ubid-main, i18n-prefs cookies — persist or clear strategically. |
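A practical consequence of the captcha and rate-limiting layers: check every response for the block interstitial before parsing it. A minimal detector, keyed on the markers named above (treat the exact marker set as an assumption; Amazon can change these pages):

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic check for Amazon's throttle and captcha interstitials.

    503 is the throttle response; the self-hosted captcha page posts to
    /errors/validateCaptcha and titles itself "Robot Check".
    """
    if status_code == 503:
        return True
    markers = ("/errors/validateCaptcha", "Robot Check")
    return any(marker in body for marker in markers)
```

Run it on every response and route blocked hits to rotation/backoff instead of the parser, so a captcha page never gets stored as product data.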
3. Python requests + Mobile Proxy
For server-rendered fields (title, price, rating) plain requests + BeautifulSoup is enough. Route traffic through a mobile proxy so Amazon sees a real carrier IP, not AWS.
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
    "Accept-Language": "en-US,en;q=0.9",
}

proxies = {
    "http": "http://USER:PASS@hostname:http_port",
    "https": "http://USER:PASS@hostname:http_port",
}

def scrape_amazon_product(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    r = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    if r.status_code != 200:
        return None
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    return {
        "asin": asin,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```

The User-Agent should match a real consumer device. Desktop UA + mobile carrier IP is an inconsistency Amazon can flag during behavioral review.
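One way to keep the UA consistent with a mobile exit IP is to draw complete header sets from a small pool of mobile device profiles instead of hard-coding one. A sketch; the UA strings below are illustrative examples, not a vetted list:

```python
import random

# Illustrative mobile profiles. Pick a whole profile at a time --
# never mix fields across profiles, or the headers stop being coherent.
MOBILE_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
]

def pick_mobile_headers(rng=None):
    """Return one coherent mobile header set as a fresh dict."""
    return dict((rng or random).choice(MOBILE_PROFILES))
```

Pin one profile per session rather than re-rolling per request, for the same reason you keep cookies: a visit that changes devices mid-pagination looks anomalous.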
4. Playwright for JS-Hydrated Fields
Buy Box price, "frequently bought together", and some variant prices load via client-side JS after initial HTML. Use Playwright to render and then query:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": "http://hostname:http_port",
        "username": "USER",
        "password": "PASS",
    })
    page = browser.new_page()
    page.goto("https://www.amazon.com/dp/B08N5WRWNW")
    price = page.locator(".a-price .a-offscreen").first.inner_text()
    browser.close()
```

Playwright drives a real Chromium build, so its JA3/JA4 TLS fingerprints match a consumer browser. Combined with a mobile carrier IP, this is the hardest fingerprint to classify as "bot".
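Whichever path the price came from, `.a-offscreen` yields a display string like `$1,299.00`. A small normalizer makes it usable for comparisons and storage; this sketch assumes US-style `$x,xxx.xx` formatting only and would need locale handling for other marketplaces:

```python
import re
from decimal import Decimal

def parse_price(text: str):
    """Convert a display price like '$1,299.00' to Decimal('1299.00').

    Returns None when no numeric price is present (e.g. 'Currently
    unavailable'), so callers can distinguish missing from zero.
    """
    m = re.search(r"[\d,]+(?:\.\d+)?", text)
    if not m:
        return None
    return Decimal(m.group().replace(",", ""))
```

Using `Decimal` rather than `float` avoids rounding drift when you later aggregate or diff price histories.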
5. IP Rotation Strategy
Amazon throttles per-IP. Rotate aggressively on catalog scrapes, stay sticky when a session matters (review pagination, seller storefronts). The MobileProxies.org API exposes a rotation endpoint:
```python
import requests, time

API = "https://buy.mobileproxies.org"
TOKEN = "YOUR_API_KEY"

def rotate(slot_id):
    r = requests.post(
        f"{API}/api/v1/proxies/{slot_id}/switch",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    r.raise_for_status()
    # Wait for the modem to reconnect on the new IP
    time.sleep(10)
    return r.json()

def list_slots():
    r = requests.get(
        f"{API}/api/v1/proxies",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    return r.json()
```

- Catalog crawl: rotate every 20–30 requests.
- Review pagination: keep the session sticky until the last page, then rotate.
- Post-rotation wait: ~10 s for the modem to reconnect on a new carrier IP.
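The rules above can be encoded as a small policy function so the crawl loop doesn't hand-roll rotation logic. A sketch; the default interval of 25 is one point in the 20–30 range above, and the mode names are this sketch's own:

```python
def should_rotate(mode: str, requests_on_ip: int,
                  last_page: bool = False, interval: int = 25) -> bool:
    """Decide whether to hit the rotation endpoint before the next request."""
    if mode == "catalog":
        # Rotate aggressively: a fresh IP every 20-30 requests.
        return requests_on_ip >= interval
    if mode == "reviews":
        # Stay sticky so pagination stays on one session; rotate at the end.
        return last_page
    return False
```

The crawl loop then just checks `should_rotate(...)` each iteration and calls `rotate(slot_id)` (plus the ~10 s wait) when it returns True.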
6. Rate Limiting & Backoff
Even on mobile IPs, hammer Amazon and you'll still hit the Robot Check page. Space requests with jitter and back off exponentially on 503:
```python
import random, time

def polite_get(url, session, max_retries=5):
    for attempt in range(max_retries):
        time.sleep(random.uniform(2, 5))  # random human-ish gap
        r = session.get(url, timeout=30)
        if r.status_code == 200:
            return r
        if r.status_code in (503, 429):
            backoff = (2 ** attempt) + random.random()
            time.sleep(backoff)
            continue
        r.raise_for_status()
    raise RuntimeError("max retries exceeded")
```

Random 2–5 second gaps between requests are a reasonable baseline. Exponential backoff with jitter on 503/429 prevents thundering-herd retries.
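For inspection, the backoff arithmetic works out to a base that doubles per retry (1, 2, 4, 8, 16 seconds) plus up to one second of jitter. The same formula as `polite_get`, pulled out as a pure function:

```python
import random

def backoff_delay(attempt: int, rng: random.Random) -> float:
    # Same formula as polite_get: exponential base plus fractional jitter.
    return (2 ** attempt) + rng.random()

rng = random.Random(42)
schedule = [backoff_delay(a, rng) for a in range(5)]
# Bases double each retry: 1, 2, 4, 8, 16 seconds, each plus [0, 1) of jitter.
```

Five retries therefore cost at most ~36 seconds of backoff total, which bounds worst-case latency per URL.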
7. Common Mistakes
- Using datacenter IPs. AWS/GCP/OVH ranges are flagged almost instantly on Amazon. Use mobile or residential.
- Mismatched User-Agent. Desktop UA + mobile carrier IP looks inconsistent. Match the UA to a plausible device.
- Rotating too fast. Rotating every single request defeats review pagination and triggers session-anomaly heuristics.
- Ignoring cookies. Amazon sets session-id on first visit. Keeping it across page views looks human; wiping it every request does not.
- Scraping signed-in prices. Prime pricing differs from logged-out pricing. Know which number you actually need.
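On the cookies point, the fix is simply to reuse one `requests.Session` per logical visit: the Session's cookie jar keeps `session-id` and friends across page views automatically. The sketch below sets the cookie by hand with a placeholder value only to show the persistence; in a real crawl Amazon's first response sets it via Set-Cookie:

```python
import requests

session = requests.Session()
# In a real crawl this arrives via Set-Cookie on the first response;
# the placeholder value here just demonstrates jar persistence.
session.cookies.set("session-id", "000-0000000-0000000", domain=".amazon.com")

# Every later session.get() through this object replays the jar,
# which reads as one continuous human visit rather than fresh clients.
```

Clear the jar (or build a new Session) only when you rotate IPs between logical visits, so cookie history and IP history change together.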