Amazon Product Scraping Guide (2026)
Amazon runs one of the most aggressive anti-bot systems on the web: captchas, behavioral analysis, and strict per-IP blocking. Here's how to scrape ASINs, prices, and reviews at scale without getting blocked.
1. What You Can Extract
Amazon's product detail pages (PDPs) are a mix of server-rendered and JavaScript-hydrated HTML. The fields this guide covers are reliably extractable: ASIN, product title, list price, star rating, review text and counts, and the Buy Box price (the Buy Box and some variant prices require JS rendering, covered in section 4).
2. Amazon's Anti-Bot Layers
Amazon runs detection at multiple levels. Understanding each lets you target the right mitigation:
| Layer | What it does |
|---|---|
| IP reputation | AWS, GCP, and known datacenter ASNs are near-instantly captcha'd or 503'd. |
| Behavioral detection | Mouse curvature, scroll velocity, request-timing entropy tracked via JS. |
| Amazon captcha | Self-hosted image captcha at /errors/validateCaptcha — not reCAPTCHA. |
| Rate limiting | Aggressive per-IP thresholds; 503 + Robot Check page after ~30 fast requests. |
| Session tracking | session-id, ubid-main, i18n-prefs cookies — persist or clear strategically. |
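A practical consequence of the captcha and rate-limiting layers: check every response for the block interstitial before parsing it. A minimal detector, keyed on the markers named above (treat the exact marker set as an assumption; Amazon can change these pages):

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic check for Amazon's throttle and captcha interstitials.

    503 is the throttle response; the self-hosted captcha page posts to
    /errors/validateCaptcha and titles itself "Robot Check".
    """
    if status_code == 503:
        return True
    markers = ("/errors/validateCaptcha", "Robot Check")
    return any(marker in body for marker in markers)
```

Run it on every response and route blocked hits to rotation/backoff instead of the parser, so a captcha page never gets stored as product data.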
3. Python requests + Mobile Proxy
For server-rendered fields (title, price, rating) plain requests + BeautifulSoup is enough. Route traffic through a mobile proxy so Amazon sees a real carrier IP, not AWS.
```python
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
    "Accept-Language": "en-US,en;q=0.9",
}

proxies = {
    "http": "http://USER:PASS@hostname:http_port",
    "https": "http://USER:PASS@hostname:http_port",
}

def scrape_amazon_product(asin):
    url = f"https://www.amazon.com/dp/{asin}"
    r = requests.get(url, headers=headers, proxies=proxies, timeout=30)
    if r.status_code != 200:
        return None
    soup = BeautifulSoup(r.text, "html.parser")
    title = soup.select_one("#productTitle")
    price = soup.select_one(".a-price .a-offscreen")
    return {
        "asin": asin,
        "title": title.get_text(strip=True) if title else None,
        "price": price.get_text(strip=True) if price else None,
    }
```

The User-Agent should match a real consumer device. Desktop UA + mobile carrier IP is an inconsistency Amazon can flag during behavioral review.
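One way to keep the UA consistent with a mobile exit IP is to draw complete header sets from a small pool of mobile device profiles instead of hard-coding one. A sketch; the UA strings below are illustrative examples, not a vetted list:

```python
import random

# Illustrative mobile profiles. Pick a whole profile at a time --
# never mix fields across profiles, or the headers stop being coherent.
MOBILE_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
]

def pick_mobile_headers(rng=None):
    """Return one coherent mobile header set as a fresh dict."""
    return dict((rng or random).choice(MOBILE_PROFILES))
```

Pin one profile per session rather than re-rolling per request, for the same reason you keep cookies: a visit that changes devices mid-pagination looks anomalous.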
4. Playwright for JS-Hydrated Fields
Buy Box price, "frequently bought together", and some variant prices load via client-side JS after initial HTML. Use Playwright to render and then query:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(proxy={
        "server": "http://hostname:http_port",
        "username": "USER",
        "password": "PASS",
    })
    page = browser.new_page()
    page.goto("https://www.amazon.com/dp/B08N5WRWNW")
    price = page.locator(".a-price .a-offscreen").first.inner_text()
    browser.close()
```

Playwright drives a real Chromium build, so its JA3/JA4 TLS fingerprints match a consumer browser. Combined with a mobile carrier IP, this is the hardest fingerprint to classify as "bot".
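Whichever path the price came from, `.a-offscreen` yields a display string like `$1,299.00`. A small normalizer makes it usable for comparisons and storage; this sketch assumes US-style `$x,xxx.xx` formatting only and would need locale handling for other marketplaces:

```python
import re
from decimal import Decimal

def parse_price(text: str):
    """Convert a display price like '$1,299.00' to Decimal('1299.00').

    Returns None when no numeric price is present (e.g. 'Currently
    unavailable'), so callers can distinguish missing from zero.
    """
    m = re.search(r"[\d,]+(?:\.\d+)?", text)
    if not m:
        return None
    return Decimal(m.group().replace(",", ""))
```

Using `Decimal` rather than `float` avoids rounding drift when you later aggregate or diff price histories.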
5. IP Rotation Strategy
Amazon throttles per-IP. Rotate aggressively on catalog scrapes, stay sticky when a session matters (review pagination, seller storefronts). The MobileProxies.org API exposes a rotation endpoint:
```python
import requests, time

API = "https://buy.mobileproxies.org"
TOKEN = "YOUR_API_KEY"

def rotate(slot_id):
    r = requests.post(
        f"{API}/api/v1/proxies/{slot_id}/switch",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    r.raise_for_status()
    # Wait for the modem to reconnect on the new IP
    time.sleep(10)
    return r.json()

def list_slots():
    r = requests.get(
        f"{API}/api/v1/proxies",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    return r.json()
```

- Catalog crawl: rotate every 20–30 requests.
- Review pagination: keep the session sticky until the last page, then rotate.
- Post-rotation wait: ~10 s for the modem to reconnect on a new carrier IP.
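The rules above can be encoded as a small policy function so the crawl loop doesn't hand-roll rotation logic. A sketch; the default interval of 25 is one point in the 20–30 range above, and the mode names are this sketch's own:

```python
def should_rotate(mode: str, requests_on_ip: int,
                  last_page: bool = False, interval: int = 25) -> bool:
    """Decide whether to hit the rotation endpoint before the next request."""
    if mode == "catalog":
        # Rotate aggressively: a fresh IP every 20-30 requests.
        return requests_on_ip >= interval
    if mode == "reviews":
        # Stay sticky so pagination stays on one session; rotate at the end.
        return last_page
    return False
```

The crawl loop then just checks `should_rotate(...)` each iteration and calls `rotate(slot_id)` (plus the ~10 s wait) when it returns True.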
6. Rate Limiting & Backoff
Even on mobile IPs, hammer Amazon and you'll still hit the Robot Check page. Space requests with jitter and back off exponentially on 503:
```python
import random, time

def polite_get(url, session, max_retries=5):
    for attempt in range(max_retries):
        time.sleep(random.uniform(2, 5))  # random human-ish gap
        r = session.get(url, timeout=30)
        if r.status_code == 200:
            return r
        if r.status_code in (503, 429):
            backoff = (2 ** attempt) + random.random()
            time.sleep(backoff)
            continue
        r.raise_for_status()
    raise RuntimeError("max retries exceeded")
```

Random 2–5 second gaps between requests are a reasonable baseline. Exponential backoff with jitter on 503/429 prevents thundering-herd retries.
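For inspection, the backoff arithmetic works out to a base that doubles per retry (1, 2, 4, 8, 16 seconds) plus up to one second of jitter. The same formula as `polite_get`, pulled out as a pure function:

```python
import random

def backoff_delay(attempt: int, rng: random.Random) -> float:
    # Same formula as polite_get: exponential base plus fractional jitter.
    return (2 ** attempt) + rng.random()

rng = random.Random(42)
schedule = [backoff_delay(a, rng) for a in range(5)]
# Bases double each retry: 1, 2, 4, 8, 16 seconds, each plus [0, 1) of jitter.
```

Five retries therefore cost at most ~36 seconds of backoff total, which bounds worst-case latency per URL.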
7. Common Mistakes
- Using datacenter IPs. AWS/GCP/OVH ranges are flagged almost instantly on Amazon. Use mobile or residential.
- Mismatched User-Agent. Desktop UA + mobile carrier IP looks inconsistent. Match the UA to a plausible device.
- Rotating too fast. Rotating every single request defeats review pagination and triggers session-anomaly heuristics.
- Ignoring cookies. Amazon sets session-id on first visit. Keeping it across page views looks human; wiping it every request does not.
- Scraping signed-in prices. Prime pricing differs from logged-out pricing. Know which number you actually need.
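On the cookies point, the fix is simply to reuse one `requests.Session` per logical visit: the Session's cookie jar keeps `session-id` and friends across page views automatically. The sketch below sets the cookie by hand with a placeholder value only to show the persistence; in a real crawl Amazon's first response sets it via Set-Cookie:

```python
import requests

session = requests.Session()
# In a real crawl this arrives via Set-Cookie on the first response;
# the placeholder value here just demonstrates jar persistence.
session.cookies.set("session-id", "000-0000000-0000000", domain=".amazon.com")

# Every later session.get() through this object replays the jar,
# which reads as one continuous human visit rather than fresh clients.
```

Clear the jar (or build a new Session) only when you rotate IPs between logical visits, so cookie history and IP history change together.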