Shopify Store Scraping & Product Data
Most Shopify stores expose their entire catalog through a single public JSON endpoint. Here's how to use it responsibly, what to do when it's disabled, and what the data actually contains.
1. The /products.json Endpoint
Nearly every Shopify store serves a public JSON endpoint at /products.json that returns the full catalog in structured form, up to 250 products per page, with pagination. It was originally designed to power app integrations, but anyone can request it without authentication.
Scraping it is a legal gray area in the same sense as any other publicly served JSON: respect each store's Terms of Service, don't overwhelm their origin, and use the data only for lawful purposes (research, price comparison, personal analytics).
Note: Store owners can disable this endpoint via the theme or an app. If it returns 404 or an empty products array, fall back to the sitemap approach in section 3.
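One way to decide when to fall back is to classify the first-page response before paginating. This is a minimal sketch, not a definitive check: the helper name `products_endpoint_usable` is hypothetical, and it assumes a disabled endpoint shows up as a non-200 status, a non-JSON body, or an empty products array. A store with no published products would look the same.

```python
import json


def products_endpoint_usable(status_code, body_text):
    """Return True if a first-page /products.json response looks usable.

    Hypothetical helper: treats a non-200 status, invalid JSON, or an
    empty "products" array as a signal to fall back to the sitemap.
    """
    if status_code != 200:
        return False
    try:
        data = json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    products = data.get("products")
    return isinstance(products, list) and len(products) > 0
```

With `requests`, the two arguments would typically be `r.status_code` and `r.text` from the first page request.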
2. Python: Paginating the Catalog
The JSON endpoint supports limit (max 250) and page parameters. Paginate until the products array comes back empty:
```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Accept": "application/json",
}

# Placeholder proxy credentials; replace with your provider's values.
proxies = {
    "http": "http://USER:PASS@hostname:http_port",
    "https": "http://USER:PASS@hostname:http_port",
}


def scrape_shopify_products(store_url):
    all_products = []
    page = 1
    while True:
        url = f"{store_url}/products.json?limit=250&page={page}"
        r = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if r.status_code != 200:
            break  # endpoint disabled, blocked, or rate limited
        data = r.json()
        products = data.get("products", [])
        if not products:
            break  # an empty page means the catalog is exhausted
        all_products.extend(products)
        page += 1
    return all_products


# usage
catalog = scrape_shopify_products("https://examplestore.com")
print(f"fetched {len(catalog)} products")
```

Routing through a mobile proxy helps you stay under per-IP rate limits on large catalogs and reduces the chance of being flagged when you crawl many stores from the same origin.
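To avoid overwhelming a store's origin, the pagination loop can pause between pages. The sketch below is one possible shape rather than a required pattern: `paginate_politely`, `fetch_page`, and the one-second default delay are illustrative assumptions, and the sleep function is injectable so the loop can be exercised without waiting or touching the network.

```python
import time


def paginate_politely(fetch_page, delay_seconds=1.0, max_pages=1000, sleep=time.sleep):
    """Collect products page by page, pausing after each non-empty page.

    fetch_page(page) is a caller-supplied function (hypothetical here) that
    returns the list of products for that page, or an empty list when done.
    """
    all_products = []
    for page in range(1, max_pages + 1):
        products = fetch_page(page)
        if not products:
            break
        all_products.extend(products)
        sleep(delay_seconds)  # fixed pause between requests
    return all_products
```

In practice `fetch_page` would wrap the `requests.get` call for a single page and return `data.get("products", [])`.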
3. Fallback: sitemap.xml & Storefront
When /products.json is disabled, Shopify still publishes a sitemap that indexes every product URL. Start at /sitemap.xml and follow the sitemap_products_*.xml children:
```python
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def product_urls_from_sitemap(store_url):
    # Reuses the `proxies` dict defined in section 2.
    urls = []
    index = requests.get(f"{store_url}/sitemap.xml", proxies=proxies, timeout=30).text
    root = ET.fromstring(index)
    for loc in root.findall(".//sm:loc", NS):
        # The index lists several child sitemaps; keep only the product ones.
        if loc.text and "sitemap_products" in loc.text:
            sub = requests.get(loc.text, proxies=proxies, timeout=30).text
            subroot = ET.fromstring(sub)
            for u in subroot.findall(".//sm:loc", NS):
                # Skip non-product entries such as the store homepage.
                if u.text and "/products/" in u.text:
                    urls.append(u.text)
    return urls
```

From each product URL, parse the storefront HTML. Many Shopify themes embed a full product object in a <script type="application/json"> tag whose id starts with ProductJson-, with largely the same fields as /products.json.
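Pulling that embedded object out of a product page could look like the sketch below. It uses only the standard library's `html.parser`; the class and function names are hypothetical, and it assumes the theme follows the ProductJson-* id convention described above, which not every theme does.

```python
import json
from html.parser import HTMLParser


class ProductJsonExtractor(HTMLParser):
    """Collect JSON from <script type="application/json" id="ProductJson-...">."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self._chunks = []
        self.products = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if (
            tag == "script"
            and attr.get("type") == "application/json"
            and (attr.get("id") or "").startswith("ProductJson-")
        ):
            self._capturing = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several chunks; buffer them all.
        if self._capturing:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._capturing:
            self._capturing = False
            try:
                self.products.append(json.loads("".join(self._chunks)))
            except ValueError:
                pass  # skip payloads that are not valid JSON


def extract_product_json(html):
    """Return a list of product dicts found in the page (possibly empty)."""
    parser = ProductJsonExtractor()
    parser.feed(html)
    parser.close()
    return parser.products
```

An empty result means the theme does not follow this convention, so a different selector or another data source would be needed for that store.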
4. What's Inside the JSON
Each product object in /products.json is rich. The fields you'll actually use:
| Field | What it contains |
| --- | --- |
| id, handle, title | Stable identifiers; handle is the URL slug |
| vendor, product_type | Brand and category |
| tags | Array of free-form tags, useful for clustering |
| variants[] | SKU, price, compare_at_price, weight, barcode, available |
| options[] | Size/color/style definitions |
| images[] | Full-resolution Shopify CDN URLs |
| body_html | Rich-text product description |
| created_at, updated_at, published_at | ISO timestamps; use updated_at to detect inventory/price changes |
One field that's not exposed: exact inventory count. You get available: true/false per variant only.
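For price analysis it is often easier to work with one row per variant than with nested product objects. A minimal sketch, assuming the field names listed above; the function name `variant_rows` is hypothetical. Prices are passed through `Decimal` to avoid float rounding.

```python
from decimal import Decimal


def variant_rows(product):
    """Flatten one product dict into a list of per-variant rows."""
    rows = []
    for variant in product.get("variants", []):
        price = variant.get("price")
        rows.append({
            "product_id": product.get("id"),
            "handle": product.get("handle"),
            "sku": variant.get("sku"),
            # str() keeps this safe whether the price is a string or a number
            "price": Decimal(str(price)) if price is not None else None,
            "available": variant.get("available"),
        })
    return rows
```

Calling it on every product in the catalog and concatenating the results gives a flat table that can be written straight to CSV.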
5. Practical Use Cases
- Competitor research. Snapshot a rival's catalog, monitor pricing and new SKU drops via updated_at.
- Dropshipping product sourcing. Cross-reference popular Shopify products with supplier catalogs.
- Market trend analysis. Track tag frequency across hundreds of stores to spot rising categories.
- Brand monitoring. Detect unauthorized resellers by name-matching your vendor field on other stores.
- Feed ingestion. Power product comparison sites without per-merchant API integrations.
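The monitoring use case comes down to comparing two snapshots of the same store. The sketch below is one way to do that under stated assumptions: the function name `diff_snapshots` is hypothetical, products are matched on id, and a product counts as changed when its updated_at string differs between snapshots.

```python
def diff_snapshots(old_products, new_products):
    """Compare two catalog snapshots and report product ids by change type."""
    old_by_id = {p["id"]: p for p in old_products}
    new_by_id = {p["id"]: p for p in new_products}

    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    changed = sorted(
        pid
        for pid in new_by_id.keys() & old_by_id.keys()
        if new_by_id[pid].get("updated_at") != old_by_id[pid].get("updated_at")
    )
    return {"added": added, "removed": removed, "changed": changed}
```

A changed updated_at only says that something on the product was modified; comparing variant prices between the two snapshots is still needed to tell a price change from a description edit.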