Why Our Amazon Scrapers Broke Every 14 Days — And Why We Stopped Fixing Them
Isabela Rodriguez · InfoSec Write-ups · ~6 min read · March 14, 2026 (Updated: March 14, 2026)

Context: Why a LATAM E-commerce Company Cares About Amazon Data
At Mercado Libre, we operate the largest e-commerce ecosystem in Latin America. Pricing intelligence, catalog enrichment, assortment benchmarking, and competitive monitoring are part of our daily modeling workflows.
Even though Amazon does not operate uniformly across LATAM markets, it remains the global reference point for:
Cross-border pricing benchmarks
International brand positioning
Category trend detection
Buy Box and fulfillment logic modeling
Product taxonomy evolution
For several of our internal analytics pipelines, we needed high-frequency, structured Amazon product data across multiple marketplaces (US, UK, DE, JP).
So we built our own scraper fleet.
It worked.
Until it didn't.
The 14-Day Breakage Cycle
Over two years, our team maintained:
23 custom spiders
~1.4M product pages per week
6 Amazon marketplaces
4 engineers rotating on maintenance
The architecture:
Scrapy clusters
Rotating residential + datacenter proxies
Headless Chromium for JS-rendered elements
A versioned selector registry
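A minimal sketch of what such a versioned selector registry can look like (the field names, version numbers, and selectors below are hypothetical, not our production config):

```python
# Hypothetical versioned selector registry, keyed by (field, marketplace),
# so a selector fix can be rolled out, and rolled back, per version.
SELECTOR_REGISTRY = {
    ("price", "amazon.com"): {
        "version": 41,
        "selectors": ["#corePrice_feature_div .a-offscreen", ".a-price-whole"],
    },
    ("price", "amazon.de"): {
        "version": 38,
        "selectors": ["#apex_offerDisplay_desktop .a-offscreen"],
    },
}

def selectors_for(field, marketplace):
    """Look up the currently deployed selector chain for a field."""
    entry = SELECTOR_REGISTRY.get((field, marketplace))
    return entry["selectors"] if entry else []

print(selectors_for("price", "amazon.com"))
```

Versioning matters because a "fix" for one marketplace can silently break another; pinning each chain to a version makes rollbacks a one-line change.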
We logged:
Selector failures
IP blocks
Layout variants
CAPTCHA frequency
Response anomalies
After 14 months of telemetry, one pattern stood out:
Selectors broke every 11–16 days (median: ~14 days).
This is not an official Amazon statistic. It reflects our monitored pages. But the pattern was consistent.
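As a hedged illustration of how a figure like that falls out of telemetry (the selectors and dates below are invented, not our production data), the computation reduces to a median over deploy-to-first-failure intervals:

```python
import statistics
from datetime import date

# Hypothetical failure log: (selector, deployed, first_failure).
# Illustrative values only, not real telemetry.
failures = [
    ("#corePrice_feature_div .a-offscreen", date(2025, 1, 1), date(2025, 1, 12)),
    (".a-price-whole", date(2025, 1, 3), date(2025, 1, 19)),
    ("#apex_offerDisplay_desktop", date(2025, 1, 5), date(2025, 1, 19)),
]

# Days each selector survived before its first failure.
lifespans = [(first_failure - deployed).days for _, deployed, first_failure in failures]

print(statistics.median(lifespans))  # → 14
```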
Why?
Because Amazon's product pages are not static. They are continuously reshaped by:
A/B experiments
Incremental UI rollouts
Component restructuring
Anti-bot countermeasures
A Real Example of Selector Fragility
This worked for two weeks:

```python
price = soup.select_one('.a-price-whole')
```

Then overnight: `None`.

The DOM changed. `.a-offscreen` moved under a new wrapper container. Flat selectors stopped matching.

Selenium logs at 2AM:

```
NoSuchElementException: Unable to locate element: .a-price-whole
```
The fix? 30 minutes.
The real cost?
4–6 hours:
Detect anomaly
Diagnose variant
Test across marketplaces
Deploy safely
Multiply by product, search, category, reviews, seller pages.
We were burning 80+ engineering hours per month just keeping extraction alive.
The Four Failure Modes That Stack Against You
Most scraping tutorials teach you how to extract data.
Few explain why that extraction will fail.
Here's what we learned.
1. Selector Instability
Amazon mixes:
Stable semantic classes (`a-price-whole`)
Hashed rotating classes (`_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y`)
From our logs:
Hashed classes median lifespan: 9 days
"Stable" classes broke when parent structure changed
There is no such thing as a 30-day-stable selector.
2. Layout Variants
A single product page can render in 7+ structural formats.
Price might appear inside:
#corePrice_feature_div
#apex_offerDisplay_desktop
Nested accordion containers
Mobile vs desktop DOMs
We maintained 4–6 fallback paths per data field.
New variants appeared monthly.
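Those fallback paths can be sketched as an ordered chain, tried until one matches. This is a simplified version of the pattern, with hypothetical selectors; in production the `select_one` callable would be a thin wrapper around an HTML parser such as BeautifulSoup's `soup.select_one`:

```python
# Hypothetical fallback chain for the price field, tried in order.
PRICE_SELECTORS = [
    "#corePrice_feature_div .a-offscreen",
    "#apex_offerDisplay_desktop .a-offscreen",
    ".a-price .a-offscreen",
    ".a-price-whole",
]

def extract_first(select_one, selectors):
    """Return (value, selector) for the first selector that matches.

    `select_one` is any callable mapping a CSS selector to extracted
    text or None.
    """
    for sel in selectors:
        value = select_one(sel)
        if value:
            return value, sel
    return None, None  # all fallbacks failed: raise an anomaly alert

# Simulate a page where only the apex layout variant is present.
page = {"#apex_offerDisplay_desktop .a-offscreen": "$19.99"}
value, matched = extract_first(page.get, PRICE_SELECTORS)
print(value, matched)  # → $19.99 #apex_offerDisplay_desktop .a-offscreen
```

The chain buys resilience against known variants, but every new variant still means a code change, which is exactly the treadmill described above.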
3. Anti-Bot Escalation
Amazon deploys layered defenses:
Edge-level filtering
Request fingerprinting
TLS handshake analysis
Behavioral tracking
Silent 200 responses with degraded pages
Our block rate climbed:
8% → 27% in six months
Same proxy pool.
The real danger wasn't hard blocks.
It was silent garbage ingestion.
You must validate every response against an expected schema.
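A minimal sketch of that validation step (field names and types are illustrative): every response, even an HTTP 200, is checked against the expected schema before ingestion.

```python
# Expected schema for a product record. Illustrative fields only.
EXPECTED_SCHEMA = {"title": str, "price": str, "rating": str}

def schema_errors(record):
    """Return a list of schema violations; an empty list means the record is clean."""
    errors = []
    for field, ftype in EXPECTED_SCHEMA.items():
        if record.get(field) is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

# A silently degraded page: HTTP 200, but the price block never rendered.
degraded = {"title": "Example Product", "rating": "4.5"}
print(schema_errors(degraded))  # → ['missing field: price']
```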
4. Rendering Dependencies
~40% of critical signals were JS-injected:
Buy Box
Stock status
Delivery estimate
Scrapy alone missed half the data.
Headless browsers increased compute costs 5–8x.
And introduced:
Memory leaks
Zombie processes
Container instability
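One cheap diagnostic for this, sketched below with made-up field values, is diffing the fields extractable from raw HTML against those present after rendering; anything in the difference forces a headless browser:

```python
def js_injected_fields(static_fields, rendered_fields):
    """Fields that appear only after JavaScript execution."""
    return {
        field for field, value in rendered_fields.items()
        if value is not None and static_fields.get(field) is None
    }

# Illustrative extraction results from the same product page.
static = {"title": "Example", "price": "$19.99", "buy_box": None, "stock": None}
rendered = {"title": "Example", "price": "$19.99", "buy_box": "Sold by Amazon", "stock": "In Stock"}

print(sorted(js_injected_fields(static, rendered)))  # → ['buy_box', 'stock']
```

Running this diff per page type tells you which spiders can stay on cheap plain-HTTP fetching and which must pay the headless-browser tax.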
The True Cost of In-House Amazon Scraping
Annualized:
Image Created with OpenAI
This excludes opportunity cost.
During 14 months of maintenance:
3 revenue-generating analytics features stayed in backlog
2.5 FTEs were debugging DOM changes instead of building models
For larger fleets (50K+ ASINs), peer teams report $150K–$300K annually.
Evaluating Alternatives
We tested three approaches:
1. General-Purpose Scraping APIs (ScraperAPI, Oxylabs)
Solved proxy rotation.
Did not solve selector maintenance.
We still got paged when the DOM changed.
2. Bulk Data Marketplaces (AWS Data Exchange)
Large weekly dumps.
No real-time filtering.
Too coarse for ASIN-level modeling.
3. Amazon-Specific Managed Scrapers
We needed:
Real-time ASIN-level pulls
Structured JSON
DOM-change resilience
Zero maintenance overhead
We decided to evaluate Bright Data's Amazon Scraper API based on recommendations from several of our worldwide partners.
What the Managed API Looks Like
```python
import requests
import time

API_KEY = "YOUR_API_KEY"
DATASET_ID = "gd_l7q7dkf244hwjntr0"  # your scraper's dataset_id
ASIN = "B0CX23V2ZK"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

# 1) Trigger the scraper
trigger_url = f"https://api.brightdata.com/datasets/v3/trigger?dataset_id={DATASET_ID}"
payload = [
    {
        "url": f"https://www.amazon.com/dp/{ASIN}"
        # or "asin": ASIN, depending on the scraper's input schema
    }
]
trigger_resp = requests.post(trigger_url, headers=headers, json=payload, timeout=30)
trigger_resp.raise_for_status()
snapshot_id = trigger_resp.json()["snapshot_id"]
print("Snapshot:", snapshot_id)

# 2) Poll snapshot status, with a deadline so a stuck job cannot hang the pipeline
snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}"
deadline = time.monotonic() + 600  # 10-minute ceiling
while True:
    status_resp = requests.get(snapshot_url, headers=headers, timeout=30)
    status_resp.raise_for_status()
    status = status_resp.json()["status"]
    print("Status:", status)
    if status == "ready":
        break
    if status == "failed":
        raise RuntimeError("Collection failed")
    if time.monotonic() > deadline:
        raise TimeoutError("Snapshot not ready within 10 minutes")
    time.sleep(5)

# 3) Download results as JSON
download_resp = requests.get(f"{snapshot_url}?format=json", headers=headers, timeout=30)
download_resp.raise_for_status()
products = download_resp.json()
print("Got", len(products), "records")
print(products[0])  # should include title, price, rating, reviews, etc.
```
We send an ASIN. We receive structured JSON.
No selector maintenance.
No browser cluster.
No proxy management.
Before vs After Using Amazon Scraper API
Table created using OpenAI
Why Auto-Healing Was the Real Differentiator
The key differentiator wasn't just proxy management or higher success rates.
It was auto-healing extraction logic.
In a traditional scraper architecture, the extraction layer is tightly coupled to the DOM. When Amazon:
Renames a class
Wraps an element in a new container
Moves a price block under a different component
Introduces a new layout variant
Injects content via a modified JavaScript sequence
…your selectors fail.
Even if the data is still present on the page, your parser no longer knows where to look.
In a DIY system, that triggers a predictable cycle:
Monitoring detects an anomaly (drop in field population or schema mismatch)
Engineers inspect raw HTML
New selectors are written and tested across marketplaces
Edge cases are validated
The parser is redeployed
Backfilled data is re-collected
This is the 4–6 hour "fix" that repeats every couple of weeks.
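The "monitoring detects an anomaly" step in that loop is typically a field-population check. A sketch, with a hypothetical 95% alert threshold:

```python
def field_population(records, field):
    """Fraction of records where the field was successfully extracted."""
    populated = sum(1 for r in records if r.get(field) is not None)
    return populated / len(records) if records else 0.0

def detect_anomaly(records, field, threshold=0.95):
    """True when extraction coverage drops below the alert threshold."""
    return field_population(records, field) < threshold

# A healthy batch vs. one where a layout change broke price extraction.
healthy = [{"price": "19.99"}] * 99 + [{"price": None}]
broken = [{"price": "19.99"}] * 60 + [{"price": None}] * 40

print(detect_anomaly(healthy, "price"))  # → False (99% coverage)
print(detect_anomaly(broken, "price"))   # → True (60% coverage)
```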
With an auto-healing managed scraper, that maintenance loop moves upstream.
When the DOM changes:
The extraction logic is updated server-side
Layout variants are detected automatically
CAPTCHA gates and bot defenses are handled at the infrastructure layer
The output schema remains stable
From our pipeline's perspective, nothing changes.
The API contract stays identical:
Same fields
Same structure
Same JSON schema
No code changes required
No redeployment required
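That stable contract is what downstream code can pin against. A sketch using a frozen dataclass (the field names are illustrative, not the provider's actual output schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductRecord:
    """Fields our pricing models depend on. Illustrative only."""
    asin: str
    title: str
    price: float
    rating: float

def to_record(raw):
    # Fails fast (KeyError / ValueError) on schema drift instead of
    # silently feeding malformed data into models downstream.
    return ProductRecord(
        asin=raw["asin"],
        title=raw["title"],
        price=float(raw["price"]),
        rating=float(raw["rating"]),
    )

record = to_record({"asin": "B000000000", "title": "Example", "price": "19.99", "rating": "4.4"})
print(record.price)  # → 19.99
```

Converting at the boundary means a missing or retyped field surfaces as one loud exception at ingestion, not as a quiet error buried in a model's feature matrix.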
The important nuance here is not just that selectors are updated — it's that the abstraction layer is preserved .
Our downstream systems (pricing models, feature engineering pipelines, dashboards) depend on schema stability. If a field disappears or shifts type, it can cascade into:
Model input failures
Dashboard errors
Mispriced SKUs
Alert storms
Auto-healing essentially decouples our analytics layer from Amazon's front-end volatility.
Instead of reacting to DOM changes, we stopped caring about them.
And for a data science team whose goal is to build models — not maintain parsers — that architectural separation made all the difference.
What Changed for Our Data Science Team
After migration:
Repricing engine shipped
Trend dashboard shipped
Forecasting model shipped
All three were blocked for over a year.
The migration took less than a day.
We replaced:
23 spiders
Proxy orchestration
Browser clusters
With a single endpoint.
Decision Framework
Not every team should switch to managed scraping. A managed scraping API makes sense when:
You scrape >10K pages/day
You monitor multiple marketplaces
Your data feeds revenue systems
You lack spare engineering bandwidth
Selector breakage impacts downstream ML models
DIY in-house scraping still works if:
Volume <1K pages/day
Single marketplace
A ~95% success rate is acceptable
Tolerable downtime
The Real Insight
Within three months of migrating to Bright Data's Amazon Scraper API, we had shipped all three backlogged features: the repricing engine, the trend dashboard, and the forecasting model. Each had been blocked for over a year because scraper maintenance consumed the engineering bandwidth needed to build them.
Amazon scraping does not fail because engineers are careless.
It fails because Amazon's front-end evolves continuously.
When your analytics pipeline depends on unstable DOM structures, maintenance becomes a permanent tax on innovation.
For us, the tipping point wasn't cost.
It was opportunity cost.
And once we measured it, the decision became straightforward.
#amazon-scraping-api #data-science #ecommerce-software #amazon-web-services #scrapy