Amazon scraping stopped working suddenly last Friday | Apify & Crawlee

jade wigeon May 16, 2023, 11:00 PM

I have a scraper for Amazon pages. Everything was working fine for a month. I had lots of calls per actor run and all ultimately ended up being successful. An occasional 503 was returned but retries fixed the problem. As of last Friday I am getting a 503 for most calls. I was able to improve it a bit by:

configuring proxies and retiring sessions
residential proxies
more headers, manually rotating user agent, etc.
buying more IPs

Still, only 70% of my calls succeed. Before Friday everything was fine. Some requests have also started to return a CAPTCHA, which I hadn't seen before. Once a CAPTCHA is returned, there is no way of recovering from that failure.
I use a CheerioCrawler with the following proxy/session initialisation:

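const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
  countryCode: 'DE',
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  sessionPoolOptions: {
    sessionOptions: {
      maxUsageCount: 5, // retire a session after 5 requests
      maxErrorScore: 1, // a single error is enough to retire a session
    },
  },
  ...
});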

When a call fails (either a 503, or I don't find specific information on my page), I manually mark the session as bad in the errorHandler:

session.markBad();

In the log I see the session is re-generated and a retry happens, but subsequent calls fail.
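
For illustration, the handling described above could look roughly like the sketch below. It is not the original actor code: `retireIfBlocked` and the `captchaMarkers` list are hypothetical names, and the default marker string is only an assumption about what a CAPTCHA page contains. It calls `session.retire()`, the Crawlee session method that drops a session from the pool right away, instead of relying on `markBad()` and the error score.

```javascript
// Hypothetical helper, meant to be called from a Crawlee requestHandler or errorHandler.
// Treats a 503 status or a CAPTCHA-looking body as "blocked" and retires the session,
// so the retry is made with a different session.
function retireIfBlocked({ session, statusCode, body }, captchaMarkers = ['captcha']) {
    const text = String(body ?? '').toLowerCase();
    const looksLikeCaptcha = captchaMarkers.some((marker) => text.includes(marker.toLowerCase()));
    const blocked = statusCode === 503 || looksLikeCaptcha;

    if (blocked && session) {
        // Assumption: `session` is a Crawlee Session; retire() removes it from the pool.
        session.retire();
    }
    return blocked;
}
```

Returning the boolean lets the caller decide whether to throw and trigger a retry; the marker list should be adjusted to whatever the blocked pages actually contain.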

My questions:

does marking the session as bad make Apify use a new IP?
is there a way to debug the IPs? I can get the proxy information but this is not equal to the actual IP of the request... If it is, it seems like the IPs are not rotated on failure since one failed request ultimately ends up failing on all (10) following retries and all urls are equal
how can I improve my scraping in any other way? Right now it is actually unusable since 30% of my calls fail 😦
where can I find a high-level architectural documentation on Apify? The majority of docs seem to be autogenerated and rarely describe the flow/architecture of Apify/Cheerio

I would be very thankful for any help here.

Cheers,

Chris

wanton lion May 17, 2023, 2:24 PM

Hey there! So, to answer your questions:

marking a session as bad increases its errorScore by 1. So if your max error score is set to 1, the session is retired.
you could make an additional request, e.g. to https://api.apify.com/v2/browser-info?skipHeaders=1, to get the IP address.
this is hard to say; it's all about trying this and that. You could try using the headers provided by the fingerprint suite (enabled by default in got-scraping), residential proxies, a browser, retiring the session after e.g. 3 requests, etc. There's really no universal answer for this, but Amazon is known for rather aggressive blocking. You can read more here: https://docs.apify.com/academy/anti-scraping/techniques
we hardly have any auto-generated docs. Not sure if you mean something like this https://docs.apify.com/academy/apify-scrapers/cheerio-scraper ? Or you could generally check the following part of the docs: https://docs.apify.com/academy
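
The IP check from the second answer could be sketched like this. `getExitIp` is a hypothetical helper name, and the `clientIp` field is an assumption about the JSON this endpoint returns, so check the actual response before relying on it. The sketch uses the `sendRequest` helper from the Crawlee crawling context, which should send the extra request through the same proxy session as the page request.

```javascript
const IP_CHECK_URL = 'https://api.apify.com/v2/browser-info?skipHeaders=1';

// Hypothetical helper: pass it the crawling context (or anything with a
// compatible `sendRequest` function) and it returns the reported exit IP.
async function getExitIp({ sendRequest }) {
    const response = await sendRequest({ url: IP_CHECK_URL, responseType: 'json' });
    // Assumption: the endpoint reports the caller's address as `clientIp`.
    return response.body?.clientIp ?? null;
}

// Usage inside a crawler (sketch):
//   async requestHandler(context) {
//       context.log.info(`exit IP: ${await getExitIp(context)}`);
//   }
```

Logging this value next to the session ID on every retry should show whether a retired session really leads to a different IP.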

jade wigeon May 17, 2023, 5:34 PM

Hey Andrey, thanks a lot for the answer. Regarding the auto-generated docs: I was rather referring to the API docs, where I would expect more than a pure description of the parameters, with an example here and there (for example, the tip you gave me on how to inspect the IP). But thanks for pointing me to the docs, I will check them out.

I already tried using residential proxies but it actually didn't improve much....

Since I am scraping with Cheerio, I have my pregenerated URLs, which I just add to the queue/list. Now, I could obviously try using Puppeteer/Playwright to open the main page, have the browser navigate to a specific URL and pretend to be a real browser, but that sounds like a very slow approach. Alternatively, I could potentially fake it with Cheerio too, by having each product request visit the root page first and then using the response headers/cookies to visit the product page.
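
The "visit the root page first" idea could be sketched as follows. `cookieHeaderFromSetCookie` is a hypothetical, library-agnostic helper that turns the `Set-Cookie` headers of the first response into a `Cookie` header for the follow-up request. It deliberately ignores cookie attributes such as domain, path and expiry, so it is only a simplification of what a real cookie jar does.

```javascript
// Build a Cookie header value from the Set-Cookie header(s) of a previous response.
// Accepts a single header string, an array of header strings, or undefined.
function cookieHeaderFromSetCookie(setCookie) {
    const headers = Array.isArray(setCookie) ? setCookie : [setCookie].filter(Boolean);
    const jar = new Map();

    for (const header of headers) {
        // Only the first "name=value" pair is the cookie; the rest are attributes.
        const [pair] = header.split(';');
        const eq = pair.indexOf('=');
        if (eq <= 0) continue; // skip malformed entries without a name
        jar.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
    }

    // Later cookies with the same name overwrite earlier ones.
    return [...jar].map(([name, value]) => `${name}=${value}`).join('; ');
}
```

The resulting string would go into the `Cookie` header of the product page request, which then has to be sent through the same session and proxy as the root page request.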

One last question: do you have an idea why it all seems to have stopped working last Friday? Up until then everything worked like a charm and only a few requests failed. Or is it coincidence that just more IPs have been burnt by someone firing too many requests?

Thanks,

Chris

wanton lion May 17, 2023, 11:02 PM

About the docs: we know that they aren't perfect, and we are constantly working on improving them. I'd say we have an examples section, but you are right, we definitely should add more small examples. Thanks for the feedback, we will definitely continue working on it 👍

About Friday: to be honest, I can't really come up with an explanation. It could be a coincidence, or it could be some change on Amazon's side. You could also let it cool off for a few days; if it starts working again after 2-3 days, maybe they just detected the addresses and temporarily blocked them...
