I have a scraper for Amazon pages. It worked fine for a month: each actor run made many calls, and all of them ultimately succeeded. An occasional 503 was returned, but retries fixed the problem. Since last Friday I have been getting a 503 for most calls. I was able to improve it a bit by:
- configuring proxies and retiring sessions
- switching to residential proxies
- adding more headers, manually rotating the user agent, etc.
- buying more IPs
Still, only 70% of my calls succeed. Before Friday everything was fine. Some requests have also started to return a CAPTCHA, which I had not seen before. Once a CAPTCHA is returned, there is no way of recovering from that failure.
I use a CheerioCrawler with the following proxy/session initialisation:
const proxyConfiguration = await Actor.createProxyConfiguration({ groups: ['RESIDENTIAL'], countryCode: 'DE' });
useSessionPool: true,
sessionPoolOptions: {
    sessionOptions: {
        maxUsageCount: 5,
        maxErrorScore: 1,
    },
},
...
When a call fails (either with a 503 or because I cannot find specific information on the page), I manually mark the session as bad in the errorHandler:
session.markBad();
In the log I see that the session is regenerated and a retry happens, but the subsequent calls fail too.
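To make the failure handling above concrete, here is a minimal sketch of how a blocked response could be detected and the session retired outright instead of only marked bad. The helper names `looksBlocked` and `handleResponse` are hypothetical, and the CAPTCHA marker string is an assumption about what the block page contains, not a documented contract. The `session` argument is assumed to expose `retire()` and `markGood()`, as Crawlee sessions do.

```javascript
// Assumed marker for a CAPTCHA page; adjust to whatever the blocked HTML actually contains.
const CAPTCHA_MARKERS = ['/errors/validateCaptcha'];

// Hypothetical helper: returns true when the response looks like a block.
function looksBlocked(statusCode, body) {
    if (statusCode === 503 || statusCode === 429) return true;
    const text = String(body || '').toLowerCase();
    return CAPTCHA_MARKERS.some((marker) => text.includes(marker.toLowerCase()));
}

// Hypothetical helper: retire the session on a block so it is not picked again,
// otherwise mark it good. Returns a short label that can be logged.
function handleResponse(session, statusCode, body) {
    if (looksBlocked(statusCode, body)) {
        session.retire();
        return 'retired';
    }
    session.markGood();
    return 'ok';
}
```

Something like `handleResponse(session, response.statusCode, body)` could then be called from the request handler, followed by throwing an error when it returns `'retired'` so that the request is retried with a different session.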
My questions:
- Does marking the session as bad make Apify use a new IP?
- Is there a way to debug the IPs? I can get the proxy information, but this is not the same as the actual IP of the request. If it is, it seems the IPs are not rotated on failure, since one failed request ends up failing on all 10 subsequent retries and all URLs are equal.
- How can I improve my scraping in any other way? Right now it is actually unusable, since 30% of my calls fail 😦
- Where can I find high-level architectural documentation on Apify? The majority of the docs seem to be autogenerated and rarely describe the flow/architecture of Apify/Cheerio.
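On the IP debugging question, one way to check whether rotation happens at all is to record the session ID and proxy URL for every attempt and then count how many distinct values each target URL has seen. The sketch below assumes those values are read from the crawling context (`request.url`, `request.retryCount`, `session.id`, `proxyInfo.url`) and passed in as plain values. `recordAttempt` and `summarizeRotation` are hypothetical helper names. A distinct proxy URL does not by itself prove a distinct exit IP, so this only shows whether the session is changing between retries.

```javascript
// Hypothetical helper: append one entry per attempt to an in-memory log.
function recordAttempt(log, { url, retryCount, sessionId, proxyUrl, ok }) {
    log.push({ url, retryCount, sessionId, proxyUrl, ok });
    return log;
}

// Hypothetical helper: for each target URL, count the attempts, the failures,
// and how many distinct sessions and proxy URLs were used across retries.
function summarizeRotation(log) {
    const byUrl = new Map();
    for (const attempt of log) {
        if (!byUrl.has(attempt.url)) {
            byUrl.set(attempt.url, { attempts: 0, failures: 0, sessions: new Set(), proxyUrls: new Set() });
        }
        const entry = byUrl.get(attempt.url);
        entry.attempts += 1;
        if (!attempt.ok) entry.failures += 1;
        entry.sessions.add(attempt.sessionId);
        entry.proxyUrls.add(attempt.proxyUrl);
    }
    return [...byUrl].map(([url, entry]) => ({
        url,
        attempts: entry.attempts,
        failures: entry.failures,
        distinctSessions: entry.sessions.size,
        distinctProxyUrls: entry.proxyUrls.size,
    }));
}
```

If `distinctSessions` stays at 1 while `attempts` grows, the retries are reusing the same session, which would explain a request failing on every retry.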
I would be very thankful for any help here.
Cheers,
Chris