#retryOnBlocked with HttpCrawler

silent frigate
#

Hi, I'm using the HttpCrawler to scrape a static list of URLs. When I get a 403 response caused by a Cloudflare challenge, the request is not retried if retryOnBlocked: true is set. If I remove retryOnBlocked, I see my errorHandler being invoked and the request is retried. Am I misunderstanding retryOnBlocked?

oblique owl
#

Hi @silent frigate, can you provide us with a minimal reproducible example?

iron sequoia
#

The errorHandler runs after each failed attempt, before the request is retried; the failedRequestHandler runs only once the maximum number of retries has been used up. Perhaps you want to move some logic from one to the other?
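To make the difference between the two handlers concrete, here is a minimal sketch of the retry flow as described above. It is a simplified model, not Crawlee's actual code: runWithRetries, task, and maxRetries are hypothetical names used only for illustration.

```javascript
// Simplified model of the retry flow; not Crawlee's implementation.
// runWithRetries, task and maxRetries are hypothetical names.
async function runWithRetries(task, { maxRetries, errorHandler, failedRequestHandler }) {
  for (let retryCount = 0; ; retryCount++) {
    try {
      // Each attempt receives the number of retries made so far.
      return await task(retryCount);
    } catch (error) {
      if (retryCount >= maxRetries) {
        // Retries exhausted: the failed request handler runs once, at the end.
        await failedRequestHandler(error, retryCount);
        return undefined;
      }
      // Another attempt will follow: the error handler runs before each retry.
      await errorHandler(error, retryCount);
    }
  }
}
```

In this model, with maxRetries set to 3 and a task that always throws, errorHandler fires three times and failedRequestHandler fires once. Per-attempt logic such as waiting or logging fits the first handler, and final bookkeeping such as pushing a failure record fits the second.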

silent frigate
#

@oblique owl Not sure if this reproduces it, but in my case it led to the described result:

import { HttpCrawler, log } from "crawlee";

// options, proxyConfiguration, config and urls are defined elsewhere in my project.
const crawler = new HttpCrawler(
  {
    maxConcurrency: 2,
    maxRequestsPerMinute: 180,
    ...options,
    proxyConfiguration,
    useSessionPool: true,
    persistCookiesPerSession: true,
    retryOnBlocked: true,
    additionalMimeTypes: ["text/plain", "application/pdf"],
    // Successful responses: record the URL and status code.
    async requestHandler({ pushData, request, response }) {
      await pushData({
        url: request.url,
        statusCode: response.statusCode,
      });
    },
    // Runs once retries are exhausted: record the final status (0 if there was no response).
    async failedRequestHandler({ pushData, request, response }) {
      log.error(`Request for URL "${request.url}" failed.`);
      await pushData({
        url: request.url,
        statusCode: response?.statusCode ?? 0,
      });
    },
    // Runs before each retry: wait with exponential backoff and ±50% jitter.
    async errorHandler({ request }, { message }) {
      log.error(`Request failed with ${message}`);
      if (!request.noRetry) {
        const baseWaitTime = Math.pow(2, request.retryCount) * 1000;
        const jitter = baseWaitTime * (Math.random() - 0.5);
        const waitTime = baseWaitTime + jitter;
        await new Promise((resolve) => setTimeout(resolve, waitTime));
      }
    },
  },
  config,
);
await crawler.run(urls);

Nothing really special. I have two proxies in my configuration: one in tier 1 and the second in tier 2.
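For reference, the wait time computed in the errorHandler above can be pulled out into a pure function, which makes the delay range easier to see. This is only a sketch: backoffWithJitter is a hypothetical helper name, and the random parameter is added here so the result can be checked deterministically.

```javascript
// Exponential backoff with ±50% jitter, same arithmetic as the errorHandler above.
// backoffWithJitter is a hypothetical helper; `random` is injectable for testing.
function backoffWithJitter(retryCount, random = Math.random) {
  const baseWaitTime = Math.pow(2, retryCount) * 1000; // 1 s, 2 s, 4 s, ...
  const jitter = baseWaitTime * (random() - 0.5);      // between -50% and +50% of base
  return baseWaitTime + jitter;                        // milliseconds
}
```

Since Math.random() returns a value in [0, 1), the wait lands between 0.5x and 1.5x of the base: roughly 0.5 to 1.5 seconds when retryCount is 0, 1 to 3 seconds when it is 1, and 2 to 6 seconds when it is 2.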

oblique owl
#

It depends on how the website is implemented.

If you get the captcha even in a regular browser without a proxy, then you cannot pass it with HttpCrawler alone; you may need to use a browser-based solution like PuppeteerCrawler or PlaywrightCrawler.

If you don't get the captcha in your browser, it could be down to the quality of the proxies you set up: the website serves the captcha only to visitors it considers suspicious (for example, an IP that belongs to a proxy).
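When checking which case applies, it can help to log whether a response was a challenge page or another kind of block. The sketch below is one possible heuristic, not something built into HttpCrawler: classifyResponse is a hypothetical helper, the list of status codes is an assumption, and it relies on Cloudflare setting the cf-mitigated: challenge response header on challenge pages. Header names are expected in lowercase.

```javascript
// Status codes treated as "blocked" here are an assumption for this sketch.
const BLOCKED_STATUS_CODES = [401, 403, 429];

// classifyResponse is a hypothetical helper; headers is a plain object with lowercase keys.
function classifyResponse(statusCode, headers = {}) {
  // Cloudflare marks challenge pages with a `cf-mitigated: challenge` header.
  const mitigated = String(headers["cf-mitigated"] ?? "").toLowerCase();
  if (mitigated === "challenge") return "challenge";
  if (BLOCKED_STATUS_CODES.includes(statusCode)) return "blocked";
  return "ok";
}
```

A "challenge" result for every proxy suggests a browser-based crawler is needed, while a "challenge" or "blocked" result that appears only through some proxies points at proxy quality.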