How to make crawlee try to refetch? | Apify & Crawlee | Page 1

finite jolt Nov 4, 2022, 12:45 PM

The HTTP API I'm crawling sometimes returns a response that doesn't meet my expectations, even though the HTTP status is 200.

How can I mark such a request as failed so that Crawlee fetches it again with the next proxy?

hybrid locust Nov 4, 2022, 7:44 PM

From what I understand, you want to make a request based on the data you receive from the initial request? If so, you can use the context object in the requestHandler to send a new request right away or to enqueue one, like this:

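import { HttpCrawler, RequestOptions } from '@crawlee/http';

const crawler = new HttpCrawler({
    async requestHandler({ crawler, sendRequest, request }) {
        // Send a request right away and get the response
        const { body } = await sendRequest({
            url: request.url,
        });

        // Placeholder check (replace with your own validation): stop here if the body looks fine.
        // Without a guard like this, every request would re-enqueue itself forever.
        if (String(body).length > 0) return;

        // RequestOptions with a custom uniqueKey so Crawlee does not treat it as a duplicate request.
        // The URL is part of the key so two URLs handled in the same millisecond do not collide.
        const newRequest: RequestOptions = {
            url: request.url,
            uniqueKey: `${request.url}#${Date.now()}`,
        };

        // Enqueue the request
        await crawler.addRequests([newRequest]);
    },
});

await crawler.run([
    'http://www.example.com/page-1',
    'http://www.example.com/page-2',
]);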

waxen bear Nov 4, 2022, 10:04 PM

Wouldn't it make Crawlee think it's a duplicate?

hybrid locust Nov 5, 2022, 6:57 AM

Good point. From my understanding, it shouldn't be a problem if you're using sendRequest for the new request, but if you're using crawler.addRequests, you will have to manually generate a uniqueKey for each RequestOptions to prevent it from being marked as a duplicate. I have updated my snippet to show how to do this.

waxen bear Nov 5, 2022, 8:29 AM

thanks

cursive thunder Nov 5, 2022, 7:21 PM

throw new Error("REASONOFRETRY")

fallen whale Nov 7, 2022, 5:25 PM

You can also call session.retire() before the throw to ensure the session is discarded. Normally, a thrown error only increases the session's error score.
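
To tie the last two replies together, here is a minimal sketch of "validate the body, retire the session, then throw". It is deliberately independent of Crawlee so it can be run on its own: `SessionLike`, `isExpectedPayload`, and `assertExpectedOrRetry` are hypothetical names, and the rule that a good response is JSON with a non-empty `items` array is only an assumption standing in for your own check.

```typescript
// Minimal structural type covering the only session method used here.
// Crawlee's Session has a retire() method, so it should fit this shape.
interface SessionLike {
    retire(): void;
}

// Hypothetical validity check: assumes a good response is JSON
// containing a non-empty `items` array. Replace with your own rule.
function isExpectedPayload(body: string): boolean {
    try {
        const parsed: unknown = JSON.parse(body);
        if (typeof parsed !== 'object' || parsed === null) return false;
        const items = (parsed as { items?: unknown }).items;
        return Array.isArray(items) && items.length > 0;
    } catch {
        return false; // the body is not JSON at all
    }
}

// Throws when the body is unusable so the caller's request counts as failed.
// The session (if any) is retired first so it is not reused for the retry.
function assertExpectedOrRetry(body: string, session?: SessionLike): void {
    if (isExpectedPayload(body)) return;
    session?.retire();
    throw new Error('Unexpected response body despite HTTP 200');
}
```

Inside a requestHandler this would be called with the `body` and `session` from the context, for example `assertExpectedOrRetry(body.toString(), session)`. An error thrown from the handler makes Crawlee retry the request, up to the crawler's maxRequestRetries setting. Whether the retry goes out through a different proxy depends on how your session pool and proxy configuration are set up.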
