❓ Help Needed: Downloading Linked PDF Files with Crawlee 🕸📥 | Apify & Crawlee | Page 1

quick oyster Jul 28, 2023, 6:49 PM

Hello everyone,

I need some help with Crawlee. I've been using CheerioCrawler to scrape pages and I've managed to extract links and store page titles and URLs into a dataset. Now I want to add functionality to download linked files, like PDFs, from the scraped pages. However, I'm unsure how to do this natively with Crawlee.

Here's my current code:


```js
import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler crawls the web using HTTP requests
// and parses HTML using the Cheerio library.
const crawler = new CheerioCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://cloudflare.net/events-and-presentations']);
```

Could anyone guide me on how to modify this code to download linked files, specifically PDFs, from the scraped pages? Any help would be appreciated, thank you!

quick oyster Aug 2, 2023, 7:41 AM

Can anyone help?

scenic hound Aug 2, 2023, 9:54 AM

Hello @quick oyster
Here is some code I haven't tested, but you should get the idea from it:

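```js
async requestHandler({ request, $, sendRequest }) {
    // Collect links ending with .pdf and resolve them against the page URL,
    // because href values are often relative.
    const urls = $('a[href]')
        .map((_, el) => $(el).attr('href'))
        .toArray()
        .filter((href) => /\.pdf$/i.test(href))
        .map((href) => new URL(href, request.loadedUrl ?? request.url).href);

    // for...of iterates over the URLs themselves (for...in would yield array indexes)
    for (const url of urls) {
        // Do a request for the PDF file and ask for binary data
        const pdfFileResponse = await sendRequest({ url, responseType: 'buffer' });

        const fileName = url.split('/').pop();

        // Actor comes from the `apify` package: import { Actor } from 'apify';
        await Actor.setValue(fileName, pdfFileResponse.rawBody, { contentType: 'application/pdf' });
    }
},
```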

Basically, it will store all the PDFs in the default key-value store, which is ./storage/key_value_stores/default when running locally.
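The link filtering and file naming from the snippet above can be pulled into two small helpers that are easy to test without running a crawl. This is only a sketch: `pdfLinksFrom` and `keyFor` are hypothetical names, and the set of characters allowed in key-value store keys is an assumption that should be checked against the storage documentation.

```js
// Characters assumed to be allowed in key-value store keys (an assumption, verify in the docs).
const UNSAFE_KEY_CHARS = /[^a-zA-Z0-9!\-_.'()]/g;

// Hypothetical helper: resolve raw href values against the page URL
// and keep only unique links whose path ends in .pdf.
function pdfLinksFrom(hrefs, pageUrl) {
    const links = new Set();
    for (const href of hrefs) {
        let resolved;
        try {
            resolved = new URL(href, pageUrl);
        } catch {
            continue; // skip malformed hrefs
        }
        // Test the pathname so that query strings and fragments do not hide the extension.
        if (/\.pdf$/i.test(resolved.pathname)) links.add(resolved.href);
    }
    return [...links];
}

// Hypothetical helper: derive a storage key from the last path segment of the PDF URL.
function keyFor(pdfUrl) {
    let name = new URL(pdfUrl).pathname.split('/').pop();
    try {
        name = decodeURIComponent(name);
    } catch {
        // keep the encoded name if it cannot be decoded
    }
    return name.replace(UNSAFE_KEY_CHARS, '_').slice(0, 256);
}
```

Inside the request handler, the hrefs collected with Cheerio would be passed to `pdfLinksFrom` together with `request.loadedUrl`, and `keyFor(url)` would replace the `url.split('/')` line.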

quick oyster Aug 2, 2023, 10:05 AM

Hi Pepa, thanks, very helpful! Do you have any hints on how to use Firebase Storage instead of the local key-value store? My goal is to analyze the PDFs with an LLM and store the results in a vector database.

scenic hound Aug 2, 2023, 10:09 AM

I believe there is an npm package for Firebase with proper documentation, but I have no personal experience with it.
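For anyone who lands here with the same question, a rough, untested sketch of the upload step might look like the following. It assumes the `firebase-admin` package, whose storage buckets are Cloud Storage buckets with a `file(path).save(data, options)` method. `storagePathFor` and `uploadPdf` are hypothetical names, and the bucket is passed in as a parameter so the snippet does not depend on any particular project setup.

```js
// Hypothetical helper: build the object path under which a PDF is stored,
// for example pdfs/example.com/report.pdf.
function storagePathFor(pdfUrl, prefix = 'pdfs') {
    const { hostname, pathname } = new URL(pdfUrl);
    const fileName = pathname.split('/').pop() || 'document.pdf';
    return `${prefix}/${hostname}/${fileName}`;
}

// `bucket` is assumed to come from the Firebase Admin SDK, roughly like this:
//   import { initializeApp } from 'firebase-admin/app';
//   import { getStorage } from 'firebase-admin/storage';
//   initializeApp({ storageBucket: '<your-bucket-name>' });
//   const bucket = getStorage().bucket();
async function uploadPdf(bucket, pdfUrl, buffer) {
    const path = storagePathFor(pdfUrl);
    // Write the downloaded bytes to the bucket with a PDF content type.
    await bucket.file(path).save(buffer, { contentType: 'application/pdf' });
    return path;
}
```

In the request handler, `await uploadPdf(bucket, url, pdfFileResponse.rawBody)` would take the place of the `Actor.setValue` call.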
