#which browser is the best to crawl

torpid grail

As the title says.

I'm currently using Chromium, but it is CPU-heavy.

Killing the browser does not kill the process, so it's easy to hit 100% CPU usage pretty quickly.

(I'm crawling thousands of websites, looking for different data on each.) I already tried loading pure HTML without CSS, images and other assets. That helped a lot, but the issue is still there.

true yacht
#

Hi @torpid grail
I also recommend blocking unnecessary network requests with the blockRequests helper.
Make sure that you are running it in headless mode.
You could also try using cheerio if the use case allows it.

Regarding your question about the browser:
Firefox tends to be lighter on CPU usage.

Crawlee helps you build and maintain your crawlers. It's open source, but built by developers who scrape millions of pages every day for a living.

torpid grail

yes I already do that

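    import { playwrightUtils, type PlaywrightLaunchContext } from 'crawlee'
    import { firefox } from 'playwright'

    const launchContext: PlaywrightLaunchContext = {
      launcher: firefox,
      launchOptions: {
        headless: false, // set to true to run headless, as suggested above
        // Note: these are Chromium command-line switches
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox',
          '--disable-dev-shm-usage',
        ],
      },
      useChrome: false, // do not use a locally installed Chrome; the launcher above picks the browser
      userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
    }
    ...
      launchContext,
      preNavigationHooks: [
        async ({ page }) => {
          // Block images, fonts and the ad script before each navigation
          await playwrightUtils.blockRequests(page, {
            urlPatterns: [
              '.png',
              '.jpg',
              '.jpeg',
              '.gif',
              '.svg',
              '.ico',
              '.woff',
              '.woff2',
              'adsbygoogle.js',
            ],
            extraUrlPatterns: ['adsbygoogle.js'], // redundant: already listed in urlPatterns
          })

          await playwrightUtils.closeCookieModals(page)
        },
      ],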

Unfortunately I receive: WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers.

I didn't know that 😄
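
For anyone hitting the same warning with Firefox: one possible workaround is to block requests with Playwright's own `page.route()` instead of the `blockRequests()` helper, since routing is not tied to Chromium. The snippet below is only a rough sketch, not something confirmed in this thread. The names `shouldBlock`, `blockRequestsViaRoute`, `PageLike`, `RouteLike` and `RequestLike` are made up for illustration; the small interfaces are meant to mirror the parts of Playwright's `Page`, `Route` and `Request` that are used, so the sketch compiles without importing Playwright. The list of blocked resource types is an assumption based on the assets mentioned above.

```typescript
// Hypothetical minimal types mirroring the parts of Playwright's API used below.
interface RequestLike {
    url(): string;
    resourceType(): string;
}

interface RouteLike {
    request(): RequestLike;
    abort(): Promise<void>;
    continue(): Promise<void>;
}

interface PageLike {
    route(url: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Assumption: these types cover the assets discussed in the thread (images, fonts, CSS).
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'font', 'media', 'stylesheet']);
// Plain substrings matched against the request URL.
const BLOCKED_URL_SUBSTRINGS = ['adsbygoogle.js'];

// Decide whether a request should be aborted, by resource type or URL substring.
function shouldBlock(url: string, resourceType: string): boolean {
    if (BLOCKED_RESOURCE_TYPES.has(resourceType)) return true;
    return BLOCKED_URL_SUBSTRINGS.some((part) => url.includes(part));
}

// Register a catch-all route that aborts blocked requests and lets the rest through.
async function blockRequestsViaRoute(page: PageLike): Promise<void> {
    await page.route('**/*', async (route) => {
        const request = route.request();
        if (shouldBlock(request.url(), request.resourceType())) {
            await route.abort();
        } else {
            await route.continue();
        }
    });
}
```

In a crawler this would presumably be called from a pre-navigation hook, e.g. `await blockRequestsViaRoute(page)` in place of the `blockRequests()` call. Keep in mind that routing has some overhead of its own, because every request passes through the handler, so it is worth measuring CPU usage before and after.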