# Scraping a single page with a load more button


knotty briar Sep 26, 2022, 6:39 AM

Hi, I just discovered Crawlee and it seems like a great project.

I'm scraping a single URL (https://jobs.workable.com/search) that contains a list of items with a load more button. Each time an item is clicked, a floating modal shows the item's information.

In this scenario, Crawlee's ability to remember visited URLs, retry requests, etc. is not much help.

My idea is:

From the start page, click on each of the initial items and scrape its content.
Click on the load more button and repeat the process.

The help I'm requesting is how to apply best practices for:

how to "remember/store" the last scraped item index/id
how to handle errors

Thanks in advance

astral holly Sep 26, 2022, 9:15 AM

I'd recommend checking out the infiniteScroll function in Crawlee:
https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#infiniteScroll

Your use case with a load more button can be solved by using the buttonSelector option, which checks for a button and clicks it if it appears while scrolling.

See more in the docs: https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#buttonSelector
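
To make the buttonSelector suggestion concrete, here is a minimal sketch rather than a definitive implementation. It assumes the scroll function is passed in, so the snippet runs without a browser; in a real Crawlee handler you would pass `puppeteerUtils.infiniteScroll` and the handler's `page`. The `ScrollOptions` and `ScrollFn` types, the `expandFullList` helper, and the `button.load-more` selector are hypothetical names introduced for illustration. The option names mirror the ones listed in the docs linked above.

```typescript
// Subset of the options accepted by Crawlee's `puppeteerUtils.infiniteScroll`.
interface ScrollOptions {
    timeoutSecs?: number;    // how long to keep scrolling; 0 means no time limit
    waitForSecs?: number;    // how long to wait for new content before giving up
    buttonSelector?: string; // clicked whenever it is found while scrolling
}

// Stand-in for the shape of `puppeteerUtils.infiniteScroll` (assumption:
// it is injected here only so the sketch can run without Puppeteer).
type ScrollFn<P> = (page: P, options?: ScrollOptions) => Promise<void>;

// Hypothetical selector: inspect the real page to find the actual button.
const LOAD_MORE_SELECTOR = 'button.load-more';

// Scrolls the page and clicks the "load more" button until no new
// content shows up or the time limit is reached.
async function expandFullList<P>(page: P, infiniteScroll: ScrollFn<P>): Promise<ScrollOptions> {
    const options: ScrollOptions = {
        timeoutSecs: 120,
        waitForSecs: 4,
        buttonSelector: LOAD_MORE_SELECTOR,
    };
    await infiniteScroll(page, options);
    return options;
}
```

Inside a request handler this would be called roughly as `await expandFullList(page, puppeteerUtils.infiniteScroll)`. The time limit of 120 seconds is an arbitrary value chosen for the sketch.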

puppeteerUtils | API | Crawlee

A namespace that contains various utilities for
Puppeteer - the headless Chrome Node API.

Example usage:

import { launchPuppeteer, puppeteerUtils } from 'crawlee';

// Open https://www.example.com in Puppeteer
const browser = await launchPuppeteer();
const page = await browser.newPage...

Clicking each item on the page can be done in the stopScrollCallback:

https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#buttonSelector
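
The stopScrollCallback idea can be sketched as follows. This is only an illustration under stated assumptions: the list only ever grows at the end, so a running index is enough to "remember" the last scraped item; `PageLike` and `ItemHandle` are structural stand-ins loosely modelled on Puppeteer's `page.$$()` and `ElementHandle.click()`; `readModal` is a hypothetical function you would write to wait for the modal and extract its data. Errors on a single item are recorded and skipped instead of aborting the whole run.

```typescript
// Minimal structural stand-ins (assumptions, not the real Puppeteer types).
interface ItemHandle {
    click(): Promise<void>;
}
interface PageLike {
    $$(selector: string): Promise<ItemHandle[]>;
}

// Progress kept outside the callback so it could be persisted between runs.
interface ScrapeState<T> {
    nextIndex: number;       // first list position not yet processed
    results: T[];            // data extracted so far
    failedIndexes: number[]; // positions whose click or extraction threw
}

// Builds a callback that processes only the items added since the last call.
// Returning true tells the scroll helper to stop.
function makeStopScrollCallback<T>(
    page: PageLike,
    itemSelector: string,
    readModal: () => Promise<T>, // hypothetical: reads the currently open modal
    state: ScrapeState<T>,
    maxItems: number,
): () => Promise<boolean> {
    return async () => {
        const items = await page.$$(itemSelector);
        for (const item of items.slice(state.nextIndex)) {
            if (state.results.length >= maxItems) break;
            const index = state.nextIndex;
            try {
                await item.click();
                state.results.push(await readModal());
            } catch {
                // Remember the failure and move on to the next item.
                state.failedIndexes.push(index);
            }
            state.nextIndex = index + 1;
        }
        return state.results.length >= maxItems;
    };
}
```

The returned function would be passed as the `stopScrollCallback` option. Keeping `state` as a plain object means it could be saved somewhere durable, for example a key-value store, and the entries in `failedIndexes` could be retried afterwards.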
