#Scrape the subpages of a website: depth variable possible?


river summit
#

Hi guys, I searched for this but could not find an answer. I am building a web crawler, which I have currently implemented using Puppeteer only. There I can use custom JS to control the depth of a query, basically how often the recursive function is called.

I tried Crawlee and it is so good! Unfortunately, it collects way too many links, including some I don't need. Hence my question:

Is it possible to set the depth of the crawler? E.g., it should only scrape the links found on the website and not explore those links further.

Thanks!!

sharp ravine
#

Yes, it is possible. Just enqueue the links with a label, then create a dedicated handler function for that label. In that handler you know that you do not want to enqueue anything more.
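A minimal sketch of the label idea, in case it helps. The label name `SUBPAGE` and the helper names `handleStartPage`, `handleSubpage` and `pickHandler` are made up for illustration. The handlers are plain functions that receive the usual crawling context (`request`, `enqueueLinks`, `log`), so the logic can be read without running a crawler:

```javascript
// Start page: enqueue the links found here once, tagging each with a label.
async function handleStartPage({ request, enqueueLinks, log }) {
  log.info(`Start page: ${request.url}`);
  await enqueueLinks({ label: 'SUBPAGE' });
}

// Labelled subpage: scrape it, but deliberately never call enqueueLinks,
// so the crawl stops one level below the start page.
async function handleSubpage({ request, log }) {
  log.info(`Subpage: ${request.url}`);
}

// Hypothetical helper: choose the handler from the request's label.
function pickHandler(request) {
  return request.label === 'SUBPAGE' ? handleSubpage : handleStartPage;
}

// Single entry point that could be passed to the crawler as requestHandler.
async function requestHandler(context) {
  await pickHandler(context.request)(context);
}

// Possible wiring, assuming crawlee is installed:
//   import { PuppeteerCrawler } from 'crawlee';
//   const crawler = new PuppeteerCrawler({ requestHandler });
//   await crawler.run(['http://www.url.de']);
```

Crawlee's router (`createPuppeteerRouter` with `addDefaultHandler` and `addHandler`) can do the same dispatch for you. The manual `pickHandler` is only there to keep the sketch short.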

hidden mortar
#

Hi @river summit, you may also use `transformRequestFunction` to set up `userData`, which is passed on to the next-level requests. This way you can set `userData.level = 0` when it is not set, and otherwise `userData.level = request.userData.level + 1`.

Then you can add a condition for enqueueing:

if (!request.userData.level || request.userData.level < 5) {
  await enqueueLinks({...});
}

That way it stops queueing new requests once the limit is reached.

I am not sure which version of the SDK the scraper uses, but you may check https://docs.apify.com/sdk/js/docs/2.3/api/utils to at least get the idea.
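A rough sketch of the level bookkeeping described above. The helpers `currentLevel` and `stampChildLevel` and the constant `MAX_LEVEL` are illustrative names, not part of any SDK. The point is that the child's level is derived from the page being handled, not from the freshly created request:

```javascript
const MAX_LEVEL = 5; // same cut-off as in the condition above

// Level of the page currently being handled.
// Start requests may carry no level at all, so treat that as 0.
function currentLevel(request) {
  return request.userData?.level ?? 0;
}

// Returns a function usable as transformRequestFunction: it stamps every
// newly enqueued request with the parent's level + 1 and keeps any other
// userData the new request already has.
function stampChildLevel(parentRequest) {
  const level = currentLevel(parentRequest) + 1;
  return (newRequest) => ({
    ...newRequest,
    userData: { ...newRequest.userData, level },
  });
}

// Inside a requestHandler this could be used as:
//   if (currentLevel(request) < MAX_LEVEL) {
//     await enqueueLinks({ transformRequestFunction: stampChildLevel(request) });
//   }
```

With `MAX_LEVEL` set to 1, only the links on the start page would be enqueued and none of their links followed.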


river summit
#

Thank you both @sharp ravine @hidden mortar for the provided sources, I'll check it out. Big thx!

river summit
# hidden mortar Hi @river summit , you may also use `transformRequestFunction` to setup ...

Something like this is what I worked out, but it's not working for the depth. It still gives me way too many links. Any ideas?

```
import { PuppeteerCrawler } from "crawlee";

const crawler = new PuppeteerCrawler({
  async requestHandler({ request, enqueueLinks, log }) {
    log.info(request.url);

    if (!request.userData.level || request.userData.level < 0) {
      await enqueueLinks({
        transformRequestFunction: (request) => {
          return {
            ...request,
            userData: {
              level: request.userData.level + 1,
            },
          };
        },
      });
    }
  },
});

await crawler.run([
  {
    url: "http://www.url.de",
    userData: {
      level: 0,
    },
  },
]);
```

hidden mortar
#

@river summit This should work. Does it also happen on other websites?
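One possible cause, offered as a guess and not a confirmed diagnosis: in the snippet above, the `request` parameter of `transformRequestFunction` shadows the handler's `request`. The inner `request` is the new link being enqueued, so its `userData.level` is probably not set. Assuming the new request arrives with an empty `userData`, `undefined + 1` gives `NaN`, and `!NaN` is `true`, so the guard lets every page through. The sketch below reproduces this with plain objects; `guardPasses`, `shadowed` and `makeFixed` are illustrative names:

```javascript
// Guard copied from the snippet above.
const guardPasses = (userData) => !userData.level || userData.level < 0;

// Shadowed version: `request` is the NEW request, assumed to have empty userData.
const shadowed = (request) => ({
  ...request,
  userData: { level: request.userData.level + 1 }, // undefined + 1 is NaN
});

// Possible fix: capture the parent's level outside and give the parameter its own name.
const makeFixed = (parentLevel) => (newRequest) => ({
  ...newRequest,
  userData: { ...newRequest.userData, level: parentLevel + 1 },
});

// Hypothetical link found on the start page (level 0).
const fresh = { url: 'http://www.url.de/page', userData: {} };

const broken = shadowed(fresh);
const fixed = makeFixed(0)(fresh);

console.log(Number.isNaN(broken.userData.level), guardPasses(broken.userData)); // true true
console.log(fixed.userData.level, guardPasses(fixed.userData)); // 1 false
```

In the handler itself, that would mean reading `const level = request.userData.level ?? 0` before calling `enqueueLinks` and using `level + 1` inside a transform whose parameter is not named `request`.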