Website Content Crawler Actor - Get access to failed urls | Apify & Crawlee | Page 1

snow ore Feb 28, 2025, 7:16 AM


Hi,
I am using the Actor Website Content Crawler (apify/website-content-crawler) to scrape a few thousand URLs. These are a predetermined list of URLs, so the depth is set to 0. I saw a few of these fail. Is there any way to get access to these failed URLs from either the Apify site UI or the integration? The Dataset under Storage only gives the successful URLs.

mental jacinth Feb 28, 2025, 7:16 AM


Someone will reply to you shortly. In the meantime, this might help:

normal willow Feb 28, 2025, 12:43 PM


Hello! Unfortunately, I can't find any option for that. I see that an issue has already been opened for the Actor, which is what I was going to suggest. In the meantime, if you have a fixed list of URLs, you could compare it to the Actor's output, but whether that would be acceptable depends on your use case.
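
For reference, the comparison could look roughly like the sketch below. It is only an illustration under a few assumptions, not a feature of the Actor: it assumes the dataset has been exported as a list of JSON objects that each carry the page address in a `url` field, and that a light normalization (lowercased host, no trailing slash, no fragment) is enough to match input URLs to output URLs. The names `normalize` and `find_missing_urls` are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Normalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    # Lowercase scheme and host, drop trailing slashes and the fragment,
    # keep the query string untouched.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"), parts.query, "")
    )


def find_missing_urls(input_urls, dataset_items, url_field="url"):
    """Return the input URLs that have no matching item in the dataset.

    `url_field` is an assumption about the dataset item layout; adjust it
    if the exported items store the address under a different key.
    """
    scraped = {
        normalize(item[url_field]) for item in dataset_items if item.get(url_field)
    }
    missing, seen = [], set()
    for url in input_urls:
        key = normalize(url)
        if key not in scraped and key not in seen:
            seen.add(key)  # report each missing URL only once
            missing.append(url)
    return missing


if __name__ == "__main__":
    # Placeholder data standing in for the real input list and dataset export.
    input_urls = [
        "https://example.com/a",
        "https://example.com/b/",
        "https://example.com/c",
    ]
    dataset_items = [
        {"url": "https://example.com/a/"},
        {"url": "https://EXAMPLE.com/b"},
    ]
    print(find_missing_urls(input_urls, dataset_items))
```

One caveat: if a page redirects, the URL stored in the dataset may differ from the one in the input list, so such pages could show up as missing even though they were scraped.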

snow ore Mar 1, 2025, 8:33 AM


Hey @normal willow, yeah, we've written a local script to do the comparison and extract the failed URLs. I was hoping that if that data were released in the Dataset, we could integrate it easily with Google Sheets, so we could just copy those URLs and rerun the failed ones. This would allow us to run multiple instances of the Actor without having to track each set and run the comparison script against its respective input URL list.
If this were being done for crawler-based scraping instead of a fixed list, it would be a much bigger challenge.
I got an email suggesting another Actor for retrying failed URLs, but that's not the only way we intend to use it. It would be a minor but impactful change to have it as part of the Actor itself.


There's also another issue: the Actor starts off with low RAM usage but maxes out (16 GB) after a few hours of runtime, which runs up our bills.
