#Crawler becomes idle after some time (queue not empty)


buoyant shard

Hi,

I'm struggling to understand why my crawler goes idle after some time. I have tried multiple proxy providers (also mixed), so it doesn't look like proxy throttling to me. The crawler goes idle (i.e. neither crawling nor scraping), but the request queue (v1) isn't empty and the CPU is quite busy. When it starts running again, the CPU usage seems to drop. The idle period can last 5 minutes or so before the crawler resumes. Statistics and the AutoscaledPool report this:

INFO  Statistics: null request statistics: {"requestAvgFailedDurationMillis":5624,"requestAvgFinishedDurationMillis":6949,"requestsFinishedPerMinute":90,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2293976046,"requestsTotal":330121,"crawlerRuntimeMillis":219208224,"retryHistogram":[283519,37607,7147,1483,296,69]}
2024-03-11 16:09:49.748 INFO  AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}

Maybe worth mentioning: my request queue (a single one) is now 2.7 GB.

Any suggestions as to what might be happening? Is it some sort of "clean up" on the queue?

EDIT:
This really seems to be RequestQueue related. I reached a point where the crawler won't crawl anymore and I get this:

WARN  RequestQueue: The request queue seems to be stuck for 300s, resetting internal state. {"inProgress":[]}

Any suggestions? I guess removing already crawled URLs from the queue would be a solution, though I'm not sure it's a good one.

Thank you!

tribal grove

Sounds like you may be in an infinite loop. Do you have any while loops in your code? Have you stepped through it with a debugger?

snow shale

Did you run this on Apify or locally? 👀

If you ran it on Apify, can you share a run link? (DMs work too.)

buoyant shard

@snow shale locally (or on my own server, if that matters). The issue really does seem to be linked to the very big request queue. It gets to a point where the code takes more than 300s to handle it, and so it aborts. I'm now cleaning the request queue periodically (actually deleting a bunch of files/URLs that have already been crawled) to make sure it doesn't get huge... not ideal though.
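
For anyone landing here with the same problem, the periodic cleanup described above could look roughly like the sketch below. It is only a sketch under assumptions that are not confirmed in this thread: that the local request queue is a directory with one JSON file per request, that a handled request is stored with `orderNo: null`, and that metadata files start with `__`. `pruneHandledRequests` is a hypothetical helper, not a Crawlee API, so check a few of your own queue files before running anything like it.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper, NOT part of Crawlee.
// ASSUMPTION: one JSON file per request, and handled requests have
// `orderNo: null`. Verify this against your own storage files first.
interface StoredRequest {
  orderNo?: number | null;
}

function pruneHandledRequests(queueDir: string): number {
  let removed = 0;
  for (const name of fs.readdirSync(queueDir)) {
    // ASSUMPTION: files starting with "__" hold queue metadata, so skip them.
    if (!name.endsWith(".json") || name.startsWith("__")) continue;
    const file = path.join(queueDir, name);
    try {
      const entry = JSON.parse(fs.readFileSync(file, "utf8")) as StoredRequest | null;
      // Only delete entries explicitly marked as handled.
      if (entry !== null && typeof entry === "object" && entry.orderNo === null) {
        fs.unlinkSync(file);
        removed++;
      }
    } catch {
      // Unreadable or partially written file: leave it alone.
      continue;
    }
  }
  return removed;
}
```

Run it only while the crawler is stopped, e.g. `pruneHandledRequests("./storage/request_queues/default")` (that path is also an assumption about the default layout). Keep in mind that deleting handled entries also deletes the record that those URLs were already visited, so they can be enqueued and crawled again later.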