Hi,
I'm struggling to understand why my crawler goes idle after some time. I have tried multiple proxy providers (and a mix of them), so it doesn't look like proxy throttling to me. The crawler becomes idle (i.e. neither crawling nor scraping) even though the request queue (v1) isn't empty, and the CPU is quite busy during that time. When it starts running again, the CPU usage drops. The idle period can last 5 minutes or so before the crawler resumes, which is quite a long time. Statistics and the AutoscaledPool report this:
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":5624,"requestAvgFinishedDurationMillis":6949,"requestsFinishedPerMinute":90,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2293976046,"requestsTotal":330121,"crawlerRuntimeMillis":219208224,"retryHistogram":[283519,37607,7147,1483,296,69]}
2024-03-11 16:09:49.748 INFO AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
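To sanity-check those two log lines, here is a minimal sketch in TypeScript that only does arithmetic on the logged values. It does not call any Crawlee API. The variable names and the "starved" check are my own, and treating retryHistogram[i] as "requests that needed i retries" is my assumption about the field.

```typescript
// Values copied from the Statistics and AutoscaledPool log lines above.
// All names below are my own (hypothetical), not part of Crawlee.
const stats = {
  requestsTotal: 330121,
  requestTotalDurationMillis: 2293976046,
  crawlerRuntimeMillis: 219208224,
  // Assumption: index i = number of requests that needed i retries.
  retryHistogram: [283519, 37607, 7147, 1483, 296, 69],
};

const poolState = {
  currentConcurrency: 0,
  desiredConcurrency: 20,
  // isOverloaded flags for memInfo, eventLoopInfo, cpuInfo, clientInfo
  overloadFlags: [false, false, false, false],
};

// The histogram should account for every request.
const histogramTotal = stats.retryHistogram.reduce((sum, n) => sum + n, 0);

// Share of requests that needed at least one retry.
const retried = stats.retryHistogram.slice(1).reduce((sum, n) => sum + n, 0);
const retriedShare = retried / histogramTotal;

// Average request duration and long-run throughput over the whole run.
const avgDurationMillis = stats.requestTotalDurationMillis / stats.requestsTotal;
const perMinute = stats.requestsTotal / (stats.crawlerRuntimeMillis / 60000);

// "Starved": the pool wants to run tasks, nothing is overloaded, yet nothing runs.
const anyOverloaded = poolState.overloadFlags.some((flag) => flag);
const starved =
  poolState.currentConcurrency === 0 &&
  poolState.desiredConcurrency > 0 &&
  !anyOverloaded;

console.log({ histogramTotal, retriedShare, avgDurationMillis, perMinute, starved });
```

The histogram adds up to requestsTotal (330121), roughly 14% of requests needed at least one retry, and the long-run rate works out to about 90 requests per minute, which matches requestsFinishedPerMinute. The pool wants 20 concurrent tasks, reports no overload, and still runs 0, so it looks like it is waiting for requests rather than being throttled by system load.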
Maybe worth mentioning: my request queue (a single one) is now 2.7 GB in size.
Any suggestions as to what might be happening? Is it some sort of "clean up" on the queue?
EDIT:
This really seems to be RequestQueue related. I reached a point where the crawler won't crawl anymore, and I get this:
WARN RequestQueue: The request queue seems to be stuck for 300s, resetting internal state. {"inProgress":[]}
Any suggestions? I guess removing already crawled URLs from the queue would be a solution, though I'm not sure it's a good one.
Thank you!