fair-rose•2y ago

Crawler becomes idle after some time (queue not empty)

Hi, I'm struggling to understand why my crawler goes idle after some time. I have tried multiple proxy providers (and mixed them), so it doesn't look like proxy throttling to me. The crawler goes idle (i.e. neither crawling nor scraping) even though the request queue (v1) isn't empty, and the CPU is quite busy; when it starts running again, the CPU usage seems to drop. The crawler can stay idle for 5 minutes or so, which is quite a long time, before it resumes. Statistics and the AutoscaledPool report this:
INFO Statistics: null request statistics: {"requestAvgFailedDurationMillis":5624,"requestAvgFinishedDurationMillis":6949,"requestsFinishedPerMinute":90,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2293976046,"requestsTotal":330121,"crawlerRuntimeMillis":219208224,"retryHistogram":[283519,37607,7147,1483,296,69]}
2024-03-11 16:09:49.748 INFO AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Maybe worth mentioning: my (single) request queue is now at 2.7 GB. Any suggestion what might be happening? Is it some sort of "clean up" on the queue? EDIT: This really seems to be RequestQueue related. I've reached a point where the crawler won't crawl anymore and I get this:
WARN RequestQueue: The request queue seems to be stuck for 300s, resetting internal state. {"inProgress":[]}
Any suggestions? I guess removing already crawled URLs from the queue would be a solution, though I'm not sure it's a good one. Thank you!
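In case it's useful, this is roughly how I'm keeping an eye on the queue counters while it runs (a minimal sketch using the public `RequestQueue.getInfo()` API; the one-minute interval is just my own scaffolding):

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open(); // the single default queue

// Log queue counters every minute to correlate growth with the stalls.
setInterval(async () => {
    const info = await queue.getInfo();
    console.log(
        `queue: total=${info?.totalRequestCount}`
        + ` handled=${info?.handledRequestCount}`
        + ` pending=${info?.pendingRequestCount}`,
    );
}, 60_000);
```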
4 Replies
ratty-blush•2y ago
Sounds like you may be in an infinite loop. Do you have any while loops in your code? Have you stepped through it with a debugger?
flat-fuchsia•2y ago
Did you run this on Apify or locally? 👀 If you did, can you share a run link? (DMs work too)
fair-roseOP•2y ago
@vladdy locally (or rather on my own server, if that matters). The issue really does seem linked to the very big request queue. It gets to a point where the code takes more than 300s to handle it, so it aborts. I'm now cleaning the request queue periodically (actually deleting a bunch of files/URLs that have already been crawled) to make sure it doesn't get huge... not ideal, though.
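For reference, the cleanup is roughly this. It's a rough sketch only: it assumes Crawlee's local storage layout of one JSON file per request under storage/request_queues/default, with orderNo becoming null once a request is handled, so verify that against your Crawlee version before deleting anything:

```ts
import { readdir, readFile, unlink } from 'node:fs/promises';
import { join } from 'node:path';

// Assumed location of the default local request queue (verify first!).
const QUEUE_DIR = './storage/request_queues/default';

async function purgeHandledRequests(): Promise<void> {
    for (const file of await readdir(QUEUE_DIR)) {
        // Skip the queue's metadata file and anything that isn't a request.
        if (!file.endsWith('.json') || file.startsWith('__')) continue;
        const path = join(QUEUE_DIR, file);
        const record = JSON.parse(await readFile(path, 'utf8'));
        // In this layout, a handled request has orderNo === null.
        if (record.orderNo === null) await unlink(path);
    }
}
```

The obvious downside is that the queue forgets which URLs it has already handled, so deduplication of re-enqueued links is lost, which is part of why I don't consider it a proper fix.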
Lukas Krivka•2y ago
Yeah, the Crawlee team needs to fix this
