fair-rose•2y ago
Crawler becomes idle after some time (queue not empty)
Hi,
I'm struggling to understand why my crawler goes idle after some time. I have tried multiple proxy providers (and mixed ones), so it doesn't seem to be proxy throttling. The crawler becomes idle (i.e. neither crawling nor scraping) even though the request queue (v1) isn't empty, and the CPU is quite busy; when crawling resumes, the CPU usage drops again. The crawler can stay idle for as long as 5 minutes before it resumes. Statistics and the AutoscaledPool report this:
Maybe worth mentioning: my request queue (a single one) is now at 2.7 GB.
Any suggestion as to what might be happening? Is it some sort of "clean-up" on the queue?
EDIT:
This really seems RequestQueue-related. I've reached a point where the crawler won't crawl anymore and I get this:
Any suggestions? I guess removing already-crawled URLs from the queue would be a solution, though I'm not sure it's a good one.
Thank you!
ratty-blush•2y ago
Sounds like you may be in an infinite loop. Do you have any while loops in your code? Have you stepped through it with a debugger?
flat-fuchsia•2y ago
Did you run this on Apify or locally? 👀
If you did, can you share a run link? (DMs work too.)
fair-roseOP•2y ago
@vladdy locally (or rather on my own server, if that matters). The issue really does seem to be linked to the very large request queue. It gets to a point where handling the queue takes more than 300s, so it aborts. I'm now cleaning the request queue periodically (actually deleting a bunch of files/URLs that have already been crawled) to make sure it doesn't get huge... not ideal though.
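For reference, a minimal sketch of that kind of periodic cleanup. It assumes the default local storage layout where each request is stored as one JSON file under `storage/request_queues/default/`, and that handled requests can be recognized by a null `orderNo` field; both of these are assumptions about the on-disk format, not documented Crawlee API, so verify against your own queue files before deleting anything:
```ts
// prune-queue.ts -- hedged sketch, not an official Crawlee API.
// Assumptions: requests live one-per-file as JSON under QUEUE_DIR, and a
// handled request has `orderNo === null` (check your own files first!).
import { readdir, readFile, unlink } from 'node:fs/promises';
import { join } from 'node:path';

// Hypothetical path; adjust to your storage directory.
const QUEUE_DIR = 'storage/request_queues/default';

async function pruneHandledRequests(): Promise<void> {
    const files = await readdir(QUEUE_DIR);
    let removed = 0;

    for (const file of files) {
        // Only consider request JSON files; skip metadata-style files.
        if (!file.endsWith('.json') || file.startsWith('__')) continue;
        const fullPath = join(QUEUE_DIR, file);
        try {
            const entry = JSON.parse(await readFile(fullPath, 'utf8'));
            // Assumption: a null orderNo marks the request as already handled.
            if (entry.orderNo === null) {
                await unlink(fullPath);
                removed++;
            }
        } catch {
            // Ignore files that are mid-write or not valid JSON.
        }
    }
    console.log(`Pruned ${removed} handled request files from ${QUEUE_DIR}`);
}

// Run periodically, e.g. every 10 minutes.
setInterval(() => void pruneHandledRequests(), 10 * 60 * 1000);
```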
Yeah, the Crawlee team needs to fix this