Apify & Crawlee • 2y ago • 5 replies
sad-indigo

Crawler becomes idle after some time (queue not empty)

Hi,

I'm struggling to understand why my crawler goes idle after some time. I have tried multiple proxy providers (and a mix of them), so it doesn't look like proxy throttling to me. The crawler becomes idle (i.e. neither crawling nor scraping), but the request queue (v1) isn't empty and the CPU is quite busy. When it starts running again, the CPU usage seems to drop. The crawler can stay idle for 5 minutes or so, which is quite a long time, before it resumes. Statistics and the AutoscaledPool report this:

INFO  Statistics: null request statistics: {"requestAvgFailedDurationMillis":5624,"requestAvgFinishedDurationMillis":6949,"requestsFinishedPerMinute":90,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2293976046,"requestsTotal":330121,"crawlerRuntimeMillis":219208224,"retryHistogram":[283519,37607,7147,1483,296,69]}
2024-03-11 16:09:49.748 INFO  AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
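
One way to sanity-check these two log lines is to parse their JSON payloads and derive a few figures from them. The sketch below is only that: the helper names `summarize` and `pool_is_starved` are made up for illustration, and it assumes that `retryHistogram[i]` counts requests that needed `i` retries and that `requestsFinishedPerMinute` is averaged over the whole run.

```python
import json

# Payloads copied verbatim from the two log lines above.
STATS_JSON = '{"requestAvgFailedDurationMillis":5624,"requestAvgFinishedDurationMillis":6949,"requestsFinishedPerMinute":90,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2293976046,"requestsTotal":330121,"crawlerRuntimeMillis":219208224,"retryHistogram":[283519,37607,7147,1483,296,69]}'
POOL_JSON = '{"currentConcurrency":0,"desiredConcurrency":20,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}'


def summarize(stats: dict) -> dict:
    """Derive whole-run figures from a Statistics payload (hypothetical helper)."""
    runtime_minutes = stats["crawlerRuntimeMillis"] / 60_000
    total = stats["requestsTotal"]
    histogram = stats["retryHistogram"]
    return {
        "runtime_hours": round(runtime_minutes / 60, 1),
        # Rate over the entire run; idle gaps are averaged away here.
        "overall_per_minute": round(total / runtime_minutes, 1),
        "avg_duration_ms": round(stats["requestTotalDurationMillis"] / total),
        # Assumption: bucket i holds requests that needed i retries.
        "histogram_total": sum(histogram),
        "retried_share": round(1 - histogram[0] / total, 3),
    }


def pool_is_starved(state: dict) -> bool:
    """True when nothing is running although the pool wants to run tasks
    and no resource (memory, event loop, CPU, client) reports overload."""
    overloaded = any(
        info["isOverloaded"]
        for info in state["systemStatus"].values()
        if isinstance(info, dict)  # skip the boolean isSystemIdle flag
    )
    return (
        state["currentConcurrency"] == 0
        and state["desiredConcurrency"] > 0
        and not overloaded
    )


if __name__ == "__main__":
    print(summarize(json.loads(STATS_JSON)))
    print("starved:", pool_is_starved(json.loads(POOL_JSON)))
```

On the figures above, the histogram sums to 330121 (matching `requestsTotal`) and the whole-run rate comes out at about 90.4 requests per minute, close to the reported 90. If that reading is right, the per-minute figure is a lifetime average and will not drop visibly during a 5-minute stall. The pool state shows zero running tasks with nothing overloaded, which is consistent with the pool waiting on the queue rather than being throttled by resource limits.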

Maybe worth mentioning: my request queue (a single one) is now 2.7 GB in size.

Any suggestions as to what might be happening? Is it some sort of cleanup on the queue?

EDIT:
This really seems to be RequestQueue related. I have reached a point where the crawler won't crawl anymore, and I get this:
WARN  RequestQueue: The request queue seems to be stuck for 300s, resetting internal state. {"inProgress":[]}

Any suggestions? I guess removing already-crawled URLs from the queue would be one solution, though I'm not sure it's a good one.
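
To make the trade-off in that idea concrete: if handled entries are deleted outright, the queue can no longer tell that a URL was already crawled, so it could be enqueued again. A minimal sketch of one way around this, assuming a purely in-memory queue, is to drop the full record once a URL is fetched but keep a short fingerprint of it. The `CompactQueue` class and its methods are hypothetical and are not part of Crawlee's `RequestQueue` API.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Set


@dataclass
class CompactQueue:
    """Hypothetical queue: full URLs only while pending, fingerprints forever."""

    pending: Deque[str] = field(default_factory=deque)
    seen: Set[bytes] = field(default_factory=set)

    @staticmethod
    def _fingerprint(url: str) -> bytes:
        # 8 bytes per URL instead of the whole record. Truncating the hash
        # trades a small collision risk for a much smaller memory footprint.
        return hashlib.sha256(url.encode("utf-8")).digest()[:8]

    def add(self, url: str) -> bool:
        """Enqueue the URL unless it was ever added before."""
        key = self._fingerprint(url)
        if key in self.seen:
            return False
        self.seen.add(key)
        self.pending.append(url)
        return True

    def fetch_next(self) -> Optional[str]:
        """Pop the next pending URL; its fingerprint stays in `seen`."""
        return self.pending.popleft() if self.pending else None


if __name__ == "__main__":
    queue = CompactQueue()
    queue.add("https://example.com/a")
    queue.add("https://example.com/a")  # ignored as a duplicate
    print(queue.fetch_next(), len(queue.pending), len(queue.seen))
```

Once a URL has been fetched, only its fingerprint remains, so a later `add` of the same URL is still rejected while the stored data stays small.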

Thank you!