Best practices for a long-lived crawler & RabbitMQ
Hi guys,
I’m here to ask about best practices. I have coded a simple manager that:
- starts a Crawlee instance and initializes the queue object
- receives RabbitMQ messages and pushes them to the queue
This is set up as a keep-alive crawler (it never quits, just awaits new messages). The thing is, it needs some caching and extra work around the queue and crawler to keep it from actually closing (for some reason it would just stop, and new messages were pushed to the queue but never read by the crawler).
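Roughly, the shape of it is something like this (a simplified sketch in TypeScript; the queue name 'crawl-requests' and the { url } message shape are just placeholders):

```ts
import { CheerioCrawler } from 'crawlee';
import amqp from 'amqplib';

// keepAlive keeps the crawler polling its request queue even when it
// runs dry, so requests added later are still picked up.
const crawler = new CheerioCrawler({
    keepAlive: true,
    async requestHandler({ request, $, log }) {
        log.info(`Crawled ${request.url}: ${$('title').text()}`);
    },
});

// Don't await run() here: with keepAlive it only resolves on teardown.
const crawlerRun = crawler.run();

// Feed RabbitMQ messages into the crawler's request queue.
const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue('crawl-requests');

await channel.consume('crawl-requests', async (msg) => {
    if (!msg) return;
    const { url } = JSON.parse(msg.content.toString());
    await crawler.addRequests([url]);
    channel.ack(msg);
});

await crawlerRun;
```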
This made me wonder: maybe it should be built differently?
Is there any resource that would help me learn about best practices for building such a thing on Crawlee? The docs lack long-lived crawler examples.
I’ll add that my setup uses many different handlers for different sites; I don’t know if that’s important for this question.
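For context, the per-site handlers are wired through Crawlee's router, roughly like this (labels and URLs are placeholders):

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';

// One router, one handler per site, selected by the request's label.
const router = createCheerioRouter();

router.addHandler('site-a', async ({ request, log }) => {
    log.info(`site-a handler for ${request.url}`);
});

router.addHandler('site-b', async ({ request, log }) => {
    log.info(`site-b handler for ${request.url}`);
});

const crawler = new CheerioCrawler({ requestHandler: router });

// Incoming requests pick their handler via the label field.
await crawler.addRequests([{ url: 'https://example.com/a', label: 'site-a' }]);
```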
Bump
Hello! Thanks for your question.
Unfortunately, without the logs and the code, I won’t be able to help you properly.
Regarding keepAlive: true, yes, that is the way to prevent the crawler from stopping.
Hi @Nazar Hrozia!
I can prepare and share a sample repository, but only in two days; I’m away from a computer right now.
I was hoping there was some place where I could see how it ‘should’ be built.
You can read these docs: https://crawlee.dev/js/api/basic-crawler/interface/BasicCrawlerOptions#keepAlive or https://crawlee.dev/python/api/class/_BasicCrawlerOptions#keep_alive; that should be enough to run your crawler in non-stop mode.
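One detail those docs imply: with keepAlive: true, crawler.run() never resolves on its own, so a long-lived process also needs an explicit shutdown path. A minimal sketch, reusing the crawler, channel, and connection names from the earlier example (those names are assumptions):

```ts
// With keepAlive: true, crawler.run() only resolves after teardown(),
// so wire shutdown to a signal. Names come from the sketch above.
process.once('SIGINT', async () => {
    await channel.close();       // stop consuming RabbitMQ messages
    await connection.close();
    await crawler.teardown();    // lets the pending crawler.run() resolve
});
```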