conscious-sapphire•7mo ago
Shared external queue between multiple crawlers
Hello folks!
Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead enqueue links to an external queue service such as Redis? I'd like to run multiple crawlers on a single website, and they would need to share the same queue so they don't process duplicate links.
Thanks in advance!
4 Replies
This post was marked as solved by mesca4046.
Hello!
The request queue is managed by Crawlee, not by Cheerio or Playwright directly. What you could try is creating a custom `RequestQueue` that inherits from Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77.
Then, you could pass the custom queue to the (Cheerio/Playwright) crawler via its `requestQueue` option: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue
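To illustrate the core idea before wiring up a real `RequestQueue` subclass: what a shared Redis queue buys you is cross-crawler deduplication (Redis `SADD` returns 1 only the first time a member is added) plus a shared work list (`LPUSH`/`RPOP`). The sketch below models those semantics in-memory as a stand-in; the class name, method names, and the `Set`/array internals are illustrative, not part of Crawlee or Redis. In a real setup you would replace the `Set` and array with calls to a Redis client (e.g. `ioredis`) so all crawler processes see the same state.

```typescript
// In-memory stand-in for a Redis-backed shared queue (illustrative sketch).
// In production, `seen` would be a Redis SET (SADD for dedupe) and
// `pending` a Redis LIST (LPUSH/RPOP), shared by every crawler process.
class SharedQueue {
  private seen = new Set<string>();  // Redis equivalent: SADD queue:seen <url>
  private pending: string[] = [];    // Redis equivalent: LPUSH queue:pending <url>

  // Enqueue a URL only if no crawler has seen it before.
  // Returns true if the URL was newly added, false for a duplicate.
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false; // already claimed by some crawler
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Hand the next URL to whichever crawler asks first (FIFO).
  // Redis equivalent: RPOP queue:pending
  next(): string | undefined {
    return this.pending.shift();
  }
}

const queue = new SharedQueue();
queue.enqueue('https://example.com/a'); // true: new URL
queue.enqueue('https://example.com/a'); // false: duplicate, skipped
queue.enqueue('https://example.com/b');
console.log(queue.next()); // 'https://example.com/a'
```

Because the dedupe check and the enqueue happen against shared state, two crawlers enqueueing the same discovered link can never both process it; with Redis, the atomicity of `SADD` gives the same guarantee across processes.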
Hi @mesca4046, as we discussed on today's call, your problem is that you would ideally have multiple instances of the crawler running in parallel on the same queue, rather than just increasing the RAM of a single instance. You're looking for true parallelization, where multiple crawlers work together simultaneously to speed up the sitemap generation process, correct?