conscious-sapphire•7mo ago
Shared external queue between multiple crawlers
Hello folks!
Is there any way I can force the Cheerio/Playwright crawlers to stop using their own internal request queue and instead enqueue links to an external queue service such as Redis? I'd like to run multiple crawlers on a single website, and they would need to share the same queue so they don't process duplicate links.
Thanks in advance!
4 Replies
This post was marked as solved by mesca4046.
Hello!
The request queue is managed by Crawlee, not by Cheerio or Playwright directly. What you could try is creating a custom `RequestQueue` that inherits from Crawlee's class: https://crawlee.dev/api/core/class/RequestQueue. Here is the source code: https://github.com/apify/crawlee/blob/master/packages/core/src/storages/request_queue.ts#L77.
Then, you could pass the custom queue to the (Cheerio/Playwright) crawler via its `requestQueue` option: https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#requestQueue
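To illustrate the core idea before wiring up a real `RequestQueue` subclass: what a shared Redis queue buys you is cross-crawler deduplication (Redis `SADD` returns 1 only the first time a member is added) plus a shared work list (`LPUSH`/`RPOP`). The sketch below models those semantics in-memory as a stand-in; the class name, method names, and the `Set`/array internals are illustrative, not part of Crawlee or Redis. In a real setup you would replace the `Set` and array with calls to a Redis client (e.g. `ioredis`) so all crawler processes see the same state.

```typescript
// In-memory stand-in for a Redis-backed shared queue (illustrative sketch).
// In production, `seen` would be a Redis SET (SADD for dedupe) and
// `pending` a Redis LIST (LPUSH/RPOP), shared by every crawler process.
class SharedQueue {
  private seen = new Set<string>();  // Redis equivalent: SADD queue:seen <url>
  private pending: string[] = [];    // Redis equivalent: LPUSH queue:pending <url>

  // Enqueue a URL only if no crawler has seen it before.
  // Returns true if the URL was newly added, false for a duplicate.
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false; // already claimed by some crawler
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Hand the next URL to whichever crawler asks first (FIFO).
  // Redis equivalent: RPOP queue:pending
  next(): string | undefined {
    return this.pending.shift();
  }
}

const queue = new SharedQueue();
queue.enqueue('https://example.com/a'); // true: new URL
queue.enqueue('https://example.com/a'); // false: duplicate, skipped
queue.enqueue('https://example.com/b');
console.log(queue.next()); // 'https://example.com/a'
```

Because the dedupe check and the enqueue happen against shared state, two crawlers enqueueing the same discovered link can never both process it; with Redis, the atomicity of `SADD` gives the same guarantee across processes.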
Hi @mesca4046, as we discussed on today's call, your problem is that you would ideally have multiple instances of the crawler running in parallel on the same queue, rather than just increasing the RAM of a single instance. You're looking for true parallelization, where multiple crawlers work together simultaneously to speed up the sitemap generation process, correct?