ordinary-sapphireO

RequestQueue limitations or how to run big crawls

I'm using crawlee for recursive website crawling and have been running into a lot of problems with RequestQueue.

At first I was simply adding every URL I found on a page to the queue, and once the queue reached around 5M requests it failed with 'reached open file limit' and crashed.
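One way to avoid unbounded queue growth is to cap and deduplicate what gets enqueued per page. This is a minimal sketch of that idea, not crawlee's API; the `enqueueCapped` helper, the `queue` object, and its `addRequests` method are illustrative stand-ins:

```javascript
// Hypothetical helper: only enqueue URLs we haven't seen, and stop once a
// global cap is reached, so the on-disk queue cannot grow without bound.
function enqueueCapped(queue, urls, seen, maxQueued) {
  const fresh = urls.filter((u) => !seen.has(u));
  // How many more requests we are still allowed to enqueue overall.
  const room = Math.max(0, maxQueued - seen.size);
  const batch = fresh.slice(0, room);
  for (const u of batch) seen.add(u);
  queue.addRequests(batch.map((url) => ({ url })));
  return batch.length; // number actually enqueued
}
```

Anything over the cap would go to an overflow store (a database, a message queue) instead of the crawler's own queue.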

Then I added a RabbitMQ queue: instead of putting everything directly into the RequestQueue, I refilled it whenever it started running out of requests. That led to a second issue: after I'd crawled around 500k URLs, performance degraded badly, with enqueuing taking 100% of my CPU and saturating the SSD.
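The refill pattern described above can be sketched roughly like this: keep only a small window of requests in the crawler's queue and top it up from an external buffer (RabbitMQ, Redis, a database) when it drops below a low-water mark. The `fetchBatch` callback, the `queue` object, and its `pendingCount`/`addRequests` methods are hypothetical stand-ins, not crawlee's actual API:

```javascript
// Illustrative refill step: if the in-crawler queue is running low, pull the
// next batch of URLs from an external buffer and enqueue them.
async function refillIfLow(queue, fetchBatch, { lowWater = 1000, batchSize = 5000 } = {}) {
  const pending = await queue.pendingCount();
  if (pending >= lowWater) return 0; // still enough work queued locally
  const urls = await fetchBatch(batchSize); // e.g. consume from RabbitMQ
  await queue.addRequests(urls.map((url) => ({ url })));
  return urls.length; // number of requests refilled
}
```

A step like this would run periodically (or from the request handler) so the local queue stays small while the bulk of the frontier lives in the external store.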

Maybe I'm doing something extremely wrong, but I see no other way to do it, since the documentation doesn't really explain much. As I understand it, there are only two kinds of request sources: a plain array (to which you cannot add requests dynamically) and a RequestProvider. And only two providers: RequestQueue and RequestQueueV2 (I have no idea what the difference is; there's no information in the documentation).

Are there any compatible alternatives to RequestQueue that would use RabbitMQ, Redis, or a plain database instead of files written to the hard drive?