genetic-orange•2y ago
RequestQueue limitations or how to run big crawls
I'm using Crawlee for recursive website crawling and have been running into a lot of problems with RequestQueue.
Firstly, when I started using it, I was just adding all the URLs I found on a page to the queue, and when the queue reached around 5M requests it said 'reached open file limit' and crashed.
Then I added a RabbitMQ queue, so instead of putting all the URLs directly into the RequestQueue, I refilled it whenever it started running out of requests (roughly the pattern sketched below). That led to the second issue: once I had crawled around 500k URLs it began performing poorly, taking 100% of my CPU and hammering the SSD while enqueuing.
Maybe I'm doing something extremely wrong, but I see no other way to do it since the documentation doesn't really explain much. As I understand it, there are only two types of queues: a simple array (but you cannot add requests dynamically) and RequestProvider. And only two providers: RequestQueue and RequestQueueV2 (I have no idea what the difference is; there is no information in the documentation).
Are there any compatible alternatives to RequestQueue, so it would use RabbitMQ, Redis, or a plain database instead of files written to the hard drive?
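For reference, a simplified sketch of the refill pattern I mean, assuming amqplib and a CheerioCrawler; the queue name, thresholds, and seed URL are placeholders:
```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';
import amqp from 'amqplib';

// Placeholder names/limits - adjust to your setup.
const RABBIT_QUEUE = 'discovered-urls';
const REFILL_THRESHOLD = 1_000; // refill when fewer pending requests than this
const REFILL_BATCH = 5_000;     // how many URLs to move per refill

const connection = await amqp.connect('amqp://localhost');
const channel = await connection.createChannel();
await channel.assertQueue(RABBIT_QUEUE, { durable: true });

const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    // Keep the crawler running even if the RequestQueue momentarily drains
    // while waiting for the next refill.
    keepAlive: true,
    async requestHandler({ request, $ }) {
        // Push discovered links to RabbitMQ instead of the RequestQueue,
        // so the on-disk queue stays small.
        $('a[href]').each((_, el) => {
            const href = $(el).attr('href');
            if (!href) return;
            try {
                const absolute = new URL(href, request.loadedUrl ?? request.url).href;
                channel.sendToQueue(RABBIT_QUEUE, Buffer.from(absolute), { persistent: true });
            } catch { /* ignore malformed hrefs */ }
        });
    },
});

// Periodically top up the RequestQueue from RabbitMQ when it runs low.
// (Stop conditions and error handling omitted in this sketch.)
setInterval(async () => {
    const info = await requestQueue.getInfo();
    if ((info?.pendingRequestCount ?? 0) > REFILL_THRESHOLD) return;
    const batch: { url: string }[] = [];
    for (let i = 0; i < REFILL_BATCH; i++) {
        const msg = await channel.get(RABBIT_QUEUE, { noAck: true });
        if (!msg) break;
        batch.push({ url: msg.content.toString() });
    }
    if (batch.length > 0) await requestQueue.addRequests(batch);
}, 30_000);

await requestQueue.addRequests([{ url: 'https://example.com' }]); // seed
await crawler.run();
```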
6 Replies
No, there aren't. We could contribute one to the framework if we need it.
harsh-harlequin•2y ago
We have the same problem😫
You can try RequestList.
I remember it was more efficient for me in a case with millions of URLs - something like the sketch below.
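This is only a rough sketch (the URL source is just a placeholder):
```ts
import { CheerioCrawler, RequestList } from 'crawlee';

// RequestList keeps the requests in memory (with periodic state persistence),
// so it avoids writing one file per request. The trade-off: it is static -
// you cannot add new URLs to it while the crawl is running.
const requestList = await RequestList.open('my-list', [
    { requestsFromUrl: 'https://example.com/urls.txt' }, // remote text file, one URL per line
    // ...or pass your millions of URLs directly as an array of strings
]);

const crawler = new CheerioCrawler({
    requestList,
    async requestHandler({ request, $ }) {
        // process the page here
    },
});

await crawler.run();
```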
Hi @charliechen, @NeoNomade,
This limitation is there for a reason. If you want to overcome it, I believe you should be able to override the RequestQueue class and implement it using a different type of storage that is not as limited, or you can try-catch this specific exception and open a new named RequestQueue under the hood, basically managing more RequestQueues behind the same interface.
Maybe you should also be able to override the method for marking requests as handled so that it simply deletes them once they are handled properly - depends on how you use the RequestQueue.
I haven't tested it and I can't provide you support for this, but these would be my first steps on how to deal with it 🙂
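For example, an untested sketch of the second idea - spreading the crawl across several named RequestQueues behind one small wrapper (the names and the per-queue limit are arbitrary):
```ts
import { RequestQueue } from 'crawlee';

const MAX_PER_QUEUE = 500_000; // arbitrary cap per named queue

// Minimal wrapper that opens a new named RequestQueue whenever the current
// one reaches the cap, so no single queue grows without bound.
class RotatingRequestQueue {
    private queues: RequestQueue[] = [];
    private addedToCurrent = 0;

    private async current(): Promise<RequestQueue> {
        if (this.queues.length === 0 || this.addedToCurrent >= MAX_PER_QUEUE) {
            const rq = await RequestQueue.open(`crawl-part-${this.queues.length}`);
            this.queues.push(rq);
            this.addedToCurrent = 0;
        }
        return this.queues[this.queues.length - 1];
    }

    async addRequest(url: string) {
        const rq = await this.current();
        await rq.addRequest({ url });
        this.addedToCurrent++;
    }

    // Expose the parts so the crawler can be run against each one in turn.
    get parts(): RequestQueue[] {
        return this.queues;
    }
}
```
You would then run the crawler against each part in turn (or swap the crawler's queue once a part is drained), and the same wrapper would be the place to delete requests instead of marking them as handled.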
Hi @Pepa J.
I'm not working on this issue right now, but I'll come back with updates.
At the moment I'm working on some spiders that receive JSON configs and run based on them (which selectors to use for which actions, provided in the JSON).
It's really better to reconsider the approach or logic, because the SDK itself is limited to a 9.5 MB data size, so with millions of requests, and probably anything as output, the crawler will generate data too big to be managed by the SDK.