One or multiple instances of CheerioCrawler?

Question

Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have 2 to 3 thousand pages, while others might have just a few hundred (or even fewer).
The thing I have doubts about is: if I put all starting URLs in the same crawler instance, it might finish scraping a domain way before another one. I thought about separating domains, creating a crawler instance for each domain, just so that I can run each crawler separately and let them run their own course.
Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy?
TIA

Oleg V. · Answer

For your use case, creating a separate crawler instance for each domain could work, but it has potential downsides. Here's a breakdown to help you decide:

Downsides of Multiple Crawler Instances:

Increased Resource Usage: Each crawler instance maintains its own autoscaled pool and its own RequestQueue, and each one consumes memory. (Instances running in the same Node.js process still share a single event loop.) If you have many domains, this approach might significantly increase resource consumption.
Coordination Complexity: Managing multiple crawlers can become complicated, especially when you need to monitor or restart them individually.
Potential Limits on Concurrency: Depending on your system, running many instances in parallel might lead to bottlenecks (CPU, memory, network).
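If you do go with one crawler per domain, one way to soften the bottleneck problem is to cap how many crawlers run at the same time. Below is a minimal sketch of that idea in plain TypeScript. `runWithLimit` is a hypothetical helper, not part of Crawlee; each task is assumed to wrap one per-domain crawler run.

```typescript
// Hypothetical helper (not a Crawlee API): run async tasks with at most
// `limit` of them in flight at once. Each task is assumed to start one
// per-domain crawler and resolve when that crawler finishes.
async function runWithLimit<T>(
    tasks: Array<() => Promise<T>>,
    limit: number,
): Promise<T[]> {
    const results: T[] = new Array(tasks.length);
    let next = 0;

    // Each worker keeps claiming the next unclaimed task until none are left.
    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const index = next++;
            results[index] = await tasks[index]();
        }
    }

    // Never start more workers than there are tasks, and always at least one.
    const workerCount = Math.max(1, Math.min(limit, tasks.length));
    await Promise.all(Array.from({ length: workerCount }, () => worker()));
    return results;
}
```

With this, a list of per-domain tasks could be passed as `runWithLimit(tasks, 3)` so that no more than three crawlers are active at once; the right limit depends on your machine.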

You can instead use one crawler instance with a shared RequestQueue and domain-specific logic. Crawlee's flexibility makes this approach efficient.

Some points:

Efficiency: A single instance uses resources more effectively.
Simpler Monitoring: You have only one crawler to monitor, restart, or debug.
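Better Concurrency Management: Crawlee lets you adjust maxConcurrency and maxRequestsPerCrawl to control the overall load. Both settings apply to the whole crawl rather than to individual domains, so they cap total parallelism and total request count but do not by themselves split the load evenly across domains.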
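To make the "domain-specific logic" part concrete, here is a minimal sketch of tagging each start URL with its hostname as a label, so a single crawler can route requests to per-domain handlers. `buildRequests` and `LabeledRequest` are hypothetical names introduced for illustration; the example URLs and the usage in the trailing comment are assumptions and are not executed here.

```typescript
// Minimal request shape: a URL plus a label and userData, which is what
// Crawlee accepts when you pass request objects to crawler.run().
interface LabeledRequest {
    url: string;
    label: string;
    userData: { domain: string };
}

// Hypothetical helper: use each URL's hostname as its routing label.
function buildRequests(startUrls: string[]): LabeledRequest[] {
    return startUrls.map((url) => {
        const domain = new URL(url).hostname;
        return { url, label: domain, userData: { domain } };
    });
}

// Usage sketch (assumes the `crawlee` package is installed):
//
//   import { CheerioCrawler, createCheerioRouter } from 'crawlee';
//
//   const router = createCheerioRouter();
//   router.addHandler('shop.example.com', async ({ request, enqueueLinks }) => {
//       // Domain-specific scraping goes here. Pass the label on so that
//       // newly found links are routed to the same handler.
//       await enqueueLinks({ label: request.label });
//   });
//
//   const crawler = new CheerioCrawler({ requestHandler: router, maxConcurrency: 20 });
//   await crawler.run(buildRequests(startUrls));
```

The label is just the hostname here; if several domains share the same page structure, you could map them to one common label instead.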

Vice · Answer

How do you recommend handling domains with lots of pages? I wanna run the crawler every hour, but those domains take more than 2 hours sometimes to finish.

Apify Discord Mirror
