Apify Discord Mirror

Updated last week

One or multiple instances of CheerioCrawler?

Hi community! I'm new to Crawlee, and I'm building a script that scrapes a lot of specific, different domains. These domains each have a different number of pages to scrape; some have 2 to 3 thousand pages, while others might have just a few hundred (or even less).
The thing I have doubts about is: if I put all starting URLs in the same crawler instance, it might finish scraping a domain way before another one. I thought about separating domains, creating a crawler instance for each domain, just so that I can run each crawler separately and let them run their own course.
Is there any downside to this, e.g. will it need significantly more resources? Is there a better strategy?
TIA
2 comments
For your use case, creating a separate crawler instance for each domain could work, but it has potential downsides. Here's a breakdown to help you decide:

Downsides of Multiple Crawler Instances:
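  1. Increased Resource Usage: Each crawler instance maintains its own autoscaling and request-handling state, and usually its own RequestQueue, all of which consume memory. (Crawlers running in the same Node.js process still share one event loop, so they also compete for CPU time.) If you have many domains, this approach might significantly increase resource consumption.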
  2. Coordination Complexity: Managing multiple crawlers can become complicated, especially when you need to monitor or restart them individually.
  3. Potential Limits on Concurrency: Depending on your system, running many instances in parallel might lead to bottlenecks (CPU, memory, network).
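If you do go with one crawler per domain, one way to keep resource usage in check is to cap how many of them run at the same time. Here is a minimal sketch of that idea in plain TypeScript. The names `CrawlTask` and `runWithLimit` are made up for illustration and are not part of Crawlee; each task is assumed to be a function that creates and runs one per-domain crawler.

```typescript
// Hypothetical type: a function that starts one per-domain crawl
// and resolves when that crawl has finished.
type CrawlTask = () => Promise<void>;

// Runs all tasks, but never more than `limit` of them at once.
// (Illustrative helper, not a Crawlee API.)
async function runWithLimit(tasks: CrawlTask[], limit: number): Promise<void> {
    let next = 0; // index of the next task that has not been started yet

    // Each worker keeps pulling the next unstarted task until none are left.
    async function worker(): Promise<void> {
        while (next < tasks.length) {
            const task = tasks[next++];
            await task();
        }
    }

    // Start at least one worker, and at most `limit` (or one per task).
    const workerCount = Math.max(1, Math.min(limit, tasks.length));
    const workers = Array.from({ length: workerCount }, () => worker());
    await Promise.all(workers);
}
```

With this, a small domain finishing early just frees its slot for the next domain, instead of every domain's crawler being alive at once. Note that if any task rejects, `runWithLimit` rejects too, so you may want a try/catch inside each task.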
Alternatively, you can use one crawler instance with a shared RequestQueue and handle each domain with its own domain-specific logic. Crawlee is flexible enough to make this approach efficient.

Advantages of a Single Crawler Instance:
  1. Efficiency: A single instance uses resources more effectively.
  2. Simpler Monitoring: You have only one crawler to monitor, restart, or debug.
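  3. Better Concurrency Management: Crawlee lets you set maxConcurrency to cap how many requests run in parallel and maxRequestsPerCrawl to cap the total number of requests in a run. Keep in mind that both limits apply to the crawler as a whole, not per domain, so balancing the load across domains is still up to how you enqueue and route requests.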
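To make the single-crawler idea a bit more concrete, here is a rough sketch of how the start URLs could be prepared. It tags each URL with its hostname as a `label` and interleaves the domains round-robin, so that no domain's start URLs all sit at the back of the list. The helper name `buildLabeledRequests` and the `LabeledRequest` type are my own, not Crawlee's.

```typescript
// Hypothetical shape: a start URL plus a label naming its domain.
interface LabeledRequest {
    url: string;
    label: string;
}

// Groups start URLs by hostname, labels each one with that hostname,
// and returns them interleaved round-robin across domains.
// (Illustrative helper, not a Crawlee API.)
function buildLabeledRequests(startUrls: string[]): LabeledRequest[] {
    const byHost = new Map<string, LabeledRequest[]>();
    for (const url of startUrls) {
        const host = new URL(url).hostname;
        const bucket = byHost.get(host) ?? [];
        bucket.push({ url, label: host });
        byHost.set(host, bucket);
    }

    // Take the i-th URL of every domain before moving on to the (i+1)-th.
    const buckets = Array.from(byHost.values());
    const result: LabeledRequest[] = [];
    for (let i = 0; result.length < startUrls.length; i++) {
        for (const bucket of buckets) {
            if (i < bucket.length) result.push(bucket[i]);
        }
    }
    return result;
}
```

The resulting objects should be usable as the request list you pass to `crawler.run(...)`, with one handler per domain registered on a router via `router.addHandler(label, ...)`. As far as I know the label is not carried over to discovered links automatically, so you would pass it again when calling `enqueueLinks`. The interleaving only affects the initial ordering; once links are discovered, bigger domains will still dominate the queue.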
How do you recommend handling domains with lots of pages? I wanna run the crawler every hour, but those domains take more than 2 hours sometimes to finish.