exotic-emerald•3y ago
running multiple crawler instances at once
I'm scraping entire sites and running multiple crawlers at once, one per site; I'm looking to scrape 50+ sites, and I'm kicking off the site scrapes from a start file that uses an event emitter to run each one.
Should I run them all at once in one terminal, or run each one in a separate terminal with its own script?
Also, is running multiple crawler instances at once a maintainable approach?
One final problem I am running into: when I run the start file with multiple crawlers, I get a request queue error. When I run it again it sometimes works, but the error pops up inconsistently:
9 Replies
the default requestQueue is shared by Crawlee instances, so to get a separate queue per crawler you need to name it
but the recommended approach is to not mix crawlers without reason, i.e. consider crawling all sites with a single crawler, or (better) run one actor per site with a single crawler.
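To illustrate the single-crawler approach suggested above, here is a minimal sketch (the site URLs are placeholders; swap in your own list of 50+ start URLs). One crawler and one queue handle every site:

```javascript
import { CheerioCrawler } from 'crawlee';

// One crawler for all sites: the shared request queue processes them all.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

// Hypothetical start URLs; list every site you want to scrape here.
await crawler.run([
    'https://example.com',
    'https://example.org',
]);
```

Per-site settings (like which links to enqueue) can then be decided inside the `requestHandler` based on `request.url`.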
exotic-emeraldOP•3y ago
how do you separate requestQueues for each instance
also, is there a limit to how many instances can be run? Because when I run more than 5-6 crawlers, I get this error:
```
Error: This crawler instance is already running, you can add more requests to it via `crawler.addRequests()`.
    at CheerioCrawler.run
```
rival-black•3y ago
@harish Use a named requestQueue.
correct-apricot•3y ago
@harish like this:
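(The original code snippet was not preserved here; a minimal sketch of what a named queue per crawler looks like, with placeholder queue names and URLs:)

```javascript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open a named queue so this crawler doesn't share the default one.
const queueA = await RequestQueue.open('site-a');
const crawlerA = new CheerioCrawler({
    requestQueue: queueA,
    async requestHandler({ request }) {
        console.log(`site-a: ${request.url}`);
    },
});

const queueB = await RequestQueue.open('site-b');
const crawlerB = new CheerioCrawler({
    requestQueue: queueB,
    async requestHandler({ request }) {
        console.log(`site-b: ${request.url}`);
    },
});

// Each crawler now drains its own queue, so they don't clash.
await Promise.all([
    crawlerA.run(['https://example.com']),
    crawlerB.run(['https://example.org']),
]);
```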
vicious-gold•2y ago
Is there a reason for this? For not mixing crawlers?
like-gold•2y ago
hey, is there any way to run multiple crawlers without Actors / the Apify SDK? I know the recommended approach is to use one single crawler per instance,
but the issue is that it screws up the logging.
I need them to run sequentially for proper debugging.
Is there any way I can accomplish this?
I tried to run them like this
The issue with this approach is that the first crawler is skipped entirely and crawler 2 exits on the first request,
saying the browser closed unexpectedly.
Can you provide some reproduction / full code of your implementation?
I guess, it should be fine if you use something like this:
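(The suggested snippet was not preserved in this thread; a minimal sketch of running crawlers sequentially, assuming a placeholder site list, each run awaited before the next starts so logs don't interleave:)

```javascript
import { CheerioCrawler } from 'crawlee';

// Hypothetical site list; replace with your own.
const sites = ['https://example.com', 'https://example.org'];

for (const site of sites) {
    // A fresh crawler instance per site avoids the
    // "This crawler instance is already running" error.
    const crawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            console.log(`[${site}] ${request.url}: ${$('title').text()}`);
        },
    });
    // Awaiting each run makes the crawls strictly sequential.
    await crawler.run([site]);
}
```

If the crawlers should not share state, give each one a named `RequestQueue` as shown earlier in the thread.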
like-gold•2y ago
hey, thanks I'll try this out to fix the issue
also, if I'm not using Apify, can I still use Actor helpers like openRequestQueue?
this does work for separating out parameters such as maxRequestsPerMinute
and maxRequestsPerCrawl,
but is Actor the only way I can open this queue?
Because it gives me a warning that the Actor is not initialized.
Or should I initialize one even though I won't be using it?
You don't need to use the Actor class. There is the same functionality in Crawlee:
https://crawlee.dev/api/next/core/class/RequestQueue#open
https://crawlee.dev/docs/next/deployment/apify-platform#using-platform-storage-in-a-local-actor
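A short sketch of the plain-Crawlee route those docs describe, with a hypothetical queue name and URL, and no `Actor.init()` anywhere:

```javascript
import { CheerioCrawler, RequestQueue } from 'crawlee';

// crawlee's own RequestQueue.open works standalone,
// so no "Actor is not initialized" warning.
const queue = await RequestQueue.open('my-named-queue');
await queue.addRequest({ url: 'https://example.com' });

const crawler = new CheerioCrawler({
    requestQueue: queue,
    maxRequestsPerCrawl: 100,   // per-crawler limits still apply
    maxRequestsPerMinute: 60,
    async requestHandler({ request }) {
        console.log(request.url);
    },
});

await crawler.run();
```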