exotic-emerald · 3y ago

running multiple crawler instances at once

I'm scraping entire sites, running multiple crawlers at once (one per site), and I'm looking to scrape 50+ sites. I run the site scrapes from a start file that uses an event emitter to kick off each site, specifically the
await crawler.run(startUrls)
line. Should I run them all at once in one terminal, or run each one in a separate terminal with its own script? Also, is this a maintainable approach to running multiple crawler instances at once?
One final problem I'm running into: when I run the start file with multiple crawlers, I get this request queue error. When I run it again it sometimes works, but the error pops up inconsistently:
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
[Error: ENOENT: no such file or directory, open 'C:\Users\haris\OneDrive\Documents\GitHub\periodicScraper01\pscrape\storage\request_queues\default\1Rk4szfVGlTLik4.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\Users\\haris\\OneDrive\\Documents\\GitHub\\periodicScraper01\\pscrape\\storage\\request_queues\\default\\1Rk4szfVGlTLik4.json'
}
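(For reference, a minimal sketch of the start-file pattern described above; the event name, handler, and URLs are hypothetical:)

import { EventEmitter } from 'node:events';
import { CheerioCrawler } from 'crawlee';

const emitter = new EventEmitter();

// One crawler per site, all started from the same process. Note
// that without named queues, every crawler created here shares
// the default queue under storage/request_queues/default.
emitter.on('scrape-site', async (startUrls) => {
    const crawler = new CheerioCrawler({
        requestHandler: async ({ request }) => {
            console.log(`Handling ${request.url}`); // placeholder
        },
    });
    await crawler.run(startUrls);
});

emitter.emit('scrape-site', ['https://example.com']);
emitter.emit('scrape-site', ['https://example.org']);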
Alexey Udovydchenko
The default requestQueue is shared by Crawlee instances, so to get a separate queue per crawler you need to name it. The recommended approach, though, is not to mix crawlers without reason: consider crawling all sites with a single crawler, or (better) running one actor per site, each with a single crawler.
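A minimal sketch of the named-queue approach with plain Crawlee (the queue/site names and handler are hypothetical; RequestQueue.open is the non-Actor equivalent of Actor.openRequestQueue):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Each crawler gets its own named queue, so nothing collides
// under storage/request_queues/default.
const siteAQueue = await RequestQueue.open('site-a');

const siteACrawler = new CheerioCrawler({
    requestQueue: siteAQueue,
    requestHandler: async ({ request }) => {
        console.log(`site-a: ${request.url}`); // placeholder handler
    },
});

await siteACrawler.run(['https://site-a.example.com']);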
exotic-emerald (OP) · 3y ago
How do you separate requestQueues for each instance? Also, is there a limit to how many instances can be run? When I run more than 5-6 crawlers, I get this error: `Error: This crawler instance is already running, you can add more requests to it via crawler.addRequests(). at CheerioCrawler.run`
rival-black · 3y ago
@harish Use a named requestQueue.
correct-apricot · 3y ago
@harish like this:
// Open the 'my-queue' request queue
const queueWithName = await Actor.openRequestQueue('my-queue');
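Opening the queue by itself isn't enough; here is a short sketch of wiring it into a crawler via the requestQueue option (the handler is a placeholder):

// Pass the named queue to the crawler so it doesn't fall
// back to the shared default queue.
const crawler = new CheerioCrawler({
    requestQueue: queueWithName,
    requestHandler: async ({ request }) => {
        console.log(request.url); // placeholder handler
    },
});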
vicious-gold · 2y ago
Is there a reason for this, i.e. for not mixing crawlers?
like-gold · 2y ago
Hey, is there any way to run multiple crawlers without Actors / the Apify SDK? I know the recommended approach is a single crawler per instance, but the issue is that it breaks the logging: I need them to run sequentially for proper debugging. Is there any way I can accomplish this? I tried to run them like this:
try {
    await crawlerOne.run(urls);
} catch (err) {
    log.error(err);
}

try {
    await crawlerTwo.run(urls);
} catch (err) {
    log.error(err);
}
The issue with this approach is that the first crawler is skipped entirely, and crawler two exits on the first request saying the browser closed unexpectedly.
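One thing worth trying (a minimal, hedged sketch; the factory name and URL batches are hypothetical, and whether it fixes this exact failure depends on the root cause) is constructing a fresh crawler for each sequential run rather than calling run() twice on the same instance:

import { PuppeteerCrawler } from 'crawlee';

// Hypothetical URL batches, one per sequential run.
const batchOne = ['https://example.com'];
const batchTwo = ['https://example.org'];

// Factory: a fresh crawler (and browser pool) per run, so a
// finished, torn-down instance is never run() a second time.
const makeCrawler = () =>
    new PuppeteerCrawler({
        requestHandler: async ({ request, log }) => {
            log.info(`Handling ${request.url}`); // placeholder handler
        },
    });

for (const urls of [batchOne, batchTwo]) {
    try {
        await makeCrawler().run(urls);
    } catch (err) {
        console.error(err);
    }
}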
Oleg V. · 2y ago
Can you provide a reproduction / the full code of your implementation? I guess it should be fine if you use something like this:
import { Actor } from 'apify';
import {
    CheerioCrawler,
    PuppeteerCrawler,
    createCheerioRouter,
    createPuppeteerRouter,
} from 'crawlee';

const puppeteerRequestQueue = await Actor.openRequestQueue('for-puppeteer');

// Build the routers first so handlers can be registered on them
// before they are handed to the crawlers.
const cheerioRouter = createCheerioRouter();
const puppeteerRouter = createPuppeteerRouter();

const cheerioCrawler = new CheerioCrawler({
    requestHandler: cheerioRouter,
});

const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: puppeteerRequestQueue,
    requestHandler: puppeteerRouter,
});

cheerioRouter.addDefaultHandler(async ({ $, crawler }) => {
    // Add the request to the CheerioCrawler request queue (default)
    if (SOMETHING) await crawler.addRequests([REQUEST]);
    // Our check tells us that this page must be handled with
    // Puppeteer, so we save the request in the puppeteerRequestQueue
    // to be handled after CheerioCrawler has finished
    else await puppeteerRequestQueue.addRequest(REQUEST);
});

// ... Puppeteer handler

// Runs CheerioCrawler, which may enqueue some links for
// the PuppeteerCrawler
await cheerioCrawler.run();

// If any requests were added to puppeteerRequestQueue, they'll be
// handled by Puppeteer now
await puppeteerCrawler.run();
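Note the sequential awaits at the end: the CheerioCrawler fully drains the default queue before the PuppeteerCrawler starts on its own named queue, so the two crawlers never run at the same time or compete for the same storage.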
like-gold · 2y ago
Hey, thanks, I'll try this out to fix the issue. Also, if I'm not using Apify, can I still use Actor helpers like openRequestQueue? This does work for separating out parameters such as maxRequestsPerMinute and maxRequestsPerCrawl, but is Actor the only way I can open this queue? It gives me a warning that the actor is not initialized. Or should I initialize one even though I won't be using it?
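For what it's worth, plain Crawlee exposes named queues without the Apify SDK via RequestQueue.open, so no Actor initialization should be needed; a minimal sketch (the limits and URL are hypothetical):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named queue opened through plain Crawlee, no Actor.init() involved.
const namedQueue = await RequestQueue.open('my-queue');

const crawler = new CheerioCrawler({
    requestQueue: namedQueue,
    maxRequestsPerMinute: 60,  // hypothetical per-crawler limits
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request }) => {
        console.log(request.url); // placeholder handler
    },
});

await crawler.run(['https://example.com']);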
