exotic-emerald · 3y ago

running multiple crawler instances at once

I'm scraping entire sites, running multiple crawlers at once (one per site), and I'm looking to scrape 50+ sites. I run the site scrapes from a start file that uses an event emitter to kick off each site, specifically the
await crawler.run(startUrls)
line. Should I run them all at once in one terminal, or run each one in a separate terminal with its own script? Also, is this a maintainable approach to running multiple crawler instances at once?
One final problem I'm running into: when I run the start file with multiple crawlers, I get this request queue error. When I run it again it sometimes works, but the error pops up inconsistently:
ERROR CheerioCrawler:AutoscaledPool: isTaskReadyFunction failed
[Error: ENOENT: no such file or directory, open 'C:\Users\haris\OneDrive\Documents\GitHub\periodicScraper01\pscrape\storage\request_queues\default\1Rk4szfVGlTLik4.json'] {
  errno: -4058,
  code: 'ENOENT',
  syscall: 'open',
  path: 'C:\\Users\\haris\\OneDrive\\Documents\\GitHub\\periodicScraper01\\pscrape\\storage\\request_queues\\default\\1Rk4szfVGlTLik4.json'
}
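(For reference, a minimal sketch of the start-file pattern described above; the event name, handler, and URLs are hypothetical:)

import { EventEmitter } from 'node:events';
import { CheerioCrawler } from 'crawlee';

const emitter = new EventEmitter();

// One crawler per site, all started from the same process. Note
// that without named queues, every crawler created here shares
// the default queue under storage/request_queues/default.
emitter.on('scrape-site', async (startUrls) => {
    const crawler = new CheerioCrawler({
        requestHandler: async ({ request }) => {
            console.log(`Handling ${request.url}`); // placeholder
        },
    });
    await crawler.run(startUrls);
});

emitter.emit('scrape-site', ['https://example.com']);
emitter.emit('scrape-site', ['https://example.org']);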
Alexey Udovydchenko
The default requestQueue is shared by Crawlee instances, so to get a separate queue per crawler you need to name it. The recommended approach, though, is not to mix crawlers without reason: consider crawling all sites with a single crawler, or (better) running one actor per site, each with a single crawler.
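A minimal sketch of the named-queue approach with plain Crawlee (the queue/site names and handler are hypothetical; RequestQueue.open is the non-Actor equivalent of Actor.openRequestQueue):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Each crawler gets its own named queue, so nothing collides
// under storage/request_queues/default.
const siteAQueue = await RequestQueue.open('site-a');

const siteACrawler = new CheerioCrawler({
    requestQueue: siteAQueue,
    requestHandler: async ({ request }) => {
        console.log(`site-a: ${request.url}`); // placeholder handler
    },
});

await siteACrawler.run(['https://site-a.example.com']);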
exotic-emerald (OP) · 3y ago
How do you separate requestQueues for each instance? Also, is there a limit to how many instances can be run? When I run more than 5-6 crawlers, I get this error: `Error: This crawler instance is already running, you can add more requests to it via crawler.addRequests(). at CheerioCrawler.run`
rival-black · 3y ago
@harish Use a named requestQueue.
correct-apricot · 3y ago
@harish like this:
// Open the 'my-queue' request queue
const queueWithName = await Actor.openRequestQueue('my-queue');
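Opening the queue by itself isn't enough; here is a short sketch of wiring it into a crawler via the requestQueue option (the handler is a placeholder):

// Pass the named queue to the crawler so it doesn't fall
// back to the shared default queue.
const crawler = new CheerioCrawler({
    requestQueue: queueWithName,
    requestHandler: async ({ request }) => {
        console.log(request.url); // placeholder handler
    },
});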
vicious-gold · 2y ago
Is there a reason for this, i.e. for not mixing crawlers?
like-gold · 2y ago
Hey, is there any way to run multiple crawlers without Actors / the Apify SDK? I know the recommended approach is a single crawler per instance, but the issue is that it breaks the logging: I need them to run sequentially for proper debugging. Is there any way I can accomplish this? I tried to run them like this:
try {
    await crawlerOne.run(urls);
} catch (err) {
    log.error(err);
}

try {
    await crawlerTwo.run(urls);
} catch (err) {
    log.error(err);
}
The issue with this approach is that the first crawler is skipped entirely, and crawler two exits on the first request saying the browser closed unexpectedly.
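One thing worth trying (a minimal, hedged sketch; the factory name and URL batches are hypothetical, and whether it fixes this exact failure depends on the root cause) is constructing a fresh crawler for each sequential run rather than calling run() twice on the same instance:

import { PuppeteerCrawler } from 'crawlee';

// Hypothetical URL batches, one per sequential run.
const batchOne = ['https://example.com'];
const batchTwo = ['https://example.org'];

// Factory: a fresh crawler (and browser pool) per run, so a
// finished, torn-down instance is never run() a second time.
const makeCrawler = () =>
    new PuppeteerCrawler({
        requestHandler: async ({ request, log }) => {
            log.info(`Handling ${request.url}`); // placeholder handler
        },
    });

for (const urls of [batchOne, batchTwo]) {
    try {
        await makeCrawler().run(urls);
    } catch (err) {
        console.error(err);
    }
}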
Oleg V. · 2y ago
Can you provide a reproduction / the full code of your implementation? I guess it should be fine if you use something like this:
import { Actor } from 'apify';
import {
    CheerioCrawler,
    PuppeteerCrawler,
    createCheerioRouter,
    createPuppeteerRouter,
} from 'crawlee';

const puppeteerRequestQueue = await Actor.openRequestQueue('for-puppeteer');

// Build the routers first so handlers can be registered on them
// before they are handed to the crawlers.
const cheerioRouter = createCheerioRouter();
const puppeteerRouter = createPuppeteerRouter();

const cheerioCrawler = new CheerioCrawler({
    requestHandler: cheerioRouter,
});

const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: puppeteerRequestQueue,
    requestHandler: puppeteerRouter,
});

cheerioRouter.addDefaultHandler(async ({ $, crawler }) => {
    // Add the request to the CheerioCrawler request queue (default)
    if (SOMETHING) await crawler.addRequests([REQUEST]);
    // Our check tells us that this page must be handled with
    // Puppeteer, so we save the request in the puppeteerRequestQueue
    // to be handled after CheerioCrawler has finished
    else await puppeteerRequestQueue.addRequest(REQUEST);
});

// ... Puppeteer handler

// Runs CheerioCrawler, which may enqueue some links for
// the PuppeteerCrawler
await cheerioCrawler.run();

// If any requests were added to puppeteerRequestQueue, they'll be
// handled by Puppeteer now
await puppeteerCrawler.run();
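Note the sequential awaits at the end: the CheerioCrawler fully drains the default queue before the PuppeteerCrawler starts on its own named queue, so the two crawlers never run at the same time or compete for the same storage.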
like-gold · 2y ago
Hey, thanks, I'll try this out to fix the issue. Also, if I'm not using Apify, can I still use Actor helpers like openRequestQueue? This does work for separating out parameters such as maxRequestsPerMinute and maxRequestsPerCrawl, but is Actor the only way I can open this queue? It gives me a warning that the actor is not initialized. Or should I initialize one even though I won't be using it?
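For what it's worth, plain Crawlee exposes named queues without the Apify SDK via RequestQueue.open, so no Actor initialization should be needed; a minimal sketch (the limits and URL are hypothetical):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named queue opened through plain Crawlee, no Actor.init() involved.
const namedQueue = await RequestQueue.open('my-queue');

const crawler = new CheerioCrawler({
    requestQueue: namedQueue,
    maxRequestsPerMinute: 60,  // hypothetical per-crawler limits
    maxRequestsPerCrawl: 100,
    requestHandler: async ({ request }) => {
        console.log(request.url); // placeholder handler
    },
});

await crawler.run(['https://example.com']);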
