like-gold•2y ago
Abort Crawler on Exception
Hello everyone! I'm trying to implement a scraper based on Puppeteer. My logic is as simple as this:
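Roughly something like the following (a minimal sketch with placeholder URLs):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // ... scrape the page ...
    },
});

// Scrape url_1 first; url_2 should only be scraped if url_1 succeeded.
await crawler.run(['https://example.com/url_1']);
await crawler.run(['https://example.com/url_2']);
```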
What I'm trying to achieve is: when any exception is thrown for url_1 (e.g. an HTTP error code or any other exception inside the request handler), an exception should also be thrown after the first line, so that url_2 won't be scraped.
However, it looks like when there's an exception for url_1, Crawlee handles it gracefully and continues to execute the second line. I searched the docs, GitHub issues, and this channel for a while but didn't have any luck.
Is there any configuration I can set to achieve this?
foreign-sapphire•2y ago
crawler.teardown() is what you are looking for
you can access the crawler instance in your route handler, but keep in mind that URLs which are already being scraped will still be scraped
i.e. the ones already in the event loop. If you want to shut even those down in the middle of scraping, you need to brute-force it with process.exit(1)
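Roughly like this (a sketch assuming a PuppeteerCrawler; the try/catch placement is just illustrative):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        try {
            // ... scraping logic for the current request ...
        } catch (err) {
            // Stop the crawler: no new requests will be scheduled.
            // Requests already in flight may still finish; use process.exit(1) to hard-kill those too.
            await crawler.teardown();
            throw err;
        }
    },
});

await crawler.run(['https://example.com/url_1', 'https://example.com/url_2']);
```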
like-goldOP•2y ago
Thanks!! @AltairSama2
what about HTTP codes like 429? I'd also like to abort the crawler in that case, but it looks like I can't catch it in the route handler
foreign-sapphire•2y ago
if you can check the HTTP code, then you can just use an if/else to catch it
I'm not sure about Puppeteer, but Playwright has an option where you can check the calls in the network tab.
You can get the status with response.status(), but you might need to override the default blocked status codes: https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
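For example, something like this (a sketch assuming a PuppeteerCrawler; by default the session pool treats 429 as a blocked status code and retries it before the handler ever sees it, so the list is cleared here):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Let 429 responses reach the request handler instead of being handled by the session pool.
    sessionPoolOptions: { blockedStatusCodes: [] },
    async requestHandler({ response, page }) {
        // `response` is the browser's response for the initial navigation.
        if (response?.status() === 429) {
            // Rate limited: abort the whole crawl.
            await crawler.teardown();
            return;
        }
        // ... normal scraping logic ...
    },
});

await crawler.run(['https://example.com/url_1']);
```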