like-gold•2y ago
Abort Crawler on Exception
Hello everyone! I'm trying to implement a scraper based on Puppeteer. My logic is as simple as this:
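Roughly something like the following (a minimal sketch with placeholder URLs):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // ... scrape the page ...
    },
});

// Scrape url_1 first; url_2 should only be scraped if url_1 succeeded.
await crawler.run(['https://example.com/url_1']);
await crawler.run(['https://example.com/url_2']);
```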
What I'm trying to achieve is: when any exception is thrown for url_1 (e.g. an HTTP error code or any other exception inside the request handler), an exception should also be thrown after the first line, so that url_2 won't be scraped.
However, it looks like when there's an exception for url_1, Crawlee handles it gracefully and continues to execute the second line. I searched the docs, GitHub issues, and this channel for a while but didn't have any luck.
Is there any configuration I can set to achieve this?
foreign-sapphire•2y ago
crawler.teardown() is what you are looking for
you can access the crawler instance in your route handler, but keep in mind that URLs which are already being scraped will still be scraped
i.e. the ones already in the event loop. If you want to shut even those down in the middle of scraping, you need to brute-force it with process.exit(1)
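Roughly like this (a sketch assuming a PuppeteerCrawler; the try/catch placement is just illustrative):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        try {
            // ... scraping logic for the current request ...
        } catch (err) {
            // Stop the crawler: no new requests will be scheduled.
            // Requests already in flight may still finish; use process.exit(1) to hard-kill those too.
            await crawler.teardown();
            throw err;
        }
    },
});

await crawler.run(['https://example.com/url_1', 'https://example.com/url_2']);
```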
like-goldOP•2y ago
Thanks!! @AltairSama2
what about HTTP codes like 429? I'd also like to abort the crawler in that case, but it looks like I can't catch it in the route handler
foreign-sapphire•2y ago
if you can check the HTTP code, then you can just use an if/else to catch it
I'm not sure about Puppeteer, but Playwright has an option where you can check the calls in the network tab.
You can get the status with response.status(), but you might need to override the default blocked status codes: https://crawlee.dev/api/core/interface/SessionPoolOptions#blockedStatusCodes
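For example, something like this (a sketch assuming a PuppeteerCrawler; by default the session pool treats 429 as a blocked status code and retries it before the handler ever sees it, so the list is cleared here):
```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Let 429 responses reach the request handler instead of being handled by the session pool.
    sessionPoolOptions: { blockedStatusCodes: [] },
    async requestHandler({ response, page }) {
        // `response` is the browser's response for the initial navigation.
        if (response?.status() === 429) {
            // Rate limited: abort the whole crawl.
            await crawler.teardown();
            return;
        }
        // ... normal scraping logic ...
    },
});

await crawler.run(['https://example.com/url_1']);
```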