multiple-amethyst•2y ago
How to close the crawler from a RequestHandler?
Hey folks, I want to stop the scraper/crawler if I hit some arbitrary condition. Is there a way I can do so from inside the RequestHandler? The closest function I found is
crawler.teardown()
but it can't be executed inside a handler
Instead of
await crawler.run()
just crawler.run()
and then teardown
when your condition or event is handled by your own code outside of the crawler
multiple-amethystOP•2y ago
issue is the conditions are triggered in specific routes of a site
e.g. we have a resume function in our Selenium scrapers which checks for duplicates, and if some number n appear in a row we stop scraping, assuming the rest of the data will have been scraped already too
plus we have a couple of other such conditions, it would be helpful if something like this was present inside the request handlers
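The consecutive-duplicate stop condition described here can be sketched independently of any crawler. This is plain JavaScript; `makeDuplicateGate` and the threshold are illustrative names, not Crawlee API:

```javascript
// Tracks IDs seen so far; returns true once `maxConsecutive` already-seen
// items arrive in a row, signalling the caller to stop the crawl.
function makeDuplicateGate(maxConsecutive) {
  const seen = new Set();
  let streak = 0;
  return function sawItem(id) {
    if (seen.has(id)) {
      streak += 1;
    } else {
      seen.add(id);
      streak = 0; // a fresh item resets the run of duplicates
    }
    return streak >= maxConsecutive;
  };
}
```

Inside a request handler you would call the gate once per scraped item and, when it returns true, trigger whatever stop mechanism you settle on.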
similar question, how do I stop the request handler flow if some condition is satisfied? e.g. if some element is not present, stop the function right there. Will a simple
return;
suffice? since we anyway don't return anything and just enqueue links.
can someone help with this? could really use this functionality to avoid duplicate/redundant scrapes etc
or is there a way we can empty out the request queue? I think this might work since the crawler will stop as soon as it sees there's nothing to scrape
For reference, answered here: https://discord.com/channels/801163717915574323/1075487274424352888/1179490876850974810
If you want to stop the request handler itself, you need to have a condition at that point, JS doesn't allow cancelling functions/promises from the outside.
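A minimal sketch of that advice: check the condition at the top of the handler and `return;` early, optionally flipping a shared flag so later requests bail out too. The `ctx` shape and `stopRequested` flag here are illustrative mocks, not the real Crawlee crawling context; in a real handler you would also call `crawler.teardown()` where the comment indicates:

```javascript
let stopRequested = false; // shared flag checked at the top of every call

// Stand-in for a Crawlee request handler; `ctx` is a mock context.
async function requestHandler(ctx) {
  // The condition has to live *inside* the handler -- JS cannot cancel
  // a running function/promise from the outside.
  if (stopRequested || !ctx.element) {
    return; // a plain `return;` is enough; the handler just ends early
  }
  ctx.enqueued.push(ctx.element.href);
  if (ctx.isDuplicate) {
    stopRequested = true; // in Crawlee you would also call crawler.teardown() here
  }
}
```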
multiple-amethystOP•2y ago
yeah, that should not be an issue since we can control the number of requests via maxConcurrency
thanks!
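For completeness, the "don't await run(), tear down from outside" pattern from the first reply looks roughly like this. `run()` and `teardown()` do exist on Crawlee crawlers, but `ToyCrawler` below is a self-contained stand-in just to show the control flow, not the real implementation:

```javascript
// Toy stand-in for a Crawlee crawler: run() processes a queue and
// teardown() makes the run loop wind down at the next opportunity.
class ToyCrawler {
  constructor(requests, handler) {
    this.requests = requests;
    this.handler = handler;
    this.stopped = false;
  }
  async run() {
    for (const req of this.requests) {
      if (this.stopped) break; // teardown() was called from outside
      await this.handler(req);
    }
  }
  async teardown() {
    this.stopped = true;
  }
}

const handled = [];
const crawler = new ToyCrawler(['a', 'b', 'c', 'd'], async (req) => {
  handled.push(req);
});

async function main() {
  const running = crawler.run(); // note: not awaited yet
  // ...somewhere in your own code, your condition fires:
  await crawler.teardown();
  await running; // now wait for the crawler to finish winding down
  return handled;
}
```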