Apify Discord Mirror

Updated 5 months ago

How to close the crawler from a RequestHandler?

At a glance

The community members are discussing how to stop a scraper/crawler when certain conditions are met. The original poster wants to know if there is a way to stop the scraper from inside the RequestHandler, as the crawler.teardown() function cannot be executed from within the handler.

The comments suggest that instead of using await crawler.run(), the community member should use crawler.run() and then call teardown() when the condition or event is handled in their own code outside of the crawler. However, the issue is that the conditions are triggered in specific routes of a site, and the community members would like to have this functionality within the request handlers.

The community members also discuss how to stop the request handler flow if a certain condition is satisfied, such as if an element is not present. They wonder if a simple return; would suffice, as they don't return anything and just enqueue links.

The community members are looking for a way to avoid duplicate or redundant scrapes, and they wonder if there is a way to empty out the request queue, as this might stop the crawler when there is nothing left to scrape.

There is a reference to a previous discussion on Discord, but no explicit resolution is stated in the thread.

Useful resources
Hey folks, I want to stop the scraper/crawler if I hit some arbitrary condition. Is there a way that I can do so from inside the RequestHandler? The closest function that I found is crawler.teardown(), but it can't be executed inside a handler.
12 comments
Instead of await crawler.run(), use just crawler.run(), and then call teardown() when your condition or event is handled by your own code outside of the crawler
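The suggestion above can be sketched as follows. The real crawler.run() and crawler.teardown() come from Crawlee; the stand-in crawler here is purely illustrative, so the control flow ("run without await, tear down from outside") is runnable on its own:

```javascript
// Minimal stand-in for a Crawlee-style crawler, used only to show the
// control flow of the suggested pattern. Not the real Crawlee API.
function makeCrawler() {
  let resolveRun;
  let stopped = false;
  return {
    isStopped: () => stopped,
    // run() returns a promise that resolves once teardown() is called
    run() {
      return new Promise((resolve) => { resolveRun = resolve; });
    },
    async teardown() {
      stopped = true;
      if (resolveRun) resolveRun();
    },
  };
}

async function main() {
  const crawler = makeCrawler();
  const running = crawler.run(); // note: no await here
  // ...elsewhere in your own code, the stop condition or event fires:
  await crawler.teardown();      // stop the crawler
  await running;                 // wait for the run to wind down
  return crawler.isStopped();
}
```

The key point is that because run() was not awaited, the surrounding code keeps executing and can decide, on its own terms, when to call teardown().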
The issue is that the conditions are triggered in specific routes of a site
For example, we have a resume function in our Selenium scrapers which checks for duplicates, and if some number n appear in a row we stop scraping, assuming the rest of the data will have been scraped already too
Plus we have a couple of other such conditions; it would be helpful if something like this were available inside the request handlers
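The "n duplicates in a row" resume check described above can be sketched as a small counter that the request handler consults before enqueueing more links. DuplicateStreak and its method names are illustrative, not part of any scraper API:

```javascript
// Tracks consecutive duplicate items; signals once `limit` duplicates
// have been seen in a row (the stop condition described above).
class DuplicateStreak {
  constructor(limit) {
    this.limit = limit;
    this.seen = new Set();
    this.streak = 0;
  }
  // Record one scraped item id. Returns true once `limit` consecutive
  // duplicates have been observed.
  record(id) {
    if (this.seen.has(id)) {
      this.streak += 1;
    } else {
      this.seen.add(id);
      this.streak = 0;
    }
    return this.streak >= this.limit;
  }
}
```

Inside a handler, something like `if (streak.record(itemId)) { /* skip enqueueing further links */ }` would let the queue drain naturally, which is one way to wind the crawler down without cancelling it from outside.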
Similar question: how do I stop the request handler flow if some condition is satisfied? E.g. if some element is not present, stop the function right there. Will a simple return; suffice? Since we anyway don't return anything and just enqueue links.
Can someone help with this? Could really use this functionality to avoid duplicate/redundant scrapes etc.
Or is there a way we can empty out the request queue? I think this might work, since the crawler will stop as soon as it sees there's nothing left to scrape
If you want to stop the request handler itself, you need to have a condition at that point, JS doesn't allow cancelling functions/promises from the outside.
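The early-return approach described above can be sketched like this. The handler shape loosely mimics a Crawlee requestHandler, but pageHasElement, enqueueNext, and log are hypothetical stand-ins, not real Crawlee context properties:

```javascript
// Since JS can't cancel a running function from the outside, the
// handler checks the condition itself and returns before enqueueing.
async function requestHandler({ pageHasElement, enqueueNext, log }) {
  if (!pageHasElement) {
    log.push('element missing, skipping this page');
    return; // a plain `return;` is enough: nothing gets enqueued
  }
  enqueueNext();
}
```

Since the handler returns nothing anyway, a bare return; really does suffice for skipping one page; it just doesn't stop the crawler as a whole, which keeps processing whatever is already in the queue.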
yeah, that should not be an issue since we can control the number of requests via maxConcurrency