sunny-green, 3y ago

Is there a way to store the state and continue?

Hello there. I am looking for a way to store the crawler's current state (where it is in the crawl), so that if an error occurs and it crashes, we can fix the problem and continue from that point. For example, I wrote a program that crawls Google's search pages, and I want to crawl 1000+ pages, so it takes a long time. While crawling, an error occurred because of a problem in our program: we missed a special button on Google's page. We have fixed it now, but we have no way to continue from where it stopped. We want to, because the run had already taken about 5 hours, and otherwise we have to waste another 5 hours.
8 Replies
xenial-black, 3y ago
I too am looking for a similar solution
sunny-green (OP), 3y ago
This must happen to anyone who writes a long-running scraper, so...
typical-coral, 3y ago
One way is to set an env var: either process.env.CRAWLEE_PURGE_ON_START = 'false'; or process.env.APIFY_PURGE_ON_START = 'false'; would work. It's also doable through configuration: https://crawlee.dev/docs/guides/configuration - with crawlee.json or a Configuration instance. The parameters are described here: https://crawlee.dev/api/core/interface/ConfigurationOptions#purgeOnStart
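A minimal sketch of the env var approach, for anyone who wants to see it in context. The function name `resumeCrawl`, the choice of `CheerioCrawler`, and the handler body are assumptions for illustration, not something from this thread. It assumes the `crawlee` package is installed and that the previous run left its default storages on disk.

```javascript
// Disable the automatic purge of the default storages (request queue,
// key-value store, dataset) before the crawler starts.
process.env.CRAWLEE_PURGE_ON_START = 'false';

// Hypothetical helper: starts a crawler on top of whatever the previous
// run left in the default storages.
async function resumeCrawl(startUrls) {
    // Loaded lazily so the env var above is already set when Crawlee runs.
    const { CheerioCrawler } = await import('crawlee');

    const crawler = new CheerioCrawler({
        async requestHandler({ request, $, log }) {
            // Placeholder handler: the real scraping logic goes here.
            log.info(`${request.url}: ${$('title').text()}`);
        },
    });

    // With purging disabled, requests the queue already marks as handled
    // should be skipped, so passing the same start URLs again is harmless.
    await crawler.run(startUrls);
}
```

Calling `resumeCrawl(['https://example.com'])` after a crash should then pick up the pending requests instead of starting from zero, provided the storage directory from the earlier run is still there.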
sunny-green (OP), 3y ago
Thank you. As you said, we can use useState to manage the progress state.
typical-coral, 3y ago
@rikusen0335 keep in mind that useState is a method where you need to explicitly provide the data to be saved. And even if you use state, by default the next crawler start will purge the storages (including the request queue).
xenial-black, 3y ago
Can you go to a saved state on the existing crawler, or can you only use that method on a new crawler? We don't want to open a new crawler due to cookies and reauthentication complications.
typical-coral, 3y ago
useState() is used to keep certain values in memory - the values should be serializable and are saved to the key-value store periodically. You can return to an existing state if you were using the state before and it was saved. But if you need to continue the scraping where you left off, the easiest way is probably to set the env variable as mentioned above.
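A rough sketch of how useState() and the env variable could be combined. The names `recordPage` and `crawlWithProgress` and the shape of the state object are hypothetical, chosen only for this example. It assumes the `crawlee` package is installed.

```javascript
// Hypothetical helper: updates the progress object in place.
// The state must stay serializable (plain numbers, strings, arrays, objects).
function recordPage(state, url) {
    state.pagesDone += 1;
    state.lastUrl = url;
    return state;
}

async function crawlWithProgress(startUrls) {
    // Keep the storages from the previous run; otherwise the saved state
    // and the request queue are purged when the crawler starts.
    process.env.CRAWLEE_PURGE_ON_START = 'false';
    const { CheerioCrawler } = await import('crawlee');

    const crawler = new CheerioCrawler({
        async requestHandler({ request, log }) {
            // Returns the previously saved state if one exists,
            // otherwise the default value passed here.
            const state = await crawler.useState({ pagesDone: 0, lastUrl: null });
            recordPage(state, request.url);
            log.info(`Pages done so far: ${state.pagesDone}`);
        },
    });

    await crawler.run(startUrls);
}
```

The state object only tracks what the handler explicitly writes into it. The position in the crawl itself comes from the preserved request queue, which is why the env variable is still needed.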
