robust-apricot•2y ago
Go-to solution to prevent recrawl?
What is the go-to solution for crawling a website and all its events every day for new items, so that it doesn't recrawl pages that have already been crawled? Is there a built-in solution in Crawlee, or should I keep track of a list in some DB to prevent this? Thanks
5 Replies
Crawlee doesn't recrawl the same URLs if you disable the default purging. But that applies to all URLs, and usually you just want to skip the final detail pages. In that case, store the already-scraped URLs in a named persisted dataset and use that as a filter when enqueueing.
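A minimal sketch of the filtering pattern described above, in plain TypeScript: load previously scraped URLs into a `Set` and drop them from the candidate links before enqueueing. The function name `filterNewUrls` and the example URLs are hypothetical; in a real crawler the `Set` would be populated from the named persisted dataset and the surviving links passed to Crawlee's enqueueing helpers.

```typescript
// Keep only URLs that have not been scraped on a previous run.
// `alreadyScraped` would be built from the persisted dataset of scraped URLs.
function filterNewUrls(candidates: string[], alreadyScraped: Set<string>): string[] {
    return candidates.filter((url) => !alreadyScraped.has(url));
}

// Example: two of the three event pages were scraped yesterday.
const seen = new Set([
    "https://example.com/event/1",
    "https://example.com/event/2",
]);
const links = [
    "https://example.com/event/1",
    "https://example.com/event/2",
    "https://example.com/event/3",
];
console.log(filterNewUrls(links, seen)); // only event/3 remains
```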
fair-rose•2y ago
Is there a way to download/upload/modify that persisted dataset?
Datasets are generally append-only, but you can re-create them. Locally they are stored as a bunch of JSON files; on the Apify platform there's a nice API for them.
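Since the locally persisted dataset is just JSON files on disk, it can be read back directly with Node's `fs` module. This is a hedged sketch assuming a one-JSON-file-per-item layout (the directory path and file naming are assumptions, not Crawlee's documented API; prefer the `Dataset` class where possible):

```typescript
import * as fs from "fs";
import * as path from "path";

// Read every *.json item file in a local dataset directory into memory.
// `dir` would be something like storage/datasets/<name>/ (an assumed layout).
function readLocalDataset(dir: string): unknown[] {
    return fs
        .readdirSync(dir)
        .filter((file) => file.endsWith(".json"))
        .map((file) => JSON.parse(fs.readFileSync(path.join(dir, file), "utf8")));
}
```

From there you can transform the items and write them into a freshly created dataset, which is the practical way to "modify" an append-only store.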
fair-rose•2y ago
Thanks @Lukas Krivka . How about RequestQueue?