robust-apricot•2y ago
Go-to solution to prevent recrawl?
What is the go-to solution for crawling a website and all its events every day for new items, so that it doesn't recrawl pages that have already been crawled? Is there a built-in solution in Crawlee, or should I keep track of a list in some DB to prevent this? Thanks
5 Replies
Crawlee doesn't recrawl the same URLs if you disable the default purging. But that applies to all URLs, and usually you just want to skip the final detail pages. In that case, store the already-scraped URLs in a named persisted dataset and use that as a filter when enqueueing.
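A minimal sketch of the filtering pattern described above, in plain TypeScript: load previously scraped URLs into a `Set` and drop them from the candidate links before enqueueing. The function name `filterNewUrls` and the example URLs are hypothetical; in a real crawler the `Set` would be populated from the named persisted dataset and the surviving links passed to Crawlee's enqueueing helpers.

```typescript
// Keep only URLs that have not been scraped on a previous run.
// `alreadyScraped` would be built from the persisted dataset of scraped URLs.
function filterNewUrls(candidates: string[], alreadyScraped: Set<string>): string[] {
    return candidates.filter((url) => !alreadyScraped.has(url));
}

// Example: two of the three event pages were scraped yesterday.
const seen = new Set([
    "https://example.com/event/1",
    "https://example.com/event/2",
]);
const links = [
    "https://example.com/event/1",
    "https://example.com/event/2",
    "https://example.com/event/3",
];
console.log(filterNewUrls(links, seen)); // only event/3 remains
```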
fair-rose•2y ago
Is there a way to download/upload/modify that persisted dataset?
Datasets are generally append-only, but you can re-create them. Locally they are stored as a bunch of JSON files; on the Apify platform there's a nice API for them.
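Since the locally persisted dataset is just JSON files on disk, it can be read back directly with Node's `fs` module. This is a hedged sketch assuming a one-JSON-file-per-item layout (the directory path and file naming are assumptions, not Crawlee's documented API; prefer the `Dataset` class where possible):

```typescript
import * as fs from "fs";
import * as path from "path";

// Read every *.json item file in a local dataset directory into memory.
// `dir` would be something like storage/datasets/<name>/ (an assumed layout).
function readLocalDataset(dir: string): unknown[] {
    return fs
        .readdirSync(dir)
        .filter((file) => file.endsWith(".json"))
        .map((file) => JSON.parse(fs.readFileSync(path.join(dir, file), "utf8")));
}
```

From there you can transform the items and write them into a freshly created dataset, which is the practical way to "modify" an append-only store.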
fair-rose•2y ago
Thanks @Lukas Krivka . How about RequestQueue?