dependent-tan•2y ago
Strategy to prevent crawling data that has been crawled before
I have a web page with pagination, and those paginated pages contain links that I need to crawl. New items are added to the pagination periodically, with newer items appearing on top.
I'm thinking of pseudocode to crawl only what's needed, something like this (a code sketch follows the list):
- For page 1 to n:
  - Collect item links
  - For each link:
    - If the link has been visited, exit/shut down the scraper completely
    - Else, put it into the DETAIL queue
Later on, in the DETAIL handler:
- Scrape the page
- Mark the link as visited
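A minimal Crawlee sketch of that list/detail flow (the start URL and the `a.item-link` / `a.next-page` selectors are my own placeholders, not from the original post). Within a run, the request queue deduplicates URLs automatically, so "mark as visited" happens implicitly when a request is handled:

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks }) {
        if (request.label === 'DETAIL') {
            // Scrape the detail page. Crawlee marks the request as handled
            // automatically, so the same URL is never processed twice in a run.
            await Dataset.pushData({ url: request.url, title: $('h1').text() });
            return;
        }
        // LIST page: enqueue item links with the DETAIL label (same queue,
        // different handler branch). Already-enqueued URLs are skipped silently.
        await enqueueLinks({ selector: 'a.item-link', label: 'DETAIL' });
        // Follow the pagination link to the next page.
        await enqueueLinks({ selector: 'a.next-page', label: 'LIST' });
    },
});

await crawler.run([{ url: 'https://example.com/page/1', label: 'LIST' }]);
```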
Now I'm wondering how to actually mark a link as visited. I'm considering having the crawling script connect to a database where the link is the primary key, and simply checking whether the link is already in the database.
I'm assuming that Crawlee on the Apify platform can open a connection to an external database. Please correct me if I'm wrong.
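On connecting out: an Actor on the Apify platform is ordinary Node.js code and can open outbound network connections, so an external database works. A minimal sketch using node-postgres, assuming a `visited` table with `url` as the primary key (the table name, schema, and `DATABASE_URL` env var are illustrative assumptions):

```ts
import { Pool } from 'pg';

// Assumed schema: CREATE TABLE visited (url TEXT PRIMARY KEY);
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function isVisited(url: string): Promise<boolean> {
    const { rowCount } = await pool.query(
        'SELECT 1 FROM visited WHERE url = $1',
        [url],
    );
    return (rowCount ?? 0) > 0;
}

export async function markVisited(url: string): Promise<void> {
    // ON CONFLICT DO NOTHING makes the insert idempotent if two handlers race.
    await pool.query(
        'INSERT INTO visited (url) VALUES ($1) ON CONFLICT DO NOTHING',
        [url],
    );
}
```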
Am I overcomplicating things, or is there a better idea?
5 Replies
The RequestQueue is already doing all that; visited links are not processed again. See https://crawlee.dev/api/core/class/RequestQueue
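For illustration, a minimal sketch of that deduplication (the URL is hypothetical): `addRequest()` reports whether the URL was already in the queue.

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// First add: the request is accepted.
const first = await queue.addRequest({ url: 'https://example.com/item/1' });
console.log(first.wasAlreadyPresent); // false

// Second add of the same URL: deduplicated by uniqueKey (the URL by default).
const second = await queue.addRequest({ url: 'https://example.com/item/1' });
console.log(second.wasAlreadyPresent); // true
console.log(second.wasAlreadyHandled); // false here; true once a crawler processes it
```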
dependent-tanOP•2y ago
@HonzaS thank you, I didn't know that.
@HonzaS Actually, I think that won't work for my case. I'm specifically looking for a way to check, during the Crawlee run, whether a link has already been visited, so that I know to stop paginating.
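One way to get exactly that from the queue itself: `addRequest()` returns `wasAlreadyPresent`, so the LIST handler can stop paginating at the first already-seen link. A sketch under these assumptions: a named queue (named storages persist across runs on the Apify platform, unlike the per-run default queue), newer items sorted on top as described above, and hypothetical selectors:

```ts
import { CheerioCrawler, Dataset, RequestQueue } from 'crawlee';

// A named queue keeps its state between actor runs.
const queue = await RequestQueue.open('item-links');

const crawler = new CheerioCrawler({
    requestQueue: queue,
    async requestHandler({ request, $ }) {
        if (request.label === 'DETAIL') {
            await Dataset.pushData({ url: request.url, title: $('h1').text() });
            return;
        }
        // LIST page: add detail links one by one and inspect the result.
        let allNew = true;
        for (const el of $('a.item-link').toArray()) { // selector is hypothetical
            const href = $(el).attr('href');
            if (!href) continue;
            const url = new URL(href, request.loadedUrl ?? request.url).href;
            const { wasAlreadyPresent } = await queue.addRequest({ url, label: 'DETAIL' });
            if (wasAlreadyPresent) {
                // Newer items are on top, so everything below this link is old too.
                allNew = false;
                break;
            }
        }
        if (allNew) {
            // Only continue to the next page if every item on this page was new.
            const next = $('a.next-page').attr('href'); // selector is hypothetical
            if (next) {
                const url = new URL(next, request.loadedUrl ?? request.url).href;
                await queue.addRequest({ url, label: 'LIST' });
            }
        }
    },
});

await crawler.run([{ url: 'https://example.com/page/1', label: 'LIST' }]);
```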
dependent-tanOP•2y ago
Thanks @Lukas Krivka