Strategy to avoid re-crawling data that has been crawled before
I have a web page with pagination, and those paginated pages contain links that I need to crawl. New items are added periodically, with the newest items appearing on top.
I am thinking of some pseudocode to crawl only what's needed. Something like this:
- For page 1 to n
  - Collect item links
  - For each link
    - If the link is visited, exit/shut down the scraper completely
    - Else put it into the DETAIL queue

DETAIL handler:
- Scrape the item
- Mark the link as visited
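To make the idea concrete, here is a minimal, framework-free sketch of that loop in Python (function names and the `pages`/`enqueue` shapes are my own, not Crawlee's API). The early exit relies on the newest-first ordering: the first already-visited link implies everything after it was crawled in an earlier run.

```python
def crawl(pages, visited, enqueue):
    """Walk pagination pages newest-first; stop at the first already-seen link.

    pages   -- iterable of pages, each a list of item links (newest first)
    visited -- set-like store of links scraped in previous runs
    enqueue -- callback that hands a link to the DETAIL handler
    """
    for page in pages:                 # page 1 .. n
        for link in page:
            if link in visited:
                return                 # everything older was crawled before
            enqueue(link)              # new item: queue it for scraping


def detail_handler(link, visited, scrape):
    scrape(link)
    visited.add(link)                  # mark only after a successful scrape
```

One caveat worth noting: the hard stop is only safe if the listing is strictly newest-first. If an old item can be bumped back to the top or re-listed out of order, the scraper would exit early and miss newer items below it; in that case, skip the visited link and continue instead of returning.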
I am assuming that Crawlee on the Apify platform can create a connection to an external database. Please correct me if I am wrong.
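For the persistence piece, a sketch of the "visited links" store, using stdlib `sqlite3` as a stand-in for whatever external database is used (the table and function names here are mine, not from Crawlee or Apify):

```python
import sqlite3


def open_store(path="visited.db"):
    # sqlite3 stands in for the external database; any DB client
    # with an equivalent "insert if absent" / "exists?" pair works.
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")
    return con


def is_visited(con, url):
    row = con.execute("SELECT 1 FROM visited WHERE url = ?", (url,)).fetchone()
    return row is not None


def mark_visited(con, url):
    # PRIMARY KEY + INSERT OR IGNORE makes marking idempotent across retries
    con.execute("INSERT OR IGNORE INTO visited (url) VALUES (?)", (url,))
    con.commit()
```

As I understand it, an external database may not even be necessary: an Apify named key-value store or dataset persists between runs, so the visited set could live there instead, but I'd welcome correction on that.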
Am I overcomplicating things, or is there a better idea?