Strategy to prevent crawling data that has been crawled before

I have a web page with pagination, and those paginated pages contain links that I need to crawl. New items are added to these pages periodically, with the newest items appearing at the top.

I am thinking of some pseudocode to crawl only what's needed. Something like this (a rough Crawlee sketch follows the outline):
  • For page 1 to n
    • Collect item links
    • For each link
      • If the link has already been visited, exit/shut down the scraper completely
      • Else put it into the DETAIL queue

Later on, in the DETAIL handler:
  • Scrape the item
  • Mark the link as visited
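
Here is a minimal sketch of how I imagine this could look with Crawlee's router. The selectors (`a.item-link`, `a.next-page`, `h1`) and the `isVisited` / `markVisited` helpers are placeholders I would still need to fill in, and instead of forcibly shutting the whole scraper down, this version simply stops enqueueing further pages once it hits a known link:

```ts
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
// Hypothetical helpers backed by the visited-link store (see the second sketch below).
import { isVisited, markVisited } from './visited-store.js';

const router = createCheerioRouter();

// List/pagination handler: collect item links, stop as soon as a known link shows up.
router.addDefaultHandler(async ({ $, request, crawler, log }) => {
    let sawVisited = false;

    for (const el of $('a.item-link').toArray()) { // placeholder selector
        const href = $(el).attr('href');
        if (!href) continue;
        const url = new URL(href, request.url).href;

        if (await isVisited(url)) {
            // Newest items come first, so the first already-visited link means
            // everything after it was crawled in a previous run.
            sawVisited = true;
            break;
        }
        await crawler.addRequests([{ url, label: 'DETAIL' }]);
    }

    // Only move on to the next pagination page while new items keep appearing.
    const next = $('a.next-page').attr('href'); // placeholder selector
    if (!sawVisited && next) {
        await crawler.addRequests([new URL(next, request.url).href]);
    } else if (sawVisited) {
        log.info('Hit an already-visited item; not enqueueing further pages.');
    }
});

// DETAIL handler: scrape the item, then mark its link as visited.
router.addHandler('DETAIL', async ({ $, request, pushData }) => {
    await pushData({
        url: request.url,
        title: $('h1').first().text().trim(), // placeholder extraction
    });
    await markVisited(request.url);
});

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://example.com/listing?page=1']);
```

(Within a single run, Crawlee's request queue already deduplicates identical URLs, so the visited check is really only needed to carry state across runs.)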
Now I am thinking about how to actually mark a link as visited. My idea is to have the crawling script connect to a database where the link is the primary key, and simply check whether the link is already in the database or not.

I am assuming that Crawlee on the Apify platform would be able to create a connection to an external database. Please correct me if I am wrong.
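
For the database side, this is roughly what I have in mind, assuming a PostgreSQL database that accepts connections from the actor and a connection string passed in via a `POSTGRES_URL` environment variable (the table, column, and helper names are just placeholders):

```ts
// visited-store.ts — minimal sketch of the "link as primary key" idea.
import pg from 'pg';

const pool = new pg.Pool({ connectionString: process.env.POSTGRES_URL });

// One-time setup: a table whose primary key is the link itself.
export async function ensureTable(): Promise<void> {
    await pool.query(
        'CREATE TABLE IF NOT EXISTS visited_links (url TEXT PRIMARY KEY, visited_at TIMESTAMPTZ DEFAULT now())',
    );
}

// True if the link has already been stored.
export async function isVisited(url: string): Promise<boolean> {
    const res = await pool.query('SELECT 1 FROM visited_links WHERE url = $1', [url]);
    return (res.rowCount ?? 0) > 0;
}

// Record the link; the primary key makes repeated inserts harmless.
export async function markVisited(url: string): Promise<void> {
    await pool.query(
        'INSERT INTO visited_links (url) VALUES ($1) ON CONFLICT (url) DO NOTHING',
        [url],
    );
}
```

As far as I understand, an actor on Apify is just a regular Node.js process, so any standard database client should work as long as the database is reachable from Apify's servers. I also understand that named storages (key-value stores or request queues) on the Apify platform persist between runs, so maybe one of those could serve as the visited-link store instead of an external database.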

Am I overcomplicating things, or is there a better idea?