sad-indigo
Apify & Crawlee, 2y ago

Already crawled URLs

How does the crawler know whether a URL has already been crawled? And how can I get a list of the URLs that have already been visited/crawled?

I noticed that some JSON files in the request queue have `orderNo=null`. Is that how it knows? And if not, why is it null?

My intention with these questions is also to understand why my request queue is growing so much. Perhaps I could "optimize" the crawler so that, if there are already too many requests in the queue, I'd stop calling "enqueueLinks" but keep a note of those URLs so they could be enqueued (i.e. I'd call "enqueueLinks" for them) at a later time. What do you think?
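To make the idea concrete, here is a rough sketch of what I have in mind. It is plain TypeScript and does not use Crawlee at all: `CappedQueue`, `add`, `refill` and the cap value are all names I made up for illustration, and I'm only assuming the real queue deduplicates in a broadly similar way.

```typescript
// Hypothetical sketch only: none of these names come from Crawlee.
type AddResult = "queued" | "deferred" | "duplicate";

class CappedQueue {
  // Every URL ever accepted, pending or already handled (used for deduplication).
  private seen = new Set<string>();
  // URLs waiting to be crawled.
  private pending: string[] = [];
  // URLs noted for later because the queue was full.
  private deferred = new Set<string>();
  private maxPending: number;

  constructor(maxPending: number) {
    this.maxPending = maxPending;
  }

  // Queue the URL if there is room, otherwise keep a note of it for later.
  add(url: string): AddResult {
    if (this.seen.has(url)) return "duplicate";
    if (this.pending.length >= this.maxPending) {
      this.deferred.add(url);
      return "deferred";
    }
    this.seen.add(url);
    this.deferred.delete(url);
    this.pending.push(url);
    return "queued";
  }

  // Take the next URL to crawl; it stays in `seen` so it is never re-queued.
  next(): string | undefined {
    return this.pending.shift();
  }

  // Move deferred URLs back into the queue while there is room.
  refill(): number {
    let moved = 0;
    for (const url of [...this.deferred]) {
      if (this.pending.length >= this.maxPending) break;
      if (this.add(url) === "queued") moved++;
    }
    return moved;
  }

  get pendingCount(): number {
    return this.pending.length;
  }

  get deferredUrls(): string[] {
    return [...this.deferred];
  }
}

// Usage with an assumed cap of 2 pending requests.
const queue = new CappedQueue(2);
const results = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/c",
  "https://example.com/a",
].map((url) => queue.add(url));
// results is ["queued", "queued", "deferred", "duplicate"]
```

In a real crawler I imagine the request handler would check the pending count before calling "enqueueLinks", and something like `refill()` would run once the queue drains a bit.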

Thank you!

EDIT:
I came to realize that the request queue keeps growing because already scraped sources remain there (to make sure we don't scrape/crawl the same page twice). My question is: is there a way, through the API, to remove already crawled URLs in case the user wants to free up some space (my case)?