extended-salmon · 2y ago

Already crawled URLs

How does the crawler know if a URL has already been crawled? And how can I get a list of those already visited/crawled URLs? I noticed some JSON files in the request queue have orderNo=null; is that it? And if not, why is it null?

My intention with these questions is also to understand why my request queue is growing (a lot). Perhaps I could "optimize" the crawler: if there are already too many requests in the queue, I'd stop calling "enqueueLinks" but would keep note of those URLs so they could be scraped (i.e. I'd "enqueueLinks" them) at a later time. What do you think? Thank you!

EDIT: I came to realize that the request queue keeps growing because already scraped requests remain there (to make sure we don't scrape/crawl the same page twice). My question is: is there a way through the API to remove already crawled URLs, in case the user wants to free some space (my case)?
3 Replies
lemurio · 2y ago
Hi, you can delete a request from a request queue using the API or the Apify Client for Python/JavaScript. If a request's handledAt is not undefined, then it was crawled. Have you also considered the improved RequestQueue v2?
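As a minimal sketch of the handledAt check described above: a request object from the Apify request queue carries a handledAt timestamp once it has been processed (the object shape below is simplified for illustration, with hypothetical URLs and IDs):

```javascript
// A request has been crawled when its `handledAt` field is set;
// pending requests have `handledAt` as null/undefined.
function isCrawled(request) {
  return request.handledAt !== undefined && request.handledAt !== null;
}

// Illustrative data (shapes simplified, values hypothetical):
const requests = [
  { id: 'a1', url: 'https://example.com/', handledAt: '2023-01-01T00:00:00.000Z' },
  { id: 'b2', url: 'https://example.com/next', handledAt: null },
];

// Collect the URLs that were already visited/crawled.
const crawled = requests.filter(isCrawled).map((r) => r.url);
```

Filtering the listed requests this way gives you the "already crawled" set without any extra bookkeeping on your side.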
extended-salmon (OP) · 2y ago
Thank you @lemurio. The advantages of RequestQueueV2 are still not clear to me. I tried running the crawler with it for a bit, but the files saved in the storage seem to have exactly the same format. Did I miss something? Is there a place where I could read a bit more about it? (The docs seem a bit lacking at the moment, which I understand, as it still seems to be under development.)
Lukas Krivka · 2y ago
Basically, you need to list the requests in the queue and then delete them one by one.
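The list-then-delete pass could look roughly like this. It assumes an apify-client-style RequestQueueClient exposing listRequests({ limit, exclusiveStartId }) and deleteRequest(id); check those method names against the version of the Apify Client for JavaScript you are using, as this is a sketch rather than a verified implementation:

```javascript
// Sketch: page through a request queue and delete the already-handled
// requests, freeing space while keeping pending requests intact.
async function purgeHandledRequests(queueClient, pageSize = 1000) {
  let exclusiveStartId; // pagination cursor: last request id seen so far
  let deleted = 0;
  for (;;) {
    const { items } = await queueClient.listRequests({
      limit: pageSize,
      exclusiveStartId,
    });
    if (items.length === 0) break; // no more pages
    for (const request of items) {
      // `handledAt` being set marks the request as already crawled.
      if (request.handledAt) {
        await queueClient.deleteRequest(request.id);
        deleted += 1;
      }
    }
    exclusiveStartId = items[items.length - 1].id;
  }
  return deleted;
}

// Usage with the real client (token and queue id are placeholders):
//   const { ApifyClient } = require('apify-client');
//   const client = new ApifyClient({ token: 'MY_TOKEN' });
//   const deleted = await purgeHandledRequests(client.requestQueue('MY_QUEUE_ID'));
```

Note that each deletion is a separate API call, so a large queue will take a while; keeping pending requests untouched means the crawler can still resume from where it left off.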
