extended-salmon · 2y ago

Already crawled URLs

How does the crawler know if a URL has already been crawled? And how can I get a list of those already visited/crawled URLs? I noticed some JSON files in the request queue have orderNo=null; is that it? And if not, why is it null?

My intention with these questions is also to understand why my request queue is growing (a lot). Perhaps I could "optimize" the crawler: if there are already too many requests in the queue, I'd stop calling "enqueueLinks" but would keep note of those URLs so they could be scraped (i.e. I'd "enqueueLinks" them) at a later time. What do you think? Thank you!

EDIT: I came to realize that the request queue keeps growing because already scraped requests remain there (to make sure we don't scrape/crawl the same page twice). My question is: is there a way through the API to remove already crawled URLs, in case the user wants to free some space (my case)?
3 Replies
lemurio · 2y ago
Hi, you can delete a request from a request queue using the API or the Apify Client for Python/JavaScript. If a request's handledAt is not undefined, then it was crawled. Have you also considered the improved RequestQueue v2?
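As a minimal sketch of the handledAt check described above: a request object from the Apify request queue carries a handledAt timestamp once it has been processed (the object shape below is simplified for illustration, with hypothetical URLs and IDs):

```javascript
// A request has been crawled when its `handledAt` field is set;
// pending requests have `handledAt` as null/undefined.
function isCrawled(request) {
  return request.handledAt !== undefined && request.handledAt !== null;
}

// Illustrative data (shapes simplified, values hypothetical):
const requests = [
  { id: 'a1', url: 'https://example.com/', handledAt: '2023-01-01T00:00:00.000Z' },
  { id: 'b2', url: 'https://example.com/next', handledAt: null },
];

// Collect the URLs that were already visited/crawled.
const crawled = requests.filter(isCrawled).map((r) => r.url);
```

Filtering the listed requests this way gives you the "already crawled" set without any extra bookkeeping on your side.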
extended-salmon (OP) · 2y ago
Thank you @lemurio. The advantages of RequestQueueV2 are still not clear to me. I tried running the crawler with it for a bit, but the files saved in the storage seem to have exactly the same format. Did I miss something? Is there a place where I could read a bit more about it? (The docs seem a bit lacking at the moment, which I understand, as it still seems to be under development.)
Lukas Krivka · 2y ago
Basically, you need to list the requests in the queue and then delete them one by one.
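The list-then-delete pass could look roughly like this. It assumes an apify-client-style RequestQueueClient exposing listRequests({ limit, exclusiveStartId }) and deleteRequest(id); check those method names against the version of the Apify Client for JavaScript you are using, as this is a sketch rather than a verified implementation:

```javascript
// Sketch: page through a request queue and delete the already-handled
// requests, freeing space while keeping pending requests intact.
async function purgeHandledRequests(queueClient, pageSize = 1000) {
  let exclusiveStartId; // pagination cursor: last request id seen so far
  let deleted = 0;
  for (;;) {
    const { items } = await queueClient.listRequests({
      limit: pageSize,
      exclusiveStartId,
    });
    if (items.length === 0) break; // no more pages
    for (const request of items) {
      // `handledAt` being set marks the request as already crawled.
      if (request.handledAt) {
        await queueClient.deleteRequest(request.id);
        deleted += 1;
      }
    }
    exclusiveStartId = items[items.length - 1].id;
  }
  return deleted;
}

// Usage with the real client (token and queue id are placeholders):
//   const { ApifyClient } = require('apify-client');
//   const client = new ApifyClient({ token: 'MY_TOKEN' });
//   const deleted = await purgeHandledRequests(client.requestQueue('MY_QUEUE_ID'));
```

Note that each deletion is a separate API call, so a large queue will take a while; keeping pending requests untouched means the crawler can still resume from where it left off.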
