genetic-orange•3y ago
How to recrawl initial page for new links without purging to keep track of duplicates?
Thank you for releasing Crawlee as an open source project.
I have set up a very simple CheerioCrawler that, for starters, crawls the home page of a news website.
I am then scraping certain articles and saving some information to an external database. I aim to run the crawler at a regular interval (every few hours) to check for new links/articles.
I want to keep track of what I've crawled in previous runs so as not to re-visit those pages and waste resources (mine and the host's), so I set CRAWLEE_PURGE_ON_START to false to preserve that history between runs.
Current State:
- After the first run, the home page is marked as "handled" and is never visited again on subsequent runs to look for new links within it.
Desired State:
- On each new run, crawl the same home page, and enqueue only the new links found for handling/scraping.
Is there a way to make my starting home page (example.com) re-crawlable on each run without purging? I believe it's something I can add within the default handler, I'm just not sure what exactly it is.
Thank you.
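For context, a minimal sketch of the setup described above (the globs, selectors, and dataset fields are assumptions for illustration, not the actual project code):
```ts
// Run with CRAWLEE_PURGE_ON_START=false so storages are kept between runs.
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, log }) {
        if (request.label === 'ARTICLE') {
            // Scrape the article and save it (an external DB in the real project).
            await Dataset.pushData({
                url: request.loadedUrl,
                title: $('h1').first().text().trim(),
            });
            return;
        }
        // Home page: enqueue article links found on it.
        log.info(`Enqueueing article links from ${request.loadedUrl}`);
        await enqueueLinks({ globs: ['https://example.com/articles/**'], label: 'ARTICLE' });
    },
});

await crawler.run(['https://example.com']);
```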
10 Replies
ambitious-aqua•3y ago
I'm trying to achieve the same thing. Couldn't find a built-in Crawlee solution for it either. I think the industry term is "real-time data extraction". For the duplication though, I came up with a solution.
I noticed that the unique request id of the queued URLs never changes. Not sure if that could change at some point with a Crawlee library update, I hope not.
So when I extract data from the page, I also record the request id of the page's url.
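A rough sketch of that idea (the dataset name and fields are illustrative, not the poster's actual code):
```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Record the request id next to the extracted data, so later runs
        // can check whether this page was already processed.
        const store = await Dataset.open('articles');
        await store.pushData({
            requestId: request.id,          // id Crawlee assigned to this request
            uniqueKey: request.uniqueKey,   // the normalized URL by default
            url: request.loadedUrl,
            title: $('title').text().trim(),
        });
    },
});
```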
I'll focus on the real-time extraction logic today. Ideally it should automatically switch to the real-time mode after the initial "full" scrape.
Though a built-in solution for these would be way nicer. Maybe there is one, idk, still exploring.
genetic-orangeOP•3y ago
Thanks for your reply. I don't have an issue with the duplication actually, I can rely on Crawlee's built-in deduplication and it functions well so far.
My main issue is re-crawling the same main page for new links. Crawlee considers the main page "crawled" and "handled", so on subsequent runs it doesn't visit it anymore and the run ends without having crawled any pages at all.
genetic-orangeOP•3y ago
Actually, I found this discussion on the GitHub repo that proved useful; maybe it'll be useful to you too. It's a bit old and I think the interface for updating the request queue has changed since, but I ended up deleting the request for my entry/start URLs at the end of each run, using, as you said, the "unique" id of that particular request.
https://github.com/apify/crawlee/discussions/1322
Let me know if you find a better way.
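For reference, a sketch of that workaround. Note that deleteRequest lives on the low-level storage client, so treat the exact call as an assumption and check it against your Crawlee version:
```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

const queue = await RequestQueue.open(); // default queue, kept across runs (purge disabled)

// addRequest returns info about the queued request, including its id.
const { requestId } = await queue.addRequest({ url: 'https://example.com' });

const crawler = new CheerioCrawler({
    requestQueue: queue,
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
});

await crawler.run();

// Remove the start URL's request so the next run treats it as unseen again.
// NOTE: deleteRequest is on the storage client; verify it exists in your version.
await queue.client.deleteRequest(requestId);
```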
absent-sapphire•3y ago
I’d recommend using a named RequestQueue. Request queues by default don't allow the same URL to be crawled twice, so just assign a unique UUID to the request for the initial page, and only new links will be crawled on each run. Named storages also don’t get purged.
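A sketch of wiring a named queue into the crawler (the queue name is arbitrary); named storages survive between runs:
```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Named request queues are not purged on start, so crawl history is kept across runs.
const queue = await RequestQueue.open('news-site');

const crawler = new CheerioCrawler({
    requestQueue: queue,
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Crawling ${request.loadedUrl}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```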
genetic-orangeOP•3y ago
That's a great suggestion regarding the named RequestQueue, thank you. But can you clarify how I would assign a unique UUID to the initial page? I can't find any function in the documentation that gives me that ability.
absent-sapphire•3y ago
Install the uuid package and use its v4() function
genetic-orangeOP•3y ago
Hahaha, thank you, but I'm aware of how to generate a UUID, just not how to assign it to a request. Is it by creating a RequestOptions object with a url and uniqueKey (the generated UUID) and passing that to .addRequests()? Is that the right way to go?
This seems to work.
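Something like this, assuming the uuid package is installed (a sketch of the approach described above, not the poster's exact code):
```ts
import { CheerioCrawler } from 'crawlee';
import { v4 as uuidv4 } from 'uuid';

const crawler = new CheerioCrawler({
    async requestHandler({ enqueueLinks }) {
        // Enqueued article links keep the default URL-based uniqueKey,
        // so previously seen articles are still skipped.
        await enqueueLinks();
    },
});

// A fresh uniqueKey on every run forces the start page to be re-crawled
// even though its URL is already in the (unpurged) queue.
await crawler.addRequests([
    { url: 'https://example.com', uniqueKey: uuidv4() },
]);

await crawler.run();
```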
absent-sapphire•3y ago
Yup! Apologies for not understanding your question well. To explain why your solution works:
By default, Crawlee uses the request URL as the unique key for the request. So, if you enqueue two requests to https://google.com, only one will be added to the queue and handled. When you provide your own uniqueKey, it goes off of that instead.
You can also use useExtendedUniqueKey, which is great when making POST requests to the same URL but with different payloads and headers (e.g. when scraping GraphQL).
This is something we have been discussing for years. Apify is now working on a new request queue implementation that will make it easier to manipulate requests.
For now, my solution would be to simply store crawled URLs in a named dataset or KV store, load that on every start, and filter them out when enqueueing.
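A sketch of that approach, assuming a named key-value store holding an array of already-crawled URLs (the store and key names are made up; returning a falsy value from transformRequestFunction skips the link, but check your Crawlee version's docs):
```ts
import { CheerioCrawler, KeyValueStore } from 'crawlee';

// Load the set of URLs crawled in previous runs.
const store = await KeyValueStore.open('crawl-history');
const seen = new Set<string>((await store.getValue<string[]>('crawled-urls')) ?? []);

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks }) {
        seen.add(request.loadedUrl ?? request.url);
        await enqueueLinks({
            // Skip links that were crawled in a previous run.
            transformRequestFunction: (req) => (seen.has(req.url) ? false : req),
        });
    },
});

await crawler.run(['https://example.com']);

// Persist the updated history for the next run.
await store.setValue('crawled-urls', [...seen]);
```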