Thanks. Do you know if a key could be stored on Apify for whatever a user previously scraped? Also, I thought that Apify automatically doesn't charge for duplicate records. Couldn't someone duplicate data in order to charge for more records?
8 Replies
rival-black•2mo ago
I'm not sure, I haven't done anything like that yet. But from what I've seen in other Actors, they usually require users to provide the ID of a request queue containing previous results.
Crawlee does deduplicate (by uniqueKey) within a single run, but there is no default deduplication across multiple runs.
You can implement this yourself by storing old IDs in a key-value store or Dataset, then preloading them and comparing (see the sketch below this list). Two problems with this:
1. It adds complexity, and it can slow down the start or cause memory issues if the ID set gets huge.
2. Users might get runs that return 0 results because there is no new data, so you have to handle that with an extra PPE event like actor-start.
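Roughly like this (a minimal sketch with the Apify SDK; the store name, the `SEEN_IDS` key, and using `email` as the unique ID are just assumptions for the example):
```ts
import { Actor } from 'apify';

await Actor.init();

// Preload the IDs seen in previous runs from a named key-value store.
const store = await Actor.openKeyValueStore('my-dedup-store');
const seen = new Set<string>((await store.getValue<string[]>('SEEN_IDS')) ?? []);

// Pretend these came out of the crawl; in reality you'd build this while scraping.
const scraped = [
    { name: 'John Doe', address: '123 Main St', email: 'john@gmail.com' },
    { name: 'Jane Smith', address: '456 Oak Ave', email: 'jane@gmail.com' },
];

// Only push (and charge for) records that were not seen in earlier runs.
const fresh = scraped.filter((item) => !seen.has(item.email));
await Actor.pushData(fresh);

// Persist the updated ID set for the next run.
for (const item of fresh) seen.add(item.email);
await store.setValue('SEEN_IDS', [...seen]);

await Actor.exit();
```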
How does it determine the unique key? Does it check all columns?
Thanks
No, it is derived from the URL
Is it based on a distinct across all fields or a combination of some fields?
Name        Address       Email
John Doe    123 Main St   john@gmail.com
John Doe    123 Main St   john.doe@gmail.com
Jane Smith  456 Oak Ave   jane@gmail.com
Would this return 2 results or 3?
Crawlee's deduplication happens on the request URL while scraping; it is not related to the output data. For a problem with a specific Actor, just message the author in Issues.
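To illustrate (a rough sketch with Crawlee; the custom uniqueKey value is just a hypothetical example):
```ts
import { Request } from 'crawlee';

// By default the uniqueKey is derived from the URL (a normalized form of it),
// so two requests for the same page collapse into one in the RequestQueue.
const byUrl = new Request({ url: 'https://example.com/people/john-doe' });
console.log(byUrl.uniqueKey);

// You can set the uniqueKey yourself if you need different behavior,
// e.g. dedup detail pages by a record ID instead of the exact URL.
const custom = new Request({
    url: 'https://example.com/people/john-doe?ref=search',
    uniqueKey: 'person:john-doe', // hypothetical custom key
});
console.log(custom.uniqueKey); // 'person:john-doe'
```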
Thanks. This is for my own Actor. I'm guessing I'm not using Crawlee if I don't import that module.
The deduplication happens in the RequestQueue, so you can use that without Crawlee, but Crawlee is the best way to use it.
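For example (a small sketch with the Apify SDK only, no crawler; the URL is just a placeholder):
```ts
import { Actor } from 'apify';

await Actor.init();

// The RequestQueue itself rejects duplicate URLs (same uniqueKey).
const queue = await Actor.openRequestQueue();

const first = await queue.addRequest({ url: 'https://example.com/page' });
const second = await queue.addRequest({ url: 'https://example.com/page' });

console.log(first.wasAlreadyPresent);  // false – newly enqueued
console.log(second.wasAlreadyPresent); // true  – deduplicated by uniqueKey

await Actor.exit();
```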