Xeno · 2mo ago

Thanks. Do you know if a key could be stored on apify for whatever a user previously scraped? Also, I thought that apify automatically doesn't charge for duplicate records. Couldn't someone duplicate data in order to charge for more records?
8 Replies
rival-black · 2mo ago
I'm not sure. I haven't done something like that yet. But from what I've seen of other Actors, they usually require the user to provide the ID of a request queue containing previous results.
Lukas Krivka · 2mo ago
Crawlee does deduplicate (by uniqueKey) within one run, but there is no default deduplication across multiple runs. You can implement it by storing the old IDs in a key-value store or dataset, then preloading them and comparing. Two problems with this: 1. It adds complexity, and possibly a slower start or memory issues if the ID set gets huge. 2. Users might get runs that return 0 results because there is no new data, so you have to handle that with some extra PPE event like actor-start.
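A minimal sketch of that cross-run approach, with the key-value store simulated by an in-memory Map so it runs standalone (in a real Actor you would persist to Apify's key-value store instead; the `SEEN_IDS` key and the record shape are assumptions for illustration):

```javascript
// Simulated key-value store; in a real Actor this would be the
// Apify key-value store, which persists between runs.
const kvStore = new Map();

// Load IDs scraped in previous runs (empty on the first run).
function loadSeenIds() {
  return new Set(kvStore.get('SEEN_IDS') ?? []);
}

// Persist the updated ID set for the next run to preload.
function saveSeenIds(seenIds) {
  kvStore.set('SEEN_IDS', [...seenIds]);
}

// Keep only records whose id was not seen in any earlier run.
function dedupeAcrossRuns(records) {
  const seenIds = loadSeenIds();
  const fresh = records.filter((r) => !seenIds.has(r.id));
  for (const r of fresh) seenIds.add(r.id);
  saveSeenIds(seenIds);
  return fresh;
}

// Run 1: everything is new.
const run1 = dedupeAcrossRuns([{ id: 'a' }, { id: 'b' }]);
// Run 2: 'b' was already scraped, so only 'c' comes back —
// with no new data this can legitimately return 0 results.
const run2 = dedupeAcrossRuns([{ id: 'b' }, { id: 'c' }]);
```

Note the second problem Lukas mentions: if `dedupeAcrossRuns` filters everything out, the run charges for 0 results, which is why an extra PPE event like actor-start is needed.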
Xeno (OP) · 2mo ago
How does it determine the unique key? Does it check all columns? Thanks
Lukas Krivka · 2mo ago
No, it is derived from the URL
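A toy illustration of a queue that dedupes by a URL-derived uniqueKey, as Crawlee does within one run. The normalization here (lowercasing, stripping trailing slashes) is a simplified assumption, not Crawlee's exact algorithm:

```javascript
// Toy request queue that dedupes by a URL-derived uniqueKey,
// mimicking (in simplified form) Crawlee's per-run behaviour.
class ToyRequestQueue {
  constructor() {
    this.seen = new Set();
    this.requests = [];
  }

  // Simplified stand-in for Crawlee's URL normalization.
  uniqueKeyFor(url) {
    return url.toLowerCase().replace(/\/+$/, '');
  }

  // Returns false (and skips the request) on a duplicate key.
  addRequest(url) {
    const key = this.uniqueKeyFor(url);
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.requests.push(url);
    return true;
  }
}

const queue = new ToyRequestQueue();
queue.addRequest('https://example.com/page');
queue.addRequest('https://EXAMPLE.com/page/'); // same uniqueKey → skipped
queue.addRequest('https://example.com/other');
```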
Xeno (OP) · 2mo ago
Is it based on distinct values of all fields, or a combination of some fields? For example:

Name | Address | Email
John Doe | 123 Main St | john@gmail.com
John Doe | 123 Main St | john.doe@gmail.com
Jane Smith | 456 Oak Ave | jane@gmail.com

Would this return 2 results or 3?
Lukas Krivka · 2mo ago
The Crawlee deduplication happens on the request URL while scraping; it is not related to the output data at all. For a problem with a specific Actor, just message the author in its Issues tab.
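Since Crawlee only dedupes request URLs, output rows like the three in the example above would all be pushed unless you dedupe them yourself. A sketch of manual output-level deduplication, assuming you choose which field combination counts as a duplicate (name + address here, an arbitrary choice):

```javascript
// Dedupe output records on a chosen combination of fields.
// Crawlee does NOT do this; it only dedupes request URLs.
function dedupeByFields(records, fields) {
  const seen = new Set();
  return records.filter((record) => {
    const key = fields.map((f) => record[f]).join('|');
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

const rows = [
  { name: 'John Doe', address: '123 Main St', email: 'john@gmail.com' },
  { name: 'John Doe', address: '123 Main St', email: 'john.doe@gmail.com' },
  { name: 'Jane Smith', address: '456 Oak Ave', email: 'jane@gmail.com' },
];

// Deduping on name + address collapses the two John Doe rows,
// leaving 2 results; deduping on all three fields would keep 3.
const unique = dedupeByFields(rows, ['name', 'address']);
```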
Xeno (OP) · 2mo ago
Thanks. This is for my own Actor. I'm guessing I'm not using Crawlee if I don't import that module.
Lukas Krivka · 2mo ago
The deduplication happens in the RequestQueue, so you can use that without Crawlee, but Crawlee is the best way to use it.
