De-duplicate dataset results

At a glance

The community member has an actor that returns a list of IDs, and they are concerned about duplicate results due to concurrent processes. They have considered using the returned ID as the key in the dataset, but this doesn't work because each result is a separate JSON file. Another idea was to open the dataset, get the full list of IDs, and only push IDs not present, but this adds overhead and introduces the possibility of race conditions.

In the comments, two possible solutions are suggested: 1) Create a global object to save the IDs for all entries, and check and save the ID for each entry before pushing it to the dataset. 2) Use an actor to remove duplicates from the dataset after the original actor finishes.

I have an actor that returns a simple list of IDs. It's possible that during a run, concurrent processes can overlap and produce duplicate results. Is there any accepted way of avoiding this?

At the most basic level I'd hoped that I could do something simple like using the returned ID as the key in the dataset (i.e. a duplicate result would write the same entry so a duplicate would not be created), but this doesn't seem to work, presumably because each result is actually a separate JSON file in the dataset.

I've also thought about opening the dataset and getting the full list of IDs, then only pushing IDs not present - this could work but adds overhead and also seems to introduce the possibility of race conditions.

So, is there any way to push only unique values to the dataset?
1 comment
There are two possible solutions:
1. Create a global object in which you save the IDs of all entries, and check it before pushing each entry to the dataset (see the first sketch below).
2. Run this Actor on your Actor's dataset after it finishes, to remove the duplicates: https://apify.com/lukaskrivka/dedup-datasets (see the second sketch below).
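A minimal sketch of the first suggestion, assuming the JavaScript Apify SDK v3; the `pushUnique` helper and the item shape are illustrative, not part of the SDK:

```typescript
import { Actor } from 'apify';

await Actor.init();

// Shared by every concurrent handler in this run, so an ID pushed by one
// process is skipped by all the others.
const seenIds = new Set<string>();

// Illustrative helper: check and record the ID before pushing the item.
async function pushUnique(item: { id: string }): Promise<void> {
    if (seenIds.has(item.id)) return; // Duplicate, skip it.
    seenIds.add(item.id);             // Record *before* the async push.
    await Actor.pushData(item);
}

await pushUnique({ id: 'abc-123' });
await pushUnique({ id: 'abc-123' }); // Ignored; only one record is stored.

await Actor.exit();
```

Because Node.js runs handler callbacks on a single thread, the `has`/`add` pair executes without interleaving, which closes the race-condition window the question raises. Note that the set lives only in memory, so a migrated or resurrected run would start over with an empty set unless you also persist it (for example to the key-value store on the `persistState` event).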
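And a hedged sketch of the second suggestion, calling the dedup Actor on the run's default dataset once scraping is done. The input field names shown (`datasetIds`, `fields`) are assumptions; verify them against the input schema on the Actor's page before relying on this:

```typescript
import { Actor } from 'apify';

await Actor.init();

// ... push items (possibly containing duplicates) during the run ...

// Assumed input shape; check https://apify.com/lukaskrivka/dedup-datasets
const { defaultDatasetId } = Actor.getEnv();
await Actor.call('lukaskrivka/dedup-datasets', {
    datasetIds: [defaultDatasetId], // Dataset(s) to de-duplicate.
    fields: ['id'],                 // Field(s) that define uniqueness.
});

await Actor.exit();
```

This trades the in-memory bookkeeping of the first approach for a post-processing step: duplicates still land in the original dataset, and the dedup Actor writes the unique items to its own output.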