A community member is looking for a way to redirect the output of multiple runs of the same scraper into the same existing dataset, appending the new records. The discussion brings up the Apify SDK's named-dataset feature, but the original poster is focused on the REST API, since it must be integrated with existing Java code. The community members explain that each run has its own default dataset and that there is no way to instruct an Actor to use a pre-existing custom dataset. The suggested solution is to create an "Integration Actor" that redirects the output to a custom dataset.
<answer>Ah, yes: if you are using a pre-existing Actor, there is no way to redirect the output, unless the Actor has a parameter that supports a custom dataset. Otherwise, you can create an "Integration Actor" that redirects the output to a custom dataset.</answer>
Is there a way to redirect the output of multiple runs of the same scraper to the same existing dataset, appending the new records? The order doesn't matter. Due to the limitations of the scraper I am using, I need to perform thousands of runs, each producing a very small amount of output that I would like to add to an existing dataset (obviously with the same format or schema). I skimmed through the Apify API documentation and did not find anything about it.
Ah, yes: if you are using a pre-existing Actor, there is no way to redirect the output, unless the Actor has a parameter that supports a custom dataset. Otherwise, you can create an "Integration Actor" that redirects the output to a custom dataset.
Thanks for the quick answer. You are redirecting me to the SDK; is there an equivalent method in the REST API? (I have to integrate the calls with existing Java code.) Put another way, I'm asking if there is a way to call/run an Actor specifying an existing dataset. I'm new to Apify, so I don't have a clear picture of the platform at the moment.
If I understand correctly, each run has its own new storage; there is no way to specify an existing one. To do a merge, I need to take every single storage created by a run and put its contents into the overall, previously created remote storage. This is what the SDK does too, I guess.
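For reference, starting a run over the raw REST API and picking up the id of the default dataset it creates might look like the sketch below. It assumes the v2 endpoint `POST /v2/acts/{actorId}/runs` and the `defaultDatasetId` field of the run object; the Actor id and token are placeholders, and a real JSON library should replace the regex extraction:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunActor {
    static final String BASE = "https://api.apify.com/v2";

    // Endpoint that starts a run of an Actor; the run gets a fresh default dataset.
    static String runUrl(String actorId, String token) {
        return BASE + "/acts/" + actorId + "/runs?token=" + token;
    }

    // Crude extraction of "defaultDatasetId" from the run JSON
    // (use a proper JSON library in production code).
    static String extractDefaultDatasetId(String runJson) {
        Matcher m = Pattern
                .compile("\"defaultDatasetId\"\\s*:\\s*\"([^\"]+)\"")
                .matcher(runJson);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws Exception {
        String token = System.getenv("APIFY_TOKEN"); // placeholder: your API token
        String actorId = args[0];                    // placeholder, e.g. "user~my-scraper"
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(runUrl(actorId, token)))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{}")) // Actor input
                .build();
        String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
        System.out.println("default dataset: " + extractDefaultDatasetId(body));
    }
}
```

The `defaultDatasetId` printed at the end is what a later merge step would read from.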
Ok, that makes sense. I don't want to sound rude, but I'm focused on the REST API, not the SDK. As I said, I need to integrate it with existing Java code, so I need to go down to a lower level than the one offered by the SDK. Given the links above, I understand how to create a remote dataset and how to store local data in it. From your answers I gather that there is no way to instruct an Actor to use an existing (custom) dataset: you need to take the data from the default dataset created by the run and move/copy it to the custom dataset.
I can't tell an Actor to use a certain pre-existing dataset when I call it. The Actor instance uses its own default dataset, end of story. Then I can move/copy this dataset into a larger one.
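The move/copy step described above can be sketched against the REST API as follows. This is a minimal sketch, assuming the v2 endpoints `GET /v2/datasets/{id}/items` and `POST /v2/datasets/{id}/items` (POSTing a JSON array appends its elements as individual records); the dataset ids and token are placeholders, and the custom dataset is assumed to have been created once beforehand (e.g. via `POST /v2/datasets?name=...`):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DatasetMerge {
    static final String BASE = "https://api.apify.com/v2";

    // Items of a dataset as a JSON array (GET), or append target (POST).
    static String itemsUrl(String datasetId, String token) {
        return BASE + "/datasets/" + datasetId + "/items?token=" + token + "&format=json";
    }

    // Copy every item of the run's default dataset into the custom dataset.
    static int copyItems(HttpClient client, String fromDatasetId,
                         String toDatasetId, String token) throws Exception {
        // 1. Read all items from the run's default dataset as a JSON array.
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create(itemsUrl(fromDatasetId, token)))
                .GET().build();
        String items = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // 2. Append the whole array to the custom dataset.
        HttpRequest post = HttpRequest.newBuilder()
                .uri(URI.create(itemsUrl(toDatasetId, token)))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(items))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        String token = System.getenv("APIFY_TOKEN"); // placeholder: your API token
        // args[0]: the run's defaultDatasetId; args[1]: id of the custom dataset.
        int status = copyItems(HttpClient.newHttpClient(), args[0], args[1], token);
        System.out.println("push status: " + status);
    }
}
```

Running this once after each of the thousands of small runs would accumulate all records in the one custom dataset, which matches the "order doesn't matter, just append" requirement in the original question.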