Apify Discord Mirror

Updated 5 months ago

Best practices/examples of hardening an actor that handles tens of thousands of records?

At a glance

The community member is looking for help with writing actors that can split a dataset into paged collections for batching, with the ability to cap the total records processed and control the page/batch size. The dataset contains records with a URL in one of the keys, and the goal is to fetch and save the images locally, while ensuring the actor can stop and resume without redundant operations. The end result should be a zipped archive with the images nested under directories named based on the identifier key of each record.

The comments suggest the community member is interested in using the RequestQueue feature outside of scrapers, to queue up image URLs for downloading. However, it's unclear if the RequestQueue is intended only for use with playwright/puppeteer/crawlee, or if it can be used more broadly.

There is no explicitly marked answer in the provided information.

DanielDo told me to post this here instead of #chat:

I'm looking for any helpful links/articles/source code for writing actors that split a collection of objects from a dataset into paged collections for batching. I want to support actor input for capping the total number of dataset records that are allowed to be processed, the size of each page/batch, etc.
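
The paging requirement described here can be sketched as plain logic before any SDK calls. The names `maxRecords` and `batchSize` are assumptions taken from the input description above, not an official Apify schema; in an actor, each planned batch would map directly to a `dataset.getData({ offset, limit })` call.

```javascript
// Sketch: plan paged reads of a dataset, capping total records.
// `maxRecords` and `batchSize` are hypothetical actor-input names.
function planBatches(totalInDataset, maxRecords, batchSize) {
  const cap = Math.min(totalInDataset, maxRecords);
  const batches = [];
  for (let offset = 0; offset < cap; offset += batchSize) {
    // Each entry maps to one dataset.getData({ offset, limit }) call.
    batches.push({ offset, limit: Math.min(batchSize, cap - offset) });
  }
  return batches;
}
```

For example, a dataset of 95 records with a cap of 50 and a batch size of 20 yields three reads: offsets 0 and 20 with limit 20, then offset 40 with limit 10.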

The objects retrieved will have a URL in one of their keys that the actor will then fetch and save to the local fs, so I'd like to make sure the actor can stop and resume where it left off without redundant fetches or fs operations.
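
The stop-and-resume requirement boils down to persisting which records are already done and filtering them out on the next run. A minimal sketch, assuming the `identifier` key from the record shape below is unique per record; in an actor, the `done` set would be loaded and saved via the default key-value store (`Actor.getValue`/`Actor.setValue`) rather than held in memory.

```javascript
// Sketch: filter out records handled in a previous run, so a resumed
// actor skips redundant fetches. `done` stands in for state that an
// actor would persist in its key-value store.
function pendingRecords(records, done) {
  return records.filter((record) => !done.has(record.identifier));
}
```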

The end goal is to go from having a dataset with records in the shape of { image: 'https://..../x.png', identifier: 'My Image' } to a zipped archive of all of the images, with each image nested under a parent directory named after the identifier key of its record.
4 comments
So, for a record of { image: 'https://..../x.png', identifier: 'My Image' }

I will end up with an archive that when unzipped, will produce the following:

Plain Text
- Archive
  - My Image
    - x.png
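The mapping from record to in-archive path can be sketched directly from the layout above. The sanitization of the identifier is an assumption about requirements (directory-unsafe characters), not anything the Apify platform mandates.

```javascript
// Sketch: derive the in-archive path for a record, matching the tree
// shown above (identifier as directory, URL basename as file name).
function archivePath(record) {
  // Take the file name from the URL path, e.g. '/a/x.png' -> 'x.png'.
  const file = new URL(record.image).pathname.split('/').pop();
  // Replace characters unsafe in directory names (an assumption).
  const dir = record.identifier.replace(/[/\\:]/g, '_');
  return `${dir}/${file}`;
}
```

So `archivePath({ image: 'https://example.com/imgs/x.png', identifier: 'My Image' })` produces `'My Image/x.png'`, the path the zipping step would write to.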
Anyone? Could really use some help on this. Docs give just enough to spark my interest / mention in passing.
It'd be great if RequestQueue could be used outside of scrapers. Can we use it for queueing up image URLs to download?
Or is it only intended to be passed into playwright/puppeteer/crawlee?
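The property the question is after is RequestQueue's deduplication by `uniqueKey` (the URL by default), which is exactly what repeated image URLs need. As a plain-JS sketch of that semantic, independent of any browser or crawler; in an actor the equivalent calls would be `queue.addRequest({ url })` and `queue.fetchNextRequest()`, and this stand-in class is purely illustrative.

```javascript
// Sketch: an in-memory queue mirroring RequestQueue's dedupe-by-URL
// behavior, to show the pattern itself is not tied to playwright or
// puppeteer. Not persistent; a real RequestQueue also survives restarts.
class ImageQueue {
  constructor() {
    this.seen = new Set(); // URLs ever enqueued (the dedupe set)
    this.pending = [];     // URLs waiting to be downloaded
  }
  add(url) {
    if (this.seen.has(url)) return false; // duplicate, skipped
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }
  next() {
    return this.pending.shift() ?? null; // null when drained
  }
}
```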