xenial-black (2y ago)

Request queue taking too much space on disk?

Hi, my request queue is taking almost as much space on disk as my scraped data (more than 100k scraped items now). The request queue takes almost 700MB. Is there a way to make it take less space on disk? I noticed there's a RequestQueueV2, but I'm not sure what benefits it brings compared to "V1". Somewhat related: is there a way to split files into multiple folders, to avoid having potentially millions of files in a single folder? (This applies to both request queues and datasets.) Any feedback would be appreciated. Thank you!
1 Reply
Oleg V. (2y ago)
RequestQueueV2 offers several improvements over RequestQueueV1, including better performance and lower disk usage. One key difference is how it stores data on disk: in RequestQueueV1, each request is stored as an individual file, which can add up to significant disk usage when dealing with a large number of requests. RequestQueueV2, on the other hand, uses a more efficient storage format, which can reduce disk usage and improve performance.

Otherwise, review and improve your scraping logic. Maybe you can avoid enqueuing so many requests to the request queue in the first place.

Regarding your second question about splitting files into multiple folders to avoid having potentially millions of files in a single folder: Apify doesn't support this out of the box. However, you can achieve it by manually partitioning your data and storing it in separate folders based on some criterion, such as timestamp, category, or any other relevant attribute. For example, by organizing your data into folders based on the year and month of scraping, you can avoid having millions of files in a single folder and improve filesystem performance.

You can use named datasets / request queues for this:
https://crawlee.dev/api/next/core/class/Dataset#open
https://crawlee.dev/api/next/core/class/RequestQueueV2#open
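
To make the year/month idea a bit more concrete, here is a rough TypeScript sketch, not an official Crawlee recipe. The names `partitionName`, `ItemSink` and `createPartitionedPush` are made up for illustration, and the `items` prefix is just an example. The only Crawlee call it assumes is `Dataset.open(name)` from the first link above, and that one appears only in the usage comment so the snippet itself has no dependencies.

```typescript
// Hypothetical helper: builds a storage name such as "items-2024-05"
// from a prefix and the (UTC) year and month of a date.
function partitionName(prefix: string, date: Date): string {
    const year = date.getUTCFullYear();
    const month = date.getUTCMonth() + 1; // getUTCMonth() is zero-based
    const paddedMonth = month < 10 ? `0${month}` : `${month}`;
    return `${prefix}-${year}-${paddedMonth}`;
}

// Minimal shape of a storage that accepts items.
// Assumption: a Crawlee Dataset can be used here through its pushData() method.
interface ItemSink<T> {
    pushData(item: T): Promise<void>;
}

// Hypothetical router: returns a push function that opens each monthly
// partition at most once and reuses it for later items.
function createPartitionedPush<T>(
    prefix: string,
    open: (name: string) => Promise<ItemSink<T>>,
): (item: T, scrapedAt?: Date) => Promise<void> {
    const opened = new Map<string, Promise<ItemSink<T>>>();

    return async (item: T, scrapedAt: Date = new Date()): Promise<void> => {
        const name = partitionName(prefix, scrapedAt);
        let sink = opened.get(name);
        if (!sink) {
            sink = open(name);
            opened.set(name, sink);
        }
        await (await sink).pushData(item);
    };
}

// Usage sketch with Crawlee (assumes `import { Dataset } from 'crawlee'`):
//
//   const pushItem = createPartitionedPush('items', (name) => Dataset.open(name));
//   await pushItem({ url: 'https://example.com', title: 'Example' });
```

Each named dataset gets its own folder in local storage, so the number of files per folder is bounded by how much you scrape in one month rather than by the whole crawl. The same naming scheme could be applied to named request queues, although a crawler usually reads from a single queue, so partitioning queues needs more thought than partitioning datasets.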
