strock77 · 3mo ago

Safe to parallelize Dataset writes across processes?

Context:
• Crawlee v3.13.10, Node 22
• Linux (ext4), using storage-local
• Multiple forked workers share one RequestQueueV2 (with request locking)
• Each worker does:

import { Dataset } from 'crawlee';

const dataset = await Dataset.open('default');
await dataset.pushData(item);
Is it safe for N processes to push to the same dataset concurrently with storage-local (no corruption or partial writes)? Are there any guarantees about atomicity or ordering? If it's not recommended, what's the best pattern: per-worker datasets that get merged afterwards? Any flags or settings I should use to make this robust? Thanks!
Miny · 3mo ago
With storage-local on ext4 it's technically safe: pushData writes each item atomically as a complete JSON record, so you won't get corruption or partial writes even when N workers write at once.

What you don't get is strict ordering across processes; writes from different workers can interleave in whatever order the filesystem commits them. If ordering matters, you'd need a coordination layer.

The common pattern is to give each worker its own dataset (default-1, default-2, etc.) and merge at the end; that keeps clean separation and a predictable order per worker (see the sketch below). No extra flags are needed, just make sure all workers share the same storage dir.

If you need stronger guarantees (ordering or locking), you'd usually go with storage-cloud or an external DB instead of storage-local.
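A minimal sketch of that per-worker-then-merge pattern, assuming standard Crawlee v3 APIs (Dataset.open, pushData, and Dataset.forEach) and an ESM entry point (Node 22 with top-level await). WORKER_ID and NUM_WORKERS are hypothetical env vars your fork logic would have to set; adapt the naming to however you spawn workers.

import { Dataset } from 'crawlee';

// --- In each forked worker ---
// WORKER_ID is a hypothetical env var set by the parent when forking.
const workerId = process.env.WORKER_ID ?? '0';
const dataset = await Dataset.open(`default-${workerId}`);
// `item` stands in for whatever your handler scraped.
const item = { url: 'https://example.com', title: 'Example' };
await dataset.pushData(item);

// --- In the parent, after all workers have exited ---
// NUM_WORKERS is likewise a hypothetical env var.
const numWorkers = Number(process.env.NUM_WORKERS ?? '1');
const merged = await Dataset.open('merged');
for (let i = 0; i < numWorkers; i += 1) {
    const part = await Dataset.open(`default-${i}`);
    // forEach walks items in insertion order, so each worker's
    // relative ordering is preserved inside the merged dataset.
    await part.forEach(async (record) => {
        await merged.pushData(record);
    });
}

Merging item by item like this keeps memory flat even for large datasets; if you ever need a global order, add a timestamp field at write time and sort after merging.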
strock77 (OP) · 3mo ago
Thanks for the great explanation. I was thinking of going with a simple Redis or SQLite storage, but since ordering isn't important for me, I'll go with the default dataset. Thanks!
