liamk-ultra · Apify & Crawlee · 7mo ago · 14 replies

StorageClients w/ Multiple Crawlers

Hi!

This is my first time using Crawlee, and ... so far, so good. It's working.
However, I noticed it was using the default FileSystemStorageClient and creating files locally on my development machine. That's less than ideal in production.
Changing to MemoryStorageClient revealed some other problems.

I'm running multiple PlaywrightCrawlers asynchronously because I want to process the scraped documents in batches, i.e. one batch per site.
It's also easier to keep things isolated that way. (Each target has its own set of starting URLs, link patterns to enqueue, and selectors to select.)

However, this fails with MemoryStorageClient because the first crawler gets the memory, and subsequent ones generate an error:
Error crawling target dummy2: Service StorageClient is already in use. Existing value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e5ef30>, attempted new value: <crawlee.storage_clients._memory._memory_storage_client.MemoryStorageClient object at 0x335e7c710>.

Upon investigation I discovered the docs saying:

The FileSystemStorageClient is not safe for concurrent access from multiple crawler processes. Use it only when running a single crawler process at a time.

So, even though it appears to be working in some basic tests, I'm not confident this approach will hold up. I don't actually want concurrent access; I want the storage to be separated on a per-crawler basis (or otherwise segmented within the memory or file storage). I'm not opposed to pointing it at '/tmp' in production, but the warning makes me doubt that it would work correctly.
I did try creating multiple memory clients by setting unique queue_id, store_id and dataset_id, but that resulted in the same error.

Is this a limitation, or is there some other way to do what I'm trying to do?

Thanks for your help!
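
A note on the error quoted above: its wording ("already in use", "existing value" vs. "attempted new value") suggests a process-wide registry that accepts one storage client and rejects a second, different instance. The sketch below is a toy model of that behavior in plain Python, not Crawlee's actual code. `ToyServiceLocator`, `ToyMemoryStorageClient`, `crawl_target` and the `dummy1` target are hypothetical names made up for illustration. It shows one pattern that would fit the per-site batching described in the post, assuming the registry really does work this way: create a single client, share it between all crawlers, and keep each target's data apart by name.

```python
import asyncio


class ServiceConflictError(RuntimeError):
    """Raised when a different client is registered after one is already set."""


class ToyServiceLocator:
    """Hypothetical stand-in for a process-wide, set-once service registry."""

    def __init__(self) -> None:
        self._storage_client = None

    def set_storage_client(self, client) -> None:
        # Registering the same instance again is harmless;
        # registering a different instance is treated as a conflict.
        if self._storage_client is not None and self._storage_client is not client:
            raise ServiceConflictError("Service StorageClient is already in use.")
        self._storage_client = client


class ToyMemoryStorageClient:
    """Hypothetical in-memory client that keeps one dataset per name."""

    def __init__(self) -> None:
        self._datasets: dict[str, list[dict]] = {}

    def dataset(self, name: str) -> list[dict]:
        # Create the named dataset on first use, then reuse it.
        return self._datasets.setdefault(name, [])


async def crawl_target(locator, client, target: str, urls: list[str]) -> None:
    # Every crawler registers the SAME shared client, so no conflict arises.
    locator.set_storage_client(client)
    dataset = client.dataset(target)  # isolation comes from the name
    for url in urls:
        await asyncio.sleep(0)  # stands in for fetching and parsing a page
        dataset.append({"url": url})


async def main() -> dict[str, int]:
    locator = ToyServiceLocator()
    shared = ToyMemoryStorageClient()
    targets = {
        "dummy1": ["https://example.com/a", "https://example.com/b"],
        "dummy2": ["https://example.org/x"],
    }
    # Run one crawl per target concurrently, all on the shared client.
    await asyncio.gather(
        *(crawl_target(locator, shared, name, urls) for name, urls in targets.items())
    )
    # Each target's batch can now be processed separately.
    return {name: len(shared.dataset(name)) for name in targets}


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this model, constructing a fresh client inside each crawler reproduces the conflict, while passing one shared instance and using per-target names keeps the batches separate. Whether the same holds for the real `MemoryStorageClient` depends on the Crawlee version in use and should be checked against its storage documentation.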