equal-aqua
equal-aqua•2y ago

Scraping at scale with an external message queue

Hey, I'm trying to figure out how to scrape at scale with an external message queue. I'm using Redis. An external process publishes URLs to Redis; Crawlee should then pick up these messages, get the plain HTML from each page, and store it in S3. I'm trying to figure out the best way to do this.

1. Right now Crawlee is driven by the in-memory request queue (the native RequestQueue). My idea is to listen for messages from the external queue and add them to the in-memory queue by calling .addRequests(). My concern is that Crawlee will finish its run after the first message in the queue, and I'm not sure how adding new messages later will work. Does it spin up a new browser instance each time I call run? My goal is to avoid creating a new Crawlee instance for each request. Ideally I would like to use a single instance with session management. Is this a viable approach?

2. Should I write the logic for uploading data to S3 somewhere at the end of requestHandler()?
6 Replies
Lukas Krivka
Lukas Krivka•2y ago
2. Yeah, that sounds easiest. You could reimplement Dataset.pushData, but just replacing it with an S3 upload is fine.

1. You could reimplement the queue: there is an interface that the different queues (in-memory, disk-based, Apify) implement, and your implementation would make calls to your system in the background. But that is not really needed if you only want to listen and enqueue. For your case, you really just need to change the default autoscaledPool behavior, which dictates when the crawler should pick up a new job, when it should finish, etc. By default it should already be checking the queue for new requests in a background loop, so that can stay the same. You only need to change isFinishedFunction so that it always returns false: https://crawlee.dev/api/core/interface/AutoscaledPoolOptions
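For anyone reading later, here is a minimal sketch of the isFinishedFunction idea from the reply above. It is a small variation, not exactly what was suggested: instead of always returning false, it returns a flag so the crawler can still be stopped on purpose. The names `shuttingDown` and `requestShutdown` are made up for illustration, and the commented-out wiring assumes a PlaywrightCrawler; any Crawlee crawler that accepts `autoscaledPoolOptions` should be configured the same way.

```typescript
// Flag flipped when the process should wind down (for example from a signal handler).
// Both the flag and requestShutdown() are hypothetical names for this sketch.
let shuttingDown = false;

function requestShutdown(): void {
  shuttingDown = true;
}

// Shaped like the subset of Crawlee's AutoscaledPoolOptions used here.
const autoscaledPoolOptions = {
  // Resolving to false tells the pool that the crawl is not finished,
  // even if the request queue is empty at the moment.
  isFinishedFunction: async (): Promise<boolean> => shuttingDown,
};

// Assumed wiring (not executed in this sketch):
//   const crawler = new PlaywrightCrawler({ autoscaledPoolOptions, requestHandler });
//   await crawler.run();
```

Until `requestShutdown()` is called this behaves like the "always return false" version, so requests added later through `.addRequests()` should still be picked up by the same crawler instance.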
equal-aqua
equal-aquaOP•2y ago
@Lukas Krivka Thanks for the info. I ended up using the default request queue and I just enqueue links that arrive from the external process. In the handler I also have logic to upload the HTML content to S3.
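The actual code was not posted in this thread, so the following is only a rough sketch of the two pieces described above: forwarding URLs from an external queue into the crawler, and uploading the page HTML from the handler. Everything here is an assumption for illustration: `UrlSource`, `RequestSink`, `Upload`, `pumpUrls`, `objectKeyFor` and `makeHtmlHandler` are hypothetical names, the key scheme is arbitrary, and the handler assumes a browser-based crawler whose context exposes `page.content()`.

```typescript
// Hypothetical minimal shapes, so the sketch is not tied to one Redis client or crawler class.
interface UrlSource {
  // Resolves with the next URL, or null if nothing arrived before a timeout.
  // A real source should block while waiting, otherwise the loop below spins.
  next(): Promise<string | null>;
}

interface RequestSink {
  // Assumed to be compatible with the crawler's addRequests() method.
  addRequests(requests: { url: string }[]): Promise<unknown>;
}

// Forward URLs from the external queue into the crawler until shouldStop() returns true.
// Returns how many URLs were forwarded.
async function pumpUrls(
  source: UrlSource,
  sink: RequestSink,
  shouldStop: () => boolean,
): Promise<number> {
  let forwarded = 0;
  while (!shouldStop()) {
    const url = await source.next();
    if (url === null) continue; // nothing arrived, poll again
    if (!/^https?:\/\/\S+$/.test(url)) continue; // skip anything that is not an http(s) URL
    await sink.addRequests([{ url }]);
    forwarded += 1;
  }
  return forwarded;
}

// Hypothetical uploader signature; a real one could wrap an S3 client's put-object call.
type Upload = (key: string, body: string) => Promise<void>;

// One possible key scheme: the percent-encoded URL with an .html suffix.
function objectKeyFor(url: string): string {
  return `${encodeURIComponent(url)}.html`;
}

// Builds a request handler that reads the rendered HTML and hands it to the uploader.
function makeHtmlHandler(upload: Upload) {
  return async (ctx: {
    request: { url: string };
    page: { content(): Promise<string> };
  }): Promise<void> => {
    const html = await ctx.page.content();
    await upload(objectKeyFor(ctx.request.url), html);
  };
}
```

With a Redis list, `next()` could wrap a blocking pop with a timeout, and the crawler itself could be passed as the sink. Keeping the uploader as a plain function makes it easy to swap S3 for something else, or for a fake in tests.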
adverse-sapphire
adverse-sapphire•2y ago
hi @OvidiuS sorry, but have you done this yet?
equal-aqua
equal-aquaOP•2y ago
hey, yes, it's working. I followed the docs and had a look at the Crawlee source to get it done
adverse-sapphire
adverse-sapphire•2y ago
@OvidiuS Cool. Could you share some of the code you used to do this? I'm trying to do the same thing, but it doesn't work. I also have a post here; please take a look when you have a chance: https://discord.com/channels/801163717915574323/1196578445006741524
