Scraping at scale with an external message queue

Hey, I'm trying to figure out how to scrape at scale with an external message queue.
I'm using Redis. An external process publishes URLs to Redis; Crawlee should pick up these messages, fetch the plain HTML of each page, and store it in S3.
What's the best way to set this up?

  1. Right now Crawlee is driven by the in-memory request queue (the native RequestQueue). My idea is to listen for messages from the external queue and add them to the in-memory queue by calling .addRequests(). My concern is that Crawlee will finish its run after the first message in the queue, and I'm not sure how adding new messages later will work. Does it spin up a new browser instance each time I call run()? My goal is to avoid creating a new Crawlee instance for each request; ideally I'd use a single instance with session management. Is this a viable approach?
  2. And somewhere at the end of requestHandler() I should write the logic for uploading the data to S3?