sacred-emerald
Apify & Crawlee · 4y ago
6 replies

External request queue + external result storage, Crawlee as daemon process - how to implement it?

Hi all,
I would like to run Crawlee (specifically PlaywrightCrawler) all the time, even when there are no requests in the request queue. (Crawlee will run on a small Ubuntu box in a datacenter; I can handle all the devops work needed for this.)
The requests/URLs should come from an external message queue (running outside the Node.js process). A Node.js API for reading from the external message queue already exists.
The scraping results should be stored in the same external message queue.

In this configuration, Crawlee is controlled by the external message queue, which provides the URLs to scrape. So there is no breadth/depth crawling, in fact no crawling at all: just scrape the provided URL and return the result.

After reading this forum and running some experiments, my plan is:

1.
I should implement (or can I subclass the existing implementation?) the AutoscaledPool.
Its isFinishedFunction() should return `false`, so the crawler runs as a daemon
even when there are no messages in the Crawlee request queue.

2.
Somewhere (where???) I should poll the external message queue, get a URL, and call crawler.addRequests().

3.
Somewhere at the end of requestHandler(), instead of Dataset.pushData(), I should write
the results back into the external message queue.

4. Maybe there are some other hidden problems? It would be great to know about them in advance )))

P.S. This is my first attempt at writing JS/TS code; I have a Java background,
so I might ask one or two strange JS-related questions, be prepared )))

P.P.S. It seems that what I want to do is more or less similar to this:
Has anyone found a solution to run Crawlee inside a Rest API on demand?