Apify Discord Mirror

Updated 5 months ago

The best way to scale a browser pool across multiple machines.

At a glance

The post asks about running Crawlee in a Docker container and how to manage a cluster of machines. Community members suggest that Crawlee is designed for single-machine use, but running multiple machines is possible. To scale, they recommend splitting URLs and workloads across machines to avoid live synchronization. A community member mentions the Apify devtools-server, but another clarifies that it is for debugging, not scaling.

The discussion then focuses on dynamically adding URLs to a request queue. Community members explain that this should be done using POST/PUT requests, not GET requests. They suggest using the Apify API or SDK to store the request queue in the cloud, which allows multiple containers to access the same queue and dynamically add new requests.

As I understand it, there are no problems running Crawlee in a Docker container where the browsers will work. But what if you need to create a cluster of machines? Is there built-in functionality for managing a browser pool running on different hosts, or do you have any ideas on how to do this?
12 comments
Crawlee's crawlers are designed with the idea of being run on a single machine, but it is definitely more than possible to run multiple machines with a crawler running in each of them. However, things like allocating requests to each container's crawler accordingly and scaling up/down will need to be handled on your end.
Generally, you want to split your URLs (beforehand or dynamically) and hand the workloads out to the machines. If you can avoid live synchronizing, it will save you a lot of trouble.
What did you mean by "avoid live synchronizing"?

  1. The devtools server is for debugging, not scaling.
  2. If you want to scale to multiple servers/machines, the best way to do it is to split the URLs so that the machines are independent of each other, then just merge the data (a minimal sketch of this approach follows below).
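To illustrate that URL-splitting approach, here is a minimal sketch that is not from the original thread: it assumes each container is given hypothetical WORKER_INDEX and WORKER_COUNT environment variables and processes only its own slice of a URL list known up front, so no live synchronization is needed.
Plain Text
import { CheerioCrawler } from 'crawlee';

// Hypothetical convention: each machine gets its slice of the URL list
// via environment variables, so the machines stay independent.
const WORKER_INDEX = Number(process.env.WORKER_INDEX ?? 0); // 0-based index of this machine
const WORKER_COUNT = Number(process.env.WORKER_COUNT ?? 1); // total number of machines

const allUrls = [
    'https://example.com/page/1',
    'https://example.com/page/2',
    'https://example.com/page/3',
    // ...the full list of start URLs, known up front
];

// Keep only every WORKER_COUNT-th URL, offset by this worker's index.
const myUrls = allUrls.filter((_, i) => i % WORKER_COUNT === WORKER_INDEX);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Each machine stores its own results; merge them afterwards.
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(myUrls);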
Can you tell me how to add a URL to the RequestQueue using GET requests?
Like here: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js
Why? It must be a POST/PUT request because you are passing in data.
It doesn't matter to me which HTTP method I use to send the URL. Is there any way to run the crawler and dynamically add URLs to the queue?
If the request queue is stored in the cloud, requests can be added to it dynamically through the Apify API via this endpoint: https://docs.apify.com/api/v2#/reference/request-queues/request-collection/add-request
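As a rough illustration of calling that endpoint directly, the sketch below assumes a hypothetical queue ID and an API token in an APIFY_TOKEN environment variable; check the linked docs for the exact payload fields.
Plain Text
// Hypothetical queue ID; the token comes from the APIFY_TOKEN environment variable.
const QUEUE_ID = 'your-queue-id';
const token = process.env.APIFY_TOKEN;

const response = await fetch(
    `https://api.apify.com/v2/request-queues/${QUEUE_ID}/requests?token=${token}`,
    {
        method: 'POST', // POST, not GET, because the request carries a JSON body
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            url: 'https://example.com/new-page',
            uniqueKey: 'https://example.com/new-page',
        }),
    },
);
console.log(await response.json());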
If, for example, you have two different crawlers running in separate containers that are running off of the same request queue, the queue must be stored in the cloud. In the Apify SDK, this can be done by using the forceCloud option when opening a request queue. https://sdk.apify.com/api/apify/interface/OpenStorageOptions#forceCloud
Then, you won't even need to directly interact with the Apify API at all, and you can just use the SDK.

For example, this request queue will be stored in the cloud (on your Apify account):
Plain Text
const myQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });


And when you add a new request to it like this:
Plain Text
await myQueue.addRequest({ url: 'https://foo.com' });


That request will now be available for processing to any other containers that have also accessed the request queue with the name some-name.
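Putting the two snippets above together, each container could run something along these lines. This is only a sketch, assuming the Apify SDK is used alongside Crawlee and an APIFY_TOKEN is configured so that the named queue is opened on your Apify account.
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Every container opens the same named queue in the cloud, so they all
// consume from (and add to) one shared pool of requests.
const requestQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });
await requestQueue.addRequest({ url: 'https://foo.com' });

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Processing ${request.url}`);
        // Links enqueued here land in the shared queue and become
        // visible to every other container using the same queue name.
        await enqueueLinks();
    },
});

await crawler.run();
await Actor.exit();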