Apify Discord Mirror

Updated 5 months ago

The best way to scale a browser pool across multiple machines.

At a glance

The post asks about running Crawlee in a Docker container and how to manage a cluster of machines. Community members suggest that Crawlee is designed for single-machine use, but running multiple machines is possible. To scale, they recommend splitting URLs and workloads across machines to avoid live synchronization. A community member mentions the Apify devtools-server, but another clarifies that it is for debugging, not scaling.

The discussion then focuses on dynamically adding URLs to a request queue. Community members explain that this should be done using POST/PUT requests, not GET requests. They suggest using the Apify API or SDK to store the request queue in the cloud, which allows multiple containers to access the same queue and dynamically add new requests.

As I understand it, there are no problems running Crawlee in a Docker container where the browsers will work. But what if you need to create a cluster of machines? Is there built-in functionality for managing a browser pool running on different hosts, or do you have any ideas on how to do this?
12 comments
Crawlee's crawlers are designed with the idea of being run on a single machine, but it is definitely more than possible to run multiple machines with a crawler running in each of them. However, things like allocating requests to each container's crawler accordingly and scaling up/down will need to be handled on your end.
Generally, you want to split your URLs (beforehand or dynamically) and hand the workloads out to the machines. If you can avoid live synchronizing, it will save you a lot of trouble.
What did you mean by "avoid live synchronizing"?

  1. The devtools server is for debugging, not scaling.
  2. If you want to scale to multiple servers/machines, the best way to do it is to split the URLs so that the machines are independent of each other, then just merge the data (a minimal sketch of this approach follows below).
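To illustrate that URL-splitting approach, here is a minimal sketch that is not from the original thread: it assumes each container is given hypothetical WORKER_INDEX and WORKER_COUNT environment variables and processes only its own slice of a URL list known up front, so no live synchronization is needed.
Plain Text
import { CheerioCrawler } from 'crawlee';

// Hypothetical convention: each machine gets its slice of the URL list
// via environment variables, so the machines stay independent.
const WORKER_INDEX = Number(process.env.WORKER_INDEX ?? 0); // 0-based index of this machine
const WORKER_COUNT = Number(process.env.WORKER_COUNT ?? 1); // total number of machines

const allUrls = [
    'https://example.com/page/1',
    'https://example.com/page/2',
    'https://example.com/page/3',
    // ...the full list of start URLs, known up front
];

// Keep only every WORKER_COUNT-th URL, offset by this worker's index.
const myUrls = allUrls.filter((_, i) => i % WORKER_COUNT === WORKER_INDEX);

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // Each machine stores its own results; merge them afterwards.
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(myUrls);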
Can you tell me how to add a URL to the RequestQueue using GET requests?
Like here: https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js
Why? It must be a POST/PUT request because you are passing in data.
It doesn't matter to me which HTTP method I use to send the URL. Is there any way to run the crawler and dynamically add URLs to the queue?
If the request queue is stored in the cloud, requests can be added to it dynamically through the Apify API via this endpoint: https://docs.apify.com/api/v2#/reference/request-queues/request-collection/add-request
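As a rough illustration of calling that endpoint directly, the sketch below assumes a hypothetical queue ID and an API token in an APIFY_TOKEN environment variable; check the linked docs for the exact payload fields.
Plain Text
// Hypothetical queue ID; the token comes from the APIFY_TOKEN environment variable.
const QUEUE_ID = 'your-queue-id';
const token = process.env.APIFY_TOKEN;

const response = await fetch(
    `https://api.apify.com/v2/request-queues/${QUEUE_ID}/requests?token=${token}`,
    {
        method: 'POST', // POST, not GET, because the request carries a JSON body
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
            url: 'https://example.com/new-page',
            uniqueKey: 'https://example.com/new-page',
        }),
    },
);
console.log(await response.json());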
If, for example, you have two different crawlers running in separate containers that are running off of the same request queue, the queue must be stored in the cloud. In the Apify SDK, this can be done by using the forceCloud option when opening a request queue. https://sdk.apify.com/api/apify/interface/OpenStorageOptions#forceCloud
Then, you won't even need to directly interact with the Apify API at all, and you can just use the SDK.

For example, this request queue will be stored in the cloud (on your Apify account):
Plain Text
const myQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });


And when you add a new request to it like this:
Plain Text
await myQueue.addRequest({ url: 'https://foo.com' });


That request will now be available for processing to any other containers that have also accessed the request queue with the name some-name.
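Putting the two snippets above together, each container could run something along these lines. This is only a sketch, assuming the Apify SDK is used alongside Crawlee and an APIFY_TOKEN is configured so that the named queue is opened on your Apify account.
Plain Text
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Every container opens the same named queue in the cloud, so they all
// consume from (and add to) one shared pool of requests.
const requestQueue = await Actor.openRequestQueue('some-name', { forceCloud: true });
await requestQueue.addRequest({ url: 'https://foo.com' });

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $, enqueueLinks }) {
        console.log(`Processing ${request.url}`);
        // Links enqueued here land in the shared queue and become
        // visible to every other container using the same queue name.
        await enqueueLinks();
    },
});

await crawler.run();
await Actor.exit();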