How to maximize throughput with concurrency settings?
I am deploying Crawlee in my Kubernetes cluster. At first I did not set a maximum (or minimum) number of concurrent tasks, but Crawlee kept taking more and more memory until the pod was killed by the cluster for exceeding its memory limit. Now I have set a conservative maximum, but I see that resources are heavily underutilized at times, depending on the page I am crawling. Is there a correct way to do this that I am not seeing? Or is it possible that there is a bug in how the maximum number of concurrent tasks is determined, so that Crawlee spawns too many of them?
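For context, crawlee-python exposes these knobs through ConcurrencySettings, which is passed to the crawler constructor. A minimal sketch, assuming a recent crawlee-python version (the import paths and the concrete numbers are illustrative, not recommendations):

```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Bounds for the autoscaled pool: Crawlee scales between min and max
    # based on system load, starting from desired_concurrency.
    concurrency_settings = ConcurrencySettings(
        min_concurrency=5,       # never drop below 5 parallel tasks
        desired_concurrency=10,  # starting point for autoscaling
        max_concurrency=50,      # hard upper bound
    )

    crawler = BeautifulSoupCrawler(concurrency_settings=concurrency_settings)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```

Leaving min_concurrency unset keeps the pool free to scale down under memory pressure, while max_concurrency acts as the hard safety cap.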
There is an open issue dedicated to optimizing concurrency, covering the fact that the crawler does not reach optimal concurrency: https://github.com/apify/crawlee-python/issues/1224. It is currently in the backlog, though.
"but crawlee kept on taking more and more memory until the pod was killed by the cluster (took too much memory)"
This seems like an error, as Crawlee should not scale up once RAM usage reaches 80%. The problem may be in your configuration, though. Could you create an issue with the details?
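If the pod is still being OOM-killed, one thing worth checking is whether Crawlee knows the pod's actual memory limit; inside a container it may otherwise size its budget from the node's memory. A minimal sketch, assuming a recent crawlee-python version where Configuration accepts memory_mbytes (also settable via the CRAWLEE_MEMORY_MBYTES environment variable); the 3072 MiB figure is a hypothetical value for a pod with a 4 GiB limit:

```python
from crawlee.configuration import Configuration
from crawlee.crawlers import BeautifulSoupCrawler

# Cap Crawlee's memory budget below the pod's limit so the autoscaled pool
# throttles before Kubernetes OOM-kills the container. The same effect can
# be achieved by exporting CRAWLEE_MEMORY_MBYTES=3072 in the pod spec.
config = Configuration(memory_mbytes=3072)  # hypothetical: pod limit is 4 GiB

crawler = BeautifulSoupCrawler(configuration=config)
```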