deep-jade
deep-jade16mo ago

Memory for only 1 browser is 12 GB? How to ensure cleanup after pages?

Hello, do you have to manually call page.close() or anything else at the end of the defaultHandler?
14 Replies
deep-jade
deep-jadeOP16mo ago
@Saurav Jain pls!
fascinating-indigo
fascinating-indigo16mo ago
No, the pages are closed automatically once the handler function is done executing.
deep-jade
deep-jadeOP16mo ago
@NeoNomade we were just talking about this
NeoNomade
NeoNomade16mo ago
@bmax @Hamza there is probably a built-in process for that, but at a bigger scale there is a big difference in RAM usage between closing the page manually and letting it be handled automatically. For the spiders between 5-10k pages where I tested it, the difference in RAM and speed is huge.
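For context, a minimal sketch of what closing the page manually inside a handler looks like. Crawlee normally closes the page itself once the handler returns, so this is only illustrative; whether an extra page.close() interacts cleanly with the browser pool's own cleanup may depend on the Crawlee version in use.

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request, enqueueLinks }) {
        console.log(`${request.url}: ${await page.title()}`);
        await enqueueLinks();
        // Explicit early cleanup, as discussed above; Crawlee would otherwise
        // close the page after the handler finishes.
        await page.close();
    },
});

await crawler.run(['https://crawlee.dev']);
```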
Lukas Krivka
Lukas Krivka15mo ago
The memory usage should scale with concurrency. If memory is increasing while concurrency is decreasing, that can mean a memory leak either in your code or in Crawlee; we would need a reproduction in that case. Generally, you should be able to run a crawler for days without problems.
deep-jade
deep-jadeOP15mo ago
@Lukas Krivka this is completely not what's happening -- I'm getting "memInfo":{"isOverloaded":true,"limitRatio":0.2,"actualRatio":1} even without running anything. The Snapshotter is always leaking. I don't know how to reproduce it.
Pepa J
Pepa J15mo ago
Hi @bmax , I am sorry but without knowing the website or your code we cannot help more...
NeoNomade
NeoNomade15mo ago
@Pepa J this is related to the OS. The ratio doesn't work properly on all OSs: macOS allows scaling to infinity, and Alpine Linux allows scaling to infinity as well.
Pepa J
Pepa J15mo ago
@NeoNomade If it is a bug, we need to reproduce it to be able to fix it. If anyone can share a reproducible example, we can take a look at why it is happening.
NeoNomade
NeoNomade15mo ago
I'm on holiday, I'll get back to my laptop later tonight. I have hundreds of Crawlee spiders running daily on Alpine (mostly Puppeteer and Cheerio) that try to scale infinitely; if I don't limit max concurrency, AWS Batch kills the jobs automatically. I think the easiest way to reproduce it is to take the demo spider from the Crawlee page and just replace the Dockerfile with an Alpine one, then run the container with limited resources, something like 2 vCPU and 4 GB of RAM. It will be very easy to see that autoscaling fails.
Pepa J
Pepa J15mo ago
@NeoNomade You should be able to change the scaling options with https://crawlee.dev/docs/guides/scaling-crawlers but if that doesn't solve it, let us know more.
Scaling our crawlers | Crawlee
To infinity and beyond! ...within limits
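For reference, a sketch of the knobs the linked scaling guide describes: fixed bounds on concurrency plus a starting point for the autoscaled pool. The option names are taken from the guide; the values are purely illustrative.

```ts
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    minConcurrency: 2,           // never drop below this many parallel requests
    maxConcurrency: 10,          // hard ceiling, regardless of detected resources
    maxRequestsPerMinute: 120,   // overall request-rate throttle
    autoscaledPoolOptions: {
        desiredConcurrency: 4,   // where the pool starts before adjusting within the bounds
    },
    async requestHandler({ page, request }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://crawlee.dev']);
```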
NeoNomade
NeoNomade15mo ago
@Pepa J those parameters are fine. But:
1. By default Crawlee says that if I don't set maxConcurrency it will scale up to the available resources, and I'm telling you that on many OSs it is not able to read the resources. I get containers killed with OOM if I don't set maxConcurrency based on benchmarking I've done myself.
2. On Linux, max requests per minute never works. It outputs a bunch of logs and gets stuck (I'll provide an example in approx 8-10 hours).
3. Also, not related to this, request locking currently doesn't work properly on Linux either.
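One possible workaround for the detection problem described here is to tell Crawlee explicitly how much memory the container has instead of relying on OS detection. The Configuration options memoryMbytes and availableMemoryRatio (or the CRAWLEE_MEMORY_MBYTES environment variable) are my reading of the Configuration docs, not something confirmed in this thread; verify them against the Crawlee version you run.

```ts
import { CheerioCrawler, Configuration } from 'crawlee';

// Assumed knobs: memoryMbytes matches the container's hard memory limit,
// availableMemoryRatio caps how much of it the autoscaled pool may use.
const config = new Configuration({
    memoryMbytes: 4096,
    availableMemoryRatio: 0.8,
});

const crawler = new CheerioCrawler({
    maxConcurrency: 20,   // explicit ceiling on top of autoscaling, as NeoNomade does
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
}, config);

await crawler.run(['https://crawlee.dev']);
```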
Pepa J
Pepa J15mo ago
@NeoNomade Thank you for the explanation. If you could provide a minimal PoC repository where this problem occurs so we can investigate it, that would help a lot.
NeoNomade
NeoNomade15mo ago
I will make a PoC and DM it to you in case I don't find this conversation again.
