deep-jade•3y ago
Crawlee eating memory like hell
It's eating 3 GB of memory after running for just 2 days

deep-jadeOP•3y ago
I think this is the issue
deep-jadeOP•3y ago

deep-jadeOP•3y ago
I am creating a new URL every time and it is caching them.
If I don't do that, it won't scrape duplicate URLs, but I want to scrape them again from time to time.
I asked for a fix for this issue last time but got no response.
It would be very helpful if you could help me out this time 🙂
apparent-cyan•3y ago
I'm quite new to this stuff too, but I did a load of digging around. I can't confirm that this all works yet as I'm still experimenting, but I've moved the adding of the URLs into a RequestQueue. This is something you can pass into the crawler when you call its constructor. You can call open on it statically and assign it an ID, which puts it in storage under that ID instead of under default. It looks like so.
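Something roughly like this (a sketch; the queue ID 'my-queue' is just an example name):
```ts
import { RequestQueue } from 'crawlee';

// open (or create) a named queue - it gets stored under this ID
// instead of under 'default'
const requestQueue = await RequestQueue.open('my-queue');
await requestQueue.addRequest({ url: 'https://example.com' });
```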
Pass that to the crawler. Now I can run multiple queues for the same domain multiple times. RequestQueue also has a drop method which is supposed to remove it from storage for you (however, I'm not entirely certain I've got this working completely yet).
I'm keeping my stored RequestQueue in a requestQueue variable, and once the crawler has finished running I call requestQueue.drop(). Hopefully this is tidying up the storage, but I'm currently still not totally sure. I hope this helps you out a bit!
@CTK WARRIOR This is not enough context to debug this. The code you shared looks fine; the request objects themselves are small unless you hold something big in userData.
deep-jadeOP•3y ago
Yeah, I am passing product info in userData, which contains the name, URL, and a few other details.
I am creating a new crawler object for each run.
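Roughly like this (a simplified sketch; the field names are just examples, not my real schema):
```ts
import { CheerioCrawler } from 'crawlee';

// example product - field names are illustrative only
const product = { url: 'https://example.com/p/1', name: 'Widget', price: 9.99 };

const crawler = new CheerioCrawler({
  requestHandler: async ({ request }) => {
    // userData travels with the request
    console.log(request.userData.name);
  },
});

// anything put in userData is kept with the request, so it should stay small
await crawler.run([
  { url: product.url, userData: { name: product.name, price: product.price } },
]);
```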
I don't see Crawlee itself being memory hungry. The sources of memory usage are usually the browsers/Cheerio parsing or user data.
deep-jadeOP•3y ago
How can I clean up the memory?
I tried this but saw no change.
After running for a few hours, it starts to eat a lot of RAM.
You need to delete the references to objects you don't need. There is nothing leaking in Crawlee specifically.
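For example, one common way references pile up (a sketch, not necessarily your code) is collecting every result in a long-lived array instead of persisting it:
```ts
import { CheerioCrawler, Dataset } from 'crawlee';

// leak pattern: a module-level array keeps every scraped item alive
// for the entire process lifetime, across all cron runs
const allProducts: Record<string, unknown>[] = [];

const crawler = new CheerioCrawler({
  requestHandler: async ({ request, $ }) => {
    // allProducts.push({ ... }); // <- this accumulates forever
    // better: persist the item and keep no in-memory reference
    await Dataset.pushData({ url: request.url, title: $('title').text() });
  },
});

await crawler.run(['https://example.com']);
```
Pushing to the Dataset writes items out to storage, so nothing keeps growing on the heap between runs.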
deep-jadeOP•3y ago
here is my full code
deep-jadeOP•3y ago
@Lukas Krivka can you take a glance at my code and let me know if anything is wrong?
Sorry for the ping.
So I assume the memory is not getting cleared after each cron run?
deep-jadeOP•3y ago
yeah yeah
It's at 500 MB after 17 hours, and after 2-3 days it's at 1 GB.
optimistic-gold•3y ago
Perhaps you could use profiling tools to watch memory allocation and find the leaks.
To help you detect memory leaks, you can use this fork of memwatch (https://github.com/airbnb/node-memwatch). This module is useful because it can emit leak events if it sees the heap grow over 5 consecutive garbage collections.
Clinic.js (https://clinicjs.org/) is another tool to help diagnose and pinpoint Node.js performance issues.
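A minimal sketch of the memwatch side (assuming the @airbnb/node-memwatch package name for the fork):
```ts
import * as memwatch from '@airbnb/node-memwatch';

// fires when the heap has grown over 5 consecutive garbage collections
memwatch.on('leak', (info) => {
  console.error('Possible memory leak:', info);
});

// diff the heap around a single crawl to see what is accumulating
const heapDiff = new memwatch.HeapDiff();
// ... run one crawl here ...
const diff = heapDiff.end();
console.log(JSON.stringify(diff.change, null, 2));
```
Clinic.js works as a CLI wrapper instead, e.g. clinic doctor -- node main.js, where main.js is your entry script.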
deep-jadeOP•3y ago
No lead so far.
The only option I have left is using a child process to run Crawlee.
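Something like this, so the OS reclaims all memory when the child exits (a rough sketch; crawl.js is a hypothetical entry script that runs one crawl and exits, and node-cron is just an example scheduler):
```ts
import { fork } from 'node:child_process';
import cron from 'node-cron'; // example scheduler - any cron trigger works

// every hour, run one crawl in its own process; when it exits,
// all of its memory is returned to the OS
cron.schedule('0 * * * *', () => {
  const child = fork('crawl.js'); // hypothetical script: one crawl, then exit
  child.on('exit', (code) => console.log(`crawl finished with code ${code}`));
});
```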