deep-jade
deep-jade•3y ago

crawlee eating memory like hell

It is eating 3 GB after running for just 2 days
15 Replies
deep-jade
deep-jadeOP•3y ago
I think this is the issue
deep-jade
deep-jadeOP•3y ago
I am creating a new URL every time and it is caching them. If I don't do that, it won't scrape duplicate URLs, but I want to scrape them from time to time. I asked for a fix for this issue last time but got no response. It would be very helpful if you could help me out this time 🙂
apparent-cyan
apparent-cyan•3y ago
I'm quite new to this stuff too, but I did a load of digging around. I can't confirm if this all works yet as I'm still experimenting, but I've moved the adding of the URLs into a RequestQueue. This is something you can pass into the crawler when you call its constructor. You can call open on it statically and also assign it an ID. This will put it in storage under that ID instead of the default. It looks like so:
const id = `${parsedUrl.host}_${Date.now()}`;
const requestQueue = await RequestQueue.open(id);
Pass that to the crawler. Now I can run queues for the same domain multiple times. RequestQueue also has a drop method which is supposed to remove it from storage for you (however, I'm not entirely certain I've got this working completely yet). I'm keeping my stored RequestQueue in a requestQueue variable. Once the crawler has finished running, I call requestQueue.drop(). Hopefully this is tidying up the storage, but I'm still not totally sure. I hope this helps you out a bit!
await crawler.run();
// removes it from storage
await requestQueue.drop();
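Putting those pieces together, a minimal end-to-end sketch of that approach might look like the following (assuming a CheerioCrawler and a parsedUrl variable as in the snippets above; adapt it to whatever crawler class the real code uses):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// Open a uniquely named queue so repeated runs don't reuse the default storage
const id = `${parsedUrl.host}_${Date.now()}`;
const requestQueue = await RequestQueue.open(id);
await requestQueue.addRequest({ url: parsedUrl.href });

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        // ... scrape the page ...
    },
});

await crawler.run();
// Drop the named queue so finished requests don't pile up in storage
await requestQueue.drop();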
Lukas Krivka
Lukas Krivka•3y ago
@CTK WARRIOR This is not enough context to debug this. The code you shared looks fine; the request objects themselves are small unless you hold something big in userData.
deep-jade
deep-jadeOP•3y ago
Yeah, I am passing product info in userData, which contains the name, URL and a few other details. I am creating a new Crawlee object for each run.
Lukas Krivka
Lukas Krivka•3y ago
I don't see Crawlee itself being memory hungry. The source of memory usage is usually the browsers/Cheerio parsing or user data.
deep-jade
deep-jadeOP•3y ago
How can I clean up the memory? I tried this but there was no change; after running for a few hours it starts to eat a lot of RAM.
Lukas Krivka
Lukas Krivka•3y ago
You need to delete the references to objects you don't need. There is nothing leaking in Crawlee specifically.
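For instance, if each request currently carries a whole product object in userData, a leaner variant (just a sketch; product and its fields stand in for whatever the real code passes) is to hand over only the small fields the request handler actually uses:

// Instead of attaching the entire product object to every request...
// await crawler.addRequests([{ url: product.url, userData: { product } }]);

// ...pass only the primitive fields the requestHandler needs,
// so the big object can be garbage-collected between runs
await crawler.addRequests([{
    url: product.url,
    userData: { name: product.name },
}]);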
deep-jade
deep-jadeOP•3y ago
here is my full code
deep-jade
deep-jadeOP•3y ago
@Lukas Krivka can you take a glance at my code and let me know if anything is wrong? Sorry for the ping.
Lukas Krivka
Lukas Krivka•3y ago
So I assume the memory is not getting cleared after each cron run?
deep-jade
deep-jadeOP•3y ago
Yeah, it's 500 MB after 17 hours, and after 2-3 days it's 1 GB.
optimistic-gold
optimistic-gold•3y ago
Perhaps you should use profiling tools to watch memory allocation and find memory leaks. To help detect them, you can use this fork of Memwatch (https://github.com/airbnb/node-memwatch). This module is useful because it can emit leak events if it sees the heap grow over 5 consecutive garbage collections. Clinic.js (https://clinicjs.org/) is another tool to help diagnose and pinpoint Node.js performance issues.
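A minimal sketch of wiring that up (assuming the package is installed as @airbnb/node-memwatch and exposes the leak event and HeapDiff API described in its README; crawler.run() is a placeholder for whatever work is suspected of leaking):

import memwatch from '@airbnb/node-memwatch';

// Fires if the heap keeps growing over several consecutive garbage collections
memwatch.on('leak', (info) => {
    console.error('Possible memory leak:', info);
});

// Diff two heap snapshots around the suspect work to see which objects grew
const heapDiff = new memwatch.HeapDiff();
await crawler.run(); // placeholder for the code suspected of leaking
console.log(JSON.stringify(heapDiff.end(), null, 2));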
deep-jade
deep-jadeOP•3y ago
No lead so far. The only option I have is to use a child process to run Crawlee.
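One way to sketch that child-process approach (the file names parent.js and scrape.js are placeholders, not files from this thread): the long-running parent only schedules runs, and each scrape happens in a forked process whose memory is returned to the OS when it exits.

// parent.js – fork a fresh Node process for each scheduled scrape
import { fork } from 'node:child_process';

function runScrapeInChild(startUrl) {
    return new Promise((resolve, reject) => {
        const child = fork('./scrape.js', [startUrl]);
        child.on('exit', (code) =>
            code === 0 ? resolve() : reject(new Error(`scrape exited with code ${code}`)),
        );
    });
}

// scrape.js would set up and run the crawler once; when it finishes,
// the child exits and all of its memory is freed, keeping the parent small.
await runScrapeInChild('https://example.com');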
