xenial-black
xenial-black•4y ago

crawlee eating memory like hell

It's eating 3 GB of memory after running for just 2 days.
xenial-black
xenial-blackOP•4y ago
I think this is the issue:
xenial-black
xenial-blackOP•4y ago
(screenshot attachment, no description)
xenial-black
xenial-blackOP•4y ago
I am creating a new URL every time because it caches them; if I don't do that, it won't scrape duplicate URLs, but I want to re-scrape them from time to time. I asked for a fix for this issue last time but got no response; it would be very helpful if you could help me out this time 🙂
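A minimal sketch of that workaround, assuming a crawler instance and a hypothetical productUrl: Crawlee deduplicates requests by their uniqueKey, which is derived from the URL by default, so overriding it lets the same URL be enqueued again without mutating the URL itself.

await crawler.addRequests([
    {
        url: productUrl,
        // override the default URL-derived dedup key so this URL
        // can be scraped again on the next scheduled run
        uniqueKey: `${productUrl}_${Date.now()}`,
    },
]);

Note, though, that every such request is still persisted in the request queue, which is one way storage and memory can grow across runs.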
adverse-sapphire
adverse-sapphire•4y ago
I'm quite new to this stuff too, but I did a load of digging around. I can't confirm this all works yet, as I'm still experimenting, but I've moved the adding of the URLs into a RequestQueue. This is something you can pass into the crawler when you call its constructor. You can call open() on it statically and pass it an ID, which puts the queue in storage under that ID instead of under default. It looks like so:
// queue ID unique to this run: the domain plus a timestamp
const id = `${parsedUrl.host}_${Date.now()}`;
// opens (or creates) a named queue in storage instead of the default one
const requestQueue = await RequestQueue.open(id);
Pass that to the crawler. Now I can run multiple queues for the same domain multiple times. RequestQueue also has a drop() method, which is supposed to remove it from storage for you (however, I'm not entirely certain I've got this working completely yet). I keep the opened RequestQueue in a requestQueue variable, and once the crawler has finished running, I call requestQueue.drop(). Hopefully this tidies up the storage, but I'm currently still not totally sure. I hope this helps you out a bit!
await crawler.run();
// removes it from storage
await requestQueue.drop();
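Putting those pieces together, a rough sketch of the whole lifecycle (CheerioCrawler and startUrls are stand-ins for whatever crawler class and seed URLs are actually used):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// a named queue per run, so repeated runs don't collide with the default queue
const id = `${parsedUrl.host}_${Date.now()}`;
const requestQueue = await RequestQueue.open(id);
await requestQueue.addRequests(startUrls);

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        // ...scrape the page...
    },
});

await crawler.run();
// drop the queue so its requests don't pile up in storage between runs
await requestQueue.drop();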
Lukas Krivka
Lukas Krivka•4y ago
@CTK WARRIOR This is not enough context to debug this. The code you shared looks fine; the request objects themselves are small unless you hold something big in userData.
xenial-black
xenial-blackOP•4y ago
Yeah, I am passing product info in userData, which contains the name, URL, and a few other details. I am creating a new Crawlee crawler object for each run.
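If the product details are sizable, one alternative (a sketch, where product and productId are hypothetical names from the caller's own data) is to keep userData down to an identifier and look the full record up inside the handler:

await crawler.addRequests([
    {
        url: product.url,
        // keep userData tiny: just an ID, not the whole product record
        userData: { productId: product.id },
    },
]);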
Lukas Krivka
Lukas Krivka•4y ago
I don't see Crawlee itself being memory-hungry. The source of memory usage is usually the browsers, Cheerio parsing, or user data.
xenial-black
xenial-blackOP•4y ago
How can I clean up the memory? I tried this, but there's no change; after running for a few hours, it starts to eat a lot of RAM.
Lukas Krivka
Lukas Krivka•4y ago
You need to delete the references to objects you don't need. There is nothing leaking in Crawlee specifically.
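As a generic illustration (not Crawlee-specific), anything reachable from a long-lived scope, such as a module-level results array, can never be garbage collected until the reference is released; scrapedProducts and persist are hypothetical names here:

// module-level array that grows forever if it is never cleared
let scrapedProducts = [];

async function runCronJob() {
    await crawler.run();
    await persist(scrapedProducts);
    // drop the reference so the GC can reclaim the old results
    scrapedProducts = [];
}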
xenial-black
xenial-blackOP•4y ago
Here is my full code:
xenial-black
xenial-blackOP•4y ago
@Lukas Krivka, can you take a glance at my code and let me know if anything is wrong? Sorry for the ping.
Lukas Krivka
Lukas Krivka•4y ago
So I assume the memory is not getting cleared after each cron run?
xenial-black
xenial-blackOP•4y ago
Yeah, it's 500 MB after 17 hours, and after 2-3 days it's 1 GB.
xenial-black
xenial-black•4y ago
Perhaps you could use profiling tools to watch memory allocation and find memory leaks. To help detect them, you can use this fork of Memwatch (https://github.com/airbnb/node-memwatch). This module is useful because it can emit leak events if it sees the heap grow over 5 consecutive garbage collections. Clinic.js (https://clinicjs.org/) is another tool that helps diagnose and pinpoint Node.js performance issues.
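A sketch of the memwatch approach, assuming the @airbnb/node-memwatch package from the link above:

import memwatch from '@airbnb/node-memwatch';

// emitted when the heap keeps growing over consecutive garbage collections
memwatch.on('leak', (info) => {
    console.error('Possible memory leak:', info);
});

// a HeapDiff compares two heap snapshots around a suspect operation
const hd = new memwatch.HeapDiff();
await crawler.run();
console.log(JSON.stringify(hd.end(), null, 2));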
xenial-black
xenial-blackOP•4y ago
No lead so far. The only option I have is using a child process to run Crawlee.
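For reference, that workaround can look like the sketch below; crawl-job.js is a hypothetical script that performs one crawl and exits, so the OS reclaims all of its memory on exit, leaks included:

import { fork } from 'node:child_process';

// run each scheduled crawl in its own Node.js process
function runCrawlJob() {
    return new Promise((resolve, reject) => {
        const child = fork('./crawl-job.js');
        child.on('exit', (code) =>
            code === 0 ? resolve() : reject(new Error(`crawl exited with code ${code}`))
        );
    });
}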
