xenial-black
xenial-black•4y ago

crawlee eating memory like hell

It's eating 3 GB of memory after running for just 2 days.
xenial-black
xenial-blackOP•4y ago
I think this is the issue:
xenial-black
xenial-blackOP•4y ago
(screenshot attachment, no description)
xenial-black
xenial-blackOP•4y ago
I am creating a new URL every time because it caches them; if I don't do that, it won't scrape duplicate URLs, but I want to re-scrape them from time to time. I asked for a fix for this issue last time but got no response; it would be very helpful if you could help me out this time 🙂
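A minimal sketch of that workaround, assuming a crawler instance and a hypothetical productUrl: Crawlee deduplicates requests by their uniqueKey, which is derived from the URL by default, so overriding it lets the same URL be enqueued again without mutating the URL itself.

await crawler.addRequests([
    {
        url: productUrl,
        // override the default URL-derived dedup key so this URL
        // can be scraped again on the next scheduled run
        uniqueKey: `${productUrl}_${Date.now()}`,
    },
]);

Note, though, that every such request is still persisted in the request queue, which is one way storage and memory can grow across runs.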
adverse-sapphire
adverse-sapphire•4y ago
I'm quite new to this stuff too, but I did a load of digging around. I can't confirm this all works yet, as I'm still experimenting, but I've moved the adding of the URLs into a RequestQueue. This is something you can pass into the crawler when you call its constructor. You can call open() on it statically and pass it an ID, which puts the queue in storage under that ID instead of under default. It looks like so:
// queue ID unique to this run: the domain plus a timestamp
const id = `${parsedUrl.host}_${Date.now()}`;
// opens (or creates) a named queue in storage instead of the default one
const requestQueue = await RequestQueue.open(id);
Pass that to the crawler. Now I can run multiple queues for the same domain multiple times. RequestQueue also has a drop() method, which is supposed to remove it from storage for you (however, I'm not entirely certain I've got this working completely yet). I keep the opened RequestQueue in a requestQueue variable, and once the crawler has finished running, I call requestQueue.drop(). Hopefully this tidies up the storage, but I'm currently still not totally sure. I hope this helps you out a bit!
await crawler.run();
// removes it from storage
await requestQueue.drop();
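Putting those pieces together, a rough sketch of the whole lifecycle (CheerioCrawler and startUrls are stand-ins for whatever crawler class and seed URLs are actually used):

import { CheerioCrawler, RequestQueue } from 'crawlee';

// a named queue per run, so repeated runs don't collide with the default queue
const id = `${parsedUrl.host}_${Date.now()}`;
const requestQueue = await RequestQueue.open(id);
await requestQueue.addRequests(startUrls);

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, $ }) {
        // ...scrape the page...
    },
});

await crawler.run();
// drop the queue so its requests don't pile up in storage between runs
await requestQueue.drop();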
Lukas Krivka
Lukas Krivka•4y ago
@CTK WARRIOR This is not enough context to debug this. The code you shared looks fine; the request objects themselves are small unless you hold something big in userData.
xenial-black
xenial-blackOP•4y ago
Yeah, I am passing product info in userData, which contains the name, URL, and a few other details. I am creating a new Crawlee crawler object for each run.
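If the product details are sizable, one alternative (a sketch, where product and productId are hypothetical names from the caller's own data) is to keep userData down to an identifier and look the full record up inside the handler:

await crawler.addRequests([
    {
        url: product.url,
        // keep userData tiny: just an ID, not the whole product record
        userData: { productId: product.id },
    },
]);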
Lukas Krivka
Lukas Krivka•4y ago
I don't see Crawlee itself being memory-hungry. The source of memory usage is usually the browsers, Cheerio parsing, or user data.
xenial-black
xenial-blackOP•4y ago
How can I clean up the memory? I tried this, but there's no change; after running for a few hours, it starts to eat a lot of RAM.
Lukas Krivka
Lukas Krivka•4y ago
You need to delete the references to objects you don't need. There is nothing leaking in Crawlee specifically.
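As a generic illustration (not Crawlee-specific), anything reachable from a long-lived scope, such as a module-level results array, can never be garbage collected until the reference is released; scrapedProducts and persist are hypothetical names here:

// module-level array that grows forever if it is never cleared
let scrapedProducts = [];

async function runCronJob() {
    await crawler.run();
    await persist(scrapedProducts);
    // drop the reference so the GC can reclaim the old results
    scrapedProducts = [];
}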
xenial-black
xenial-blackOP•4y ago
Here is my full code:
xenial-black
xenial-blackOP•4y ago
@Lukas Krivka, can you take a glance at my code and let me know if anything is wrong? Sorry for the ping.
Lukas Krivka
Lukas Krivka•4y ago
So I assume the memory is not getting cleared after each cron run?
xenial-black
xenial-blackOP•4y ago
Yeah, it's 500 MB after 17 hours, and after 2-3 days it's 1 GB.
xenial-black
xenial-black•4y ago
Perhaps you could use profiling tools to watch memory allocation and find memory leaks. To help detect them, you can use this fork of Memwatch (https://github.com/airbnb/node-memwatch). This module is useful because it can emit leak events if it sees the heap grow over 5 consecutive garbage collections. Clinic.js (https://clinicjs.org/) is another tool that helps diagnose and pinpoint Node.js performance issues.
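A sketch of the memwatch approach, assuming the @airbnb/node-memwatch package from the link above:

import memwatch from '@airbnb/node-memwatch';

// emitted when the heap keeps growing over consecutive garbage collections
memwatch.on('leak', (info) => {
    console.error('Possible memory leak:', info);
});

// a HeapDiff compares two heap snapshots around a suspect operation
const hd = new memwatch.HeapDiff();
await crawler.run();
console.log(JSON.stringify(hd.end(), null, 2));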
xenial-black
xenial-blackOP•4y ago
No lead so far. The only option I have is using a child process to run Crawlee.
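For reference, that workaround can look like the sketch below; crawl-job.js is a hypothetical script that performs one crawl and exits, so the OS reclaims all of its memory on exit, leaks included:

import { fork } from 'node:child_process';

// run each scheduled crawl in its own Node.js process
function runCrawlJob() {
    return new Promise((resolve, reject) => {
        const child = fork('./crawl-job.js');
        child.on('exit', (code) =>
            code === 0 ? resolve() : reject(new Error(`crawl exited with code ${code}`))
        );
    });
}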
