other-emerald
other-emerald•14mo ago

Crawlee memory management

Hi all, I have a Playwright crawler that exhausts its memory after a few hours and ends up running extremely slowly. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but it was my understanding that AutoscaledPool should generally deal with it anyway? Most of my memory usage comes from my Chromium instances. There are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB. Here is my system state message:
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}
and here is my memory warning message
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}
The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense with the default value for maxUsedMemoryRatio being 0.25. The PC also has plenty of RAM available beyond what Crawlee is using, sitting at about 67% usage currently. Why isn't AutoscaledPool scaling down or otherwise cleaning up Chromium instances to improve its memory condition?
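For anyone reading the two messages above, the figures can be reproduced with simple arithmetic. This is only a rough sketch, not Crawlee's code: usedPercent, overloadedRatio and isOverloaded are hypothetical helpers, and the sample-based check is a simplified assumption about how actualRatio is compared with limitRatio.

```typescript
// Figures copied from the Snapshotter warning in this thread.
const usedMb = 7164;
const budgetMb = 6065;

// Hypothetical helper: percentage of the memory budget currently in use.
function usedPercent(used: number, budget: number): number {
  return Math.round((used / budget) * 100);
}

// Hypothetical, simplified overload check: the share of recent memory
// samples that were over budget. The real calculation may weight samples
// differently; this only illustrates the idea.
function overloadedRatio(samples: boolean[]): number {
  if (samples.length === 0) return 0;
  const overloaded = samples.filter((isOver) => isOver).length;
  return overloaded / samples.length;
}

// Memory counts as overloaded once the ratio exceeds the limit.
function isOverloaded(samples: boolean[], limitRatio: number): boolean {
  return overloadedRatio(samples) > limitRatio;
}

console.log(`${usedPercent(usedMb, budgetMb)}% of budget in use`); // 118% of budget in use
console.log(isOverloaded([true, true, true], 0.2)); // true
```

Read this way, "actualRatio": 1 against "limitRatio": 0.2 suggests that every recent memory sample was over budget, and with currentConcurrency already at 1 the pool has presumably scaled down as far as it will go.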
10 Replies
other-emerald
other-emeraldOP•14mo ago
I think I fixed it. I don't think it had anything to do with Crawlee at all. Periodically I was opening a new Chromium context manually to handle authentication. I wasn't closing those contexts, so they were just piling up every 5 minutes.
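One way to make sure a manually opened context always gets closed is to wrap its use in try/finally. This is a minimal sketch, not the code from my app: withContext and ClosableContext are hypothetical names, and the only assumption is that the context object exposes an async close() method, as Playwright's BrowserContext does.

```typescript
// Anything that can be closed asynchronously, e.g. a browser context.
interface ClosableContext {
  close(): Promise<void>;
}

// Hypothetical helper: opens a context, runs the callback, and always
// closes the context afterwards, even if the callback throws.
async function withContext<C extends ClosableContext, T>(
  open: () => Promise<C>,
  use: (context: C) => Promise<T>,
): Promise<T> {
  const context = await open();
  try {
    return await use(context);
  } finally {
    await context.close();
  }
}
```

With Playwright, open could be something like () => browser.newContext(), and the callback would perform the login and return the cookies or token it needs.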
MEE6
MEE6•14mo ago
@Crafty just advanced to level 2! Thanks for your contributions! 🎉
genetic-orange
genetic-orange•14mo ago
Out of interest, how did you generate that system state message @Crafty ?
other-emerald
other-emeraldOP•14mo ago
It's just automatic, isn't it? I will double check whether I have anything special. 🙂 Here is my crawler config code:
// Imports assumed; the original snippet omitted them.
import { Configuration, PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';
import type { PlaywrightCrawlerOptions } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';
import { chromium } from 'playwright';

// Route each request label to its own handler.
const router = createPlaywrightRouter();
router.addHandler(
  requestLabels.spider,
  await spiderDiscoveryHandlerFactory(container),
);
router.addHandler(
  requestLabels.spiderBackTrack,
  await spiderBackTrackHandlerFactory(container),
);
router.addHandler(
  requestLabels.article,
  await articleHandlerFactory(container),
);
router.addHandler(
  requestLabels.download,
  await downloadHandlerFactory(container),
);

const crawlerOptions: PlaywrightCrawlerOptions = {
  launchContext: {
    launcher: chromium,
  },
  requestHandler: router,
  preNavigationHooks: [
    downloadPreNavigationHookFactory(container),
    articleImageInterceptorFactory(container),
  ],
  errorHandler: errorHandlerFactory(container),
  failedRequestHandler: failedRequestHandlerFactory(container),
  maxRequestsPerCrawl:
    body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
  useSessionPool: true,
  // Custom logger that forwards to the app's winston logger.
  log: new cralweeLogger(logger.child('crawlee')),
  persistCookiesPerSession: true,
};

// Keep each job's storage in its own directory.
const storageClient = new MemoryStorage({
  localDataDirectory: `./storage/${message.messageId}`,
  writeMetadata: true,
  persistStorage: true,
});

const crawlerConfig = new Configuration({
  storageClient: storageClient,
  persistStateIntervalMillis: 5000,
  persistStorage: true,
  purgeOnStart: false,
  headless: true,
});

const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);
The only key difference is that I made my own logger that hooks into the winston logging I have been using in the wider app.
genetic-orange
genetic-orange•14mo ago
Oh I want to have my own logger. How did you implement that?
other-emerald
other-emeraldOP•14mo ago
You can extend the Crawlee log class, override the 'internal' method (iirc) and do whatever you like.
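Roughly, the pattern looks like the sketch below. To keep it self-contained, BaseLog is a stand-in I made up for illustration, not Crawlee's actual log class, and ForwardingLog, Sink and LogData are hypothetical names too. The real method signature may differ, so check it against the Crawlee source before copying this.

```typescript
type LogData = Record<string, unknown>;
type Sink = (level: string, message: string, data?: LogData) => void;

// Stand-in base class for illustration only; this is NOT Crawlee's log class.
class BaseLog {
  // Every convenience method funnels through this one method.
  internal(level: string, message: string, data?: LogData): void {
    console.log(`${level.toUpperCase()} ${message}`, data ?? {});
  }

  info(message: string, data?: LogData): void {
    this.internal("info", message, data);
  }

  warning(message: string, data?: LogData): void {
    this.internal("warning", message, data);
  }
}

// Subclass that replaces the output step and forwards to a custom sink.
class ForwardingLog extends BaseLog {
  private readonly sink: Sink;

  constructor(sink: Sink) {
    super();
    this.sink = sink;
  }

  internal(level: string, message: string, data?: LogData): void {
    this.sink(level, message, data);
  }
}

// Usage: collect entries in an array instead of printing them.
const entries: string[] = [];
const log = new ForwardingLog((level, message) => {
  entries.push(`${level}: ${message}`);
});
log.info("state");
log.warning("Memory is critically overloaded.");
console.log(entries);
```

In my case the sink would be the winston child logger, so everything Crawlee logs ends up in the same JSON format as the rest of the app.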
fair-rose
fair-rose•12mo ago
How can I change the maxUsedMemoryRatio? Example: the PC has 16 GB and I want to give Crawlee 8 GB of it, so I want to change the ratio to 0.5.
Oleg V.
Oleg V.•12mo ago
@Arthur Mendes Example:
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestList,
    maxRequestRetries: 10,
    autoscaledPoolOptions: {
        scaleUpStepRatio: 0.2,
        scaleDownStepRatio: 0.2,
        snapshotterOptions: { maxUsedMemoryRatio: 0.95 }, // The default is 0.7.
    },
    requestHandler: router,
});
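A note on how the two ratios seem to interact, since the question was about giving Crawlee 8 GB out of 16 GB. As far as I understand it (treat this as an assumption and check the docs), maxUsedMemoryRatio only sets the overload threshold inside Crawlee's memory budget, while the budget itself comes from the availableMemoryRatio setting (default 0.25) or the absolute memoryMbytes setting in Configuration. The helpers below are hypothetical and only show the arithmetic.

```typescript
// Hypothetical helpers; the function names are not part of Crawlee.

// Memory budget: the share of total system memory the crawler may use.
function memoryBudgetMb(totalMb: number, availableMemoryRatio: number): number {
  return totalMb * availableMemoryRatio;
}

// Overload threshold: the share of that budget that may be in use
// before memory counts as overloaded.
function overloadThresholdMb(budgetMb: number, maxUsedMemoryRatio: number): number {
  return budgetMb * maxUsedMemoryRatio;
}

const totalMb = 16 * 1024; // the 16 GB machine from the question
const budget = memoryBudgetMb(totalMb, 0.5); // 8192 MB, i.e. 8 GB
const threshold = overloadThresholdMb(budget, 0.95);

console.log({ budget, threshold });
```

So to hand Crawlee half of the machine, the setting to change would be availableMemoryRatio (or memoryMbytes for a fixed amount), with maxUsedMemoryRatio deciding how close to that budget it may get before scaling down.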
fair-rose
fair-rose•12mo ago
Thank you!
other-emerald
other-emeraldOP•12mo ago
Personally, I'm running Crawlee in Docker and it uses however much RAM I assign to the container.
