rival-black
rival-black • 16mo ago

{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshot

This error is happening consistently, even while only running 1 browser. When I load up the server and look at top. There are a bunch of long-running chrome processes that haven't been killed. top attached.: Error:
{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 16268 MB of 14071 MB (116%). Consider increasing available memory.","scraper":"web","url":"https://www.natronacounty-wy.gov/845/LegalPublic-Notices","place_id":"65a603fac769fa16f6596a8f"}
{"time":"2024-05-20T03:04:41.809Z","level":"WARNING","msg":"PuppeteerCrawler:AutoscaledPool:Snapshotter: Memory is critically overloaded. Using 16268 MB of 14071 MB (116%). Consider increasing available memory.","scraper":"web","url":"https://www.natronacounty-wy.gov/845/LegalPublic-Notices","place_id":"65a603fac769fa16f6596a8f"}
rival-black
rival-black OP • 16mo ago
That top output is with zero browsers currently running. cc @NeoNomade @microworlds
NeoNomade
NeoNomade • 16mo ago
@bmax to debug this, the routes would be needed. For example, even if you have await page.close() at the end of each handler but some process in the handler hangs, it can lead to this. It's hard to debug without the content of the routes.
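A hypothetical sketch of that failure mode, with a timeout guard so a stuck step can't keep the page (and its browser) alive forever. withTimeout and scrapeNotices are illustrative names, not Crawlee APIs:

// Wrap any step that might hang so the handler always finishes.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Inside a route handler (scrapeNotices is a made-up stand-in for real work):
async function handler({ page }) {
  try {
    await withTimeout(scrapeNotices(page), 30_000);
  } finally {
    await page.close(); // still runs when the scrape above hangs and times out
  }
}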
continuing-cyan
continuing-cyan • 16mo ago
@bmax I see you're using Node.js. I would suggest that you kill all active running browsers/child processes (page.close() and browser.close() are not enough, especially when the script hangs). When you launch a browser, get its process ID (in Puppeteer that's browser.process().pid; there is no browser.pid) and manually kill that process when you're done with the browser. You can use this library - https://www.npmjs.com/package/tree-kill. So instead of browser.close(), do:
const kill = require('tree-kill');

// Puppeteer exposes the Chromium child process via browser.process()
const browserPid = browser.process().pid;
kill(browserPid);
Use with caution though, only kill the process when you completely don't need the browser 😅
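Putting the pieces together, a minimal sketch of the whole flow, assuming plain Puppeteer rather than Crawlee's BrowserPool:

const puppeteer = require('puppeteer');
const kill = require('tree-kill');

(async () => {
  const browser = await puppeteer.launch();
  // PID of the Chromium child process that Puppeteer spawned.
  const pid = browser.process().pid;

  try {
    const page = await browser.newPage();
    await page.goto('https://example.com');
    // ... scraping work ...
  } finally {
    // Kill the whole process tree so no orphaned Chrome processes linger,
    // even if browser.close() would have hung.
    kill(pid);
  }
})();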
continuing-cyan
continuing-cyan • 16mo ago
Another option is to use a rather obsolete library, https://github.com/thomasdondorf/puppeteer-cluster. You can control the concurrency, and it efficiently manages all the browsers/pages running on the server. See the example of running an Express server with browsers on it - https://github.com/thomasdondorf/puppeteer-cluster/blob/master/examples/express-screenshot.js 👍 PS: this library is not maintained, but for the most part it gets the job done. 😀
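For reference, a minimal puppeteer-cluster sketch based on that library's documented API (the URL is just a placeholder):

const { Cluster } = require('puppeteer-cluster');

(async () => {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // one browser context per worker
    maxConcurrency: 2, // hard cap on parallel workers
  });

  await cluster.task(async ({ page, data: url }) => {
    await page.goto(url);
    // ... extract data ...
  });

  cluster.queue('https://example.com');
  await cluster.idle(); // wait until the queue is drained
  await cluster.close(); // shuts down every browser/page it manages
})();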
rival-black
rival-black OP • 16mo ago
zzz 💤 @microworlds thanks for checking. How do you get the browser PID from the BrowserPool within Crawlee?
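One approach that should work, assuming the browser-pool hook API as documented around Crawlee v3 (worth double-checking against the docs for your version): register a postLaunchHooks entry via browserPoolOptions and read the PID off the controller's underlying Puppeteer browser:

const { PuppeteerCrawler } = require('crawlee');

const crawler = new PuppeteerCrawler({
  browserPoolOptions: {
    postLaunchHooks: [
      (pageId, browserController) => {
        // browserController.browser should be the Puppeteer Browser instance,
        // so the Chromium PID is reachable the same way as in plain Puppeteer.
        const pid = browserController.browser.process()?.pid;
        console.log(`Launched browser with PID ${pid}`);
      },
    ],
  },
  requestHandler: async ({ page }) => {
    /* ... */
  },
});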
NeoNomade
NeoNomade • 16mo ago
This solution goes against the BrowserPool 🤣
rival-black
rival-black OP • 16mo ago
lmao -- I agree. I'm thinking Crawlee should help manage this, but you gotta do what you gotta do.
NeoNomade
NeoNomade • 16mo ago
I'm 100% sure the issue lies in the routes. I crawled 10 million URLs with a single crawler without this issue 🤣 But I tweaked the routes to be as memory efficient as possible (one common trick is sketched below).
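One common flavor of that tuning (an assumption about what "memory efficient" means here, using plain Puppeteer request interception rather than any Crawlee-specific helper) is to block heavy resources before navigation:

// Call before page.goto(), e.g. from a preNavigationHook.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    const type = request.resourceType();
    if (['image', 'media', 'font', 'stylesheet'].includes(type)) {
      request.abort(); // skip bytes the scraper never reads
    } else {
      request.continue();
    }
  });
}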
continuing-cyan
continuing-cyan • 16mo ago
Ah, I see. None of the examples I gave above use Crawlee. They're probably not suitable for your use case, but I've been using this in several actors (that run on a VPS) in production. Since you are using Crawlee, though, I'd recommend lowering the concurrency until you find the OPTIMAL performance settings (see the sketch below). This will let Crawlee handle the browsers gracefully regardless of how many instances it spawns.
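A minimal sketch of capping concurrency; minConcurrency and maxConcurrency are standard Crawlee crawler options, and the values are just starting points to tune from:

const { PuppeteerCrawler } = require('crawlee');

const crawler = new PuppeteerCrawler({
  minConcurrency: 1,
  maxConcurrency: 2, // start low; raise only while memory stays under the limit
  requestHandler: async ({ page, request }) => {
    /* ... */
  },
});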
Lukas Krivka
Lukas Krivka • 15mo ago
The pages are probably super heavy, so the Crawlee concurrency scaling is not able to keep the memory under the limit. Maybe you could slow down the scaling (see the sketch below). If you want to dig in, it would be better to file a reproduction as a Crawlee GitHub issue, ideally with a log showing how the current and desired concurrency change over time.
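To slow the scaling, the crawler's autoscaledPoolOptions can be tuned. scaleUpStepRatio is a documented AutoscaledPool option (default 0.05); the smaller value below is a hedged starting point, not a recommendation from this thread:

const { PuppeteerCrawler } = require('crawlee');

const crawler = new PuppeteerCrawler({
  maxConcurrency: 4,
  autoscaledPoolOptions: {
    // A smaller step ratio makes the pool add concurrency more gradually,
    // giving the memory snapshotter time to react before overload.
    scaleUpStepRatio: 0.01,
  },
  requestHandler: async ({ page }) => {
    /* ... */
  },
});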
