sensitive-blue · 2y ago

Need serious help scaling crawlee

I have an ECS instance with 4 vCPU & 16 GB RAM. My scaling options are the following:
maxConcurrency: 200,
maxRequestsPerCrawl: 500,
maxRequestRetries: 2,
requestHandlerTimeoutSecs: 185,
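For reference, a minimal sketch of how options like these plug into a PuppeteerCrawler (the handler body below is a placeholder, not the OP's actual code):

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    maxConcurrency: 200,
    maxRequestsPerCrawl: 500,
    maxRequestRetries: 2,
    requestHandlerTimeoutSecs: 185,
    // Placeholder handler -- the real crawl logic goes here.
    async requestHandler({ request, page, log }) {
        log.info(`Visited ${request.url}: ${await page.title()}`);
    },
});

await crawler.run(['https://example.com']);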
I am starting 4 of these crawlers at a time. Here is a snapshot log:
{"time":"2024-04-15T00:09:08.818Z","level":"INFO","msg":"PuppeteerCrawler:AutoscaledPool: state","currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.106},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":1},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
{"time":"2024-04-15T00:09:08.818Z","level":"INFO","msg":"PuppeteerCrawler:AutoscaledPool: state","currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0.106},"cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":1},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
Can anyone help me identify the correct settings so it is not maxed out?
Saurav Jain · 2y ago
Our team will reply to you soon!
Pepa J · 2y ago
Hi @bmax, did you set the right env variable for memory allocation? https://crawlee.dev/docs/guides/configuration#crawlee_memory_mbytes
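For reference, the memory cap can be set either through the environment or programmatically; a sketch assuming Crawlee v3's Configuration API:

// Set before the process starts:
//   CRAWLEE_MEMORY_MBYTES=13107  (or CRAWLEE_AVAILABLE_MEMORY_RATIO=0.8)
// Programmatic equivalent:
import { Configuration, PuppeteerCrawler } from 'crawlee';

const config = new Configuration({ memoryMbytes: 13107 });
// The second constructor argument overrides the global configuration.
const crawler = new PuppeteerCrawler({ /* ...options... */ }, config);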
sensitive-blue (OP) · 2y ago
@Pepa J do you think that is what's missing here? I do set CRAWLEE_AVAILABLE_MEMORY_RATIO=.80, but how come concurrency is 1 when I have it set to a max of 200?
Pepa J · 2y ago
@bmax It could be many things, like requests being enqueued one by one; I cannot really tell without seeing the code. Does the code work as expected when running locally? @bmax By starting 4 of these crawlers, do you mean 4 separate processes? And so it could take 0.8 × 16 GB = 12.8 GB => 4 × 12.8 GB = 51.2 GB of memory in total?
sensitive-blue (OP) · 2y ago
No, this is one instance of Node on one EC2 instance. I start 4 of them asynchronously, i.e. .run() with separate queues. Thanks for diving deep with me. Happy to answer any other questions. It feels kind of like trial and error to me, but I guess I really don't understand why concurrency isn't hitting at least double digits.
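A sketch of the setup as described: four crawlers with separate named queues, run concurrently in one Node process (queue names and options are illustrative):

import { PuppeteerCrawler, RequestQueue } from 'crawlee';

const crawlers = await Promise.all(
    ['q1', 'q2', 'q3', 'q4'].map(async (name) => {
        const requestQueue = await RequestQueue.open(name);
        return new PuppeteerCrawler({
            requestQueue,
            maxConcurrency: 200,
            async requestHandler({ page }) { /* ... */ },
        });
    }),
);

// All four runs share one event loop and one CPU budget, so each
// autoscaler sees the same overloaded CPU and keeps concurrency low.
await Promise.all(crawlers.map((c) => c.run()));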
Pepa J · 2y ago
So what happens if you try to run it with the env variable set to CRAWLEE_MEMORY_MBYTES=13107, which is about 80% of 16 GB?
sensitive-blue (OP) · 2y ago
@Pepa J Memory doesn't seem to be the biggest issue here; CPU is the one maxing out. But I guess I don't understand the PuppeteerCrawler:AutoscaledPool: state log well enough to know what the bottleneck is. Happy to try that if you think it is the problem, though. I figured it was CPU because: "cpuInfo":{"isOverloaded":true,"limitRatio":0.4,"actualRatio":1}
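For anyone reading the snapshot log: actualRatio is the fraction of recent samples in which that resource was overloaded, and limitRatio is the threshold above which the autoscaler scales down (0.4 for CPU by default). The thresholds can be loosened via autoscaledPoolOptions, though on a genuinely CPU-bound workload that only treats the symptom; a sketch assuming Crawlee's SystemStatusOptions:

const crawler = new PuppeteerCrawler({
    autoscaledPoolOptions: {
        systemStatusOptions: {
            // Default 0.4 -- the cpuInfo "limitRatio" seen in the log.
            maxCpuOverloadedRatio: 0.6,
        },
    },
    // ...other options...
});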
Pepa J · 2y ago
@bmax When you run it locally, does everything work properly?
sensitive-blue (OP) · 2y ago
Will test in a few. I also have huge CPU/mem locally.
Pepa J · 2y ago
I would like to distinguish whether it is a resource, website, or implementation related issue. 🙂
sensitive-blue (OP) · 2y ago
Running locally, which basically gives it unlimited CPU/mem, it seems pretty fast, but the log it returns still shows single digits for concurrency:
","currentConcurrency":7,"desiredConcurrency":8,
","currentConcurrency":7,"desiredConcurrency":8,
Here's an interesting thing from the EC2 instance:
{"time":"2024-04-15T01:36:05.694Z","level":"INFO","msg":"PuppeteerCrawler: Final request statistics:","scraper":"web","url":"https://www.banks.k12.ga.us/apps/news/category/19174?pageIndex=6","place_id":"65a603fbc769fa16f659736a","requestsFinished":501,"requestsFailed":0,"retryHistogram":[501],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6102,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3057100,"requestsTotal":501,"crawlerRuntimeMillis":1697643}
{"time":"2024-04-15T01:36:05.694Z","level":"INFO","msg":"PuppeteerCrawler: Final request statistics:","scraper":"web","url":"https://www.banks.k12.ga.us/apps/news/category/19174?pageIndex=6","place_id":"65a603fbc769fa16f659736a","requestsFinished":501,"requestsFailed":0,"retryHistogram":[501],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":6102,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3057100,"requestsTotal":501,"crawlerRuntimeMillis":1697643}
501 total requests but only 18 per minute? @Pepa J Can I pay you a consulting/hourly fee to check this out with me?
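(Working the numbers from the statistics above: requestTotalDurationMillis / crawlerRuntimeMillis = 3,057,100 / 1,697,643 ≈ 1.8, i.e. the effective average concurrency over the whole run was below 2. At an average of 6,102 ms per request, 1.8 requests in flight finish about (1.8 / 6.1) × 60 ≈ 18 per minute, which matches requestsFinishedPerMinute. Per-request latency is reasonable; concurrency is what never ramps up.)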
Pepa J · 2y ago
If there are no warnings or retries in the log, then it is either a pretty huge website or a deoptimization in the code. We have had bad experiences with some 3rd-party libraries that did synchronous, blocking sorting/transforming of data.
> Can I pay you a consulting/hourly fee to check this out with me?
Unfortunately we do not provide such a service here on Discord. I would mostly advise filling the "dangerous code" areas with timestamped logs, so you can determine what is taking so much time. Or you may run the Actor in headful mode with maxConcurrency: 1 locally and see if you spot anything.
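A sketch of that kind of instrumentation, combined with the headful single-concurrency run (the timed sections are placeholders for whatever the handler actually does):

const crawler = new PuppeteerCrawler({
    maxConcurrency: 1,
    launchContext: { launchOptions: { headless: false } },
    async requestHandler({ request, page, log }) {
        let t = Date.now();
        await page.waitForSelector('body'); // placeholder step
        log.info(`navigation+wait: ${Date.now() - t} ms`, { url: request.url });

        t = Date.now();
        // ...suspected expensive parsing/transformation here...
        log.info(`processing: ${Date.now() - t} ms`, { url: request.url });
    },
});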
sensitive-blue (OP) · 2y ago
Are you thinking event loop lag?
Pepa J · 2y ago
Something like that - probably not a full block, since you are still getting logs - but I don't have deep knowledge there. I know the browsers run in separate processes, but the CDP instructions for the browser are handled on Node's single main thread 🤔 So first I would discover what is causing the blocking - that should be easy if you can reproduce it with maxConcurrency: 1.
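One way to test the event-loop theory directly is Node's built-in delay monitor, run alongside the crawler (the interval and threshold here are arbitrary):

import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
setInterval(() => {
    // Histogram values are in nanoseconds; a sustained p99 in the
    // hundreds of ms points at synchronous work blocking the loop.
    console.log(`event loop delay p99: ${histogram.percentile(99) / 1e6} ms`);
    histogram.reset();
}, 10_000);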
sensitive-blue (OP) · 2y ago
@Pepa J
(screenshot attached)
sensitive-blue (OP) · 2y ago
Any Puppeteer settings you know of that would make Chrome take up less CPU/memory?
launchContext: {
    launcher: puppeteerExtra,
    useIncognitoPages: true,
    launchOptions: {
        executablePath: process.env.CHROMIUM_PATH || undefined,
        headless: 'new',
        args: [
            '--no-sandbox',
            '--disable-setuid-sandbox',
            '--aggressive-cache-discard',
            '--no-zygote',
            '--disable-cache',
            '--disable-application-cache',
            '--disable-offline-load-stale-cache',
            '--disable-gpu-shader-disk-cache',
            '--disable-gpu',
            '--media-cache-size=0',
            '--disk-cache-size=0',
            '--ignore-certificate-errors',
            '--disable-dev-shm-usage',
        ],
    },
},
Pepa J · 2y ago
@bmax I don't think you can save many resources with this... Can you just keep the website open without processing anything? Is it still taking that much CPU?
