Apify Discord Mirror

Updated 7 months ago

Crawlee memory management

At a glance

The community member is experiencing memory issues with their Playwright crawler: after a few hours it exhausts its memory and runs extremely slowly. The memory usage comes primarily from the Chromium instances, with 27 instances taking 50-100 MB each, and the Node process taking around 500 MB. The system state message indicates that memory is critically overloaded, and the community member is unsure why the AutoscaledPool is not scaling down or clearing up the Chromium instances to improve the memory condition.

In the comments, another community member suggests that the issue may not be related to Crawlee at all, and that the community member was manually opening new Chromium contexts to handle authentication without closing them, causing them to pile up every 5 minutes. This appears to be the solution to the problem.

Additionally, the community members discuss how to implement a custom logger by extending the Crawlee log class and overwriting the 'internal' method.

Hi All,

I have a Playwright crawler that, after a few hours, exhausts its memory and ends up running extremely slowly. I haven't set up any custom logic to manage Crawlee's memory and concurrency, but it was my understanding that in general AutoscaledPool should deal with it anyway?

Most of my memory usage is coming from my Chromium instances. There are currently 27 of them, each taking between 50 and 100 MB. The Node process itself is taking around 500 MB.

Here is my system state message:

Plain Text
{
  "level": "info",
  "service": "AutoscaledPool",
  "message": "state",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4",
  "currentConcurrency": 1,
  "desiredConcurrency": 1,
  "systemStatus": {
    "isSystemIdle": false,
    "memInfo": {
      "isOverloaded": true,
      "limitRatio": 0.2,
      "actualRatio": 1
    },
    "eventLoopInfo": {
      "isOverloaded": false,
      "limitRatio": 0.6,
      "actualRatio": 0.019
    },
    "cpuInfo": {
      "isOverloaded": false,
      "limitRatio": 0.4,
      "actualRatio": 0
    },
    "clientInfo": {
      "isOverloaded": false,
      "limitRatio": 0.3,
      "actualRatio": 0
    }
  }
}


And here is my memory warning message:

Plain Text
{
  "level": "warning",
  "service": "Snapshotter",
  "message": "Memory is critically overloaded. Using 7164 MB of 6065 MB (118%). Consider increasing available memory.",
  "id": "5b83448e57d74571921de06df2d980f2",
  "jobId": "testPayload4"
}

The PC it is running on has 24 GB of RAM, so the 6 GB target makes sense with the default value for maxUsedMemoryRatio being 0.25. The PC also has plenty of RAM available beyond Crawlee, sitting at about 67% usage currently.
Why isn't AutoscaledPool scaling down or otherwise clearing up Chromium instances to improve its memory condition?
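For reference, if the 6 GB ceiling itself is the problem, it can be raised. This is a hedged sketch assuming Crawlee v3's Configuration accepts memoryMbytes and availableMemoryRatio options (verify against the Crawlee version in use):

```typescript
import { Configuration } from 'crawlee';

// Sketch: raising Crawlee's memory ceiling. Option names are assumptions
// based on Crawlee v3's Configuration; check your version's typings.
const config = new Configuration({
  // Either give Crawlee an absolute budget in megabytes...
  memoryMbytes: 12 * 1024,
  // ...or raise the fraction of total RAM it may use. The 0.25
  // default is what yields the ~6 GB limit on a 24 GB machine.
  // availableMemoryRatio: 0.5,
});
```

The same values can reportedly be set via the CRAWLEE_MEMORY_MBYTES environment variable without touching code.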
7 comments
I think I fixed it. I don't think it was anything to do with Crawlee at all. Periodically, I was manually opening a new Chromium context to handle authentication. I wasn't closing those contexts, so they were just piling up every 5 minutes.
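For anyone hitting the same leak, this is a minimal sketch of the fix: always close a manually opened context, e.g. in a try/finally. The refreshAuthCookies function and loginUrl are hypothetical stand-ins for whatever the periodic authentication step does.

```typescript
import { chromium } from 'playwright';

// Hypothetical periodic auth step; the important part is the finally
// block, which guarantees the context is closed even if login throws.
async function refreshAuthCookies(loginUrl: string) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(loginUrl);
    // ...perform the login / token refresh here...
    return await context.cookies();
  } finally {
    await context.close(); // without this, contexts pile up on every run
    await browser.close();
  }
}
```

Each leaked context keeps its Chromium resources alive, which matches the 50-100 MB per instance observed above.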
Out of interest, how did you generate that system state message?
It's just automatic, isn't it? I will double-check if I have anything special. πŸ™‚
Here is my crawler config code:

Plain Text
  const router = createPlaywrightRouter();
  router.addHandler(
    requestLabels.spider,
    await spiderDiscoveryHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.spiderBackTrack,
    await spiderBackTrackHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.article,
    await articleHandlerFactory(container),
  );
  router.addHandler(
    requestLabels.download,
    await downloadHandlerFactory(container),
  );

  const crawlerOptions: PlaywrightCrawlerOptions = {
    launchContext: {
      launcher: chromium,
    },
    requestHandler: router,
    preNavigationHooks: [
      downloadPreNavigationHookFactory(container),
      articleImageInterceptorFactory(container),
    ],
    errorHandler: errorHandlerFactory(container),
    failedRequestHandler: failedRequestHandlerFactory(container),
    maxRequestsPerCrawl:
      body.config.maxRequests > 0 ? body.config.maxRequests : undefined,
    useSessionPool: true,
    log: new cralweeLogger(logger.child('crawlee')),
    persistCookiesPerSession: true,
  };

  const storageClient = new MemoryStorage({
    localDataDirectory: `./storage/${message.messageId}`,
    writeMetadata: true,
    persistStorage: true,
  });

  const crawlerConfig = new Configuration({
    storageClient: storageClient,
    persistStateIntervalMillis: 5000,
    persistStorage: true,
    purgeOnStart: false,
    headless: true,
  });

  const crawler = new PlaywrightCrawler(crawlerOptions, crawlerConfig);


The only key difference is that I made my own logger that hooks into the winston logging I have been using in the wider app.
Oh I want to have my own logger. How did you implement that?
You can extend the Crawlee Log class, override the 'internal' method (iirc), and do whatever you like.
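A hedged sketch of that approach, forwarding Crawlee's log records into winston. The internal() signature is recalled from memory (as the "iirc" above suggests), and the class name CrawleeWinstonLogger is hypothetical; verify both against the @apify/log typings shipped with your Crawlee version.

```typescript
import { Log, LogLevel } from 'crawlee';
import type { Logger as WinstonLogger } from 'winston';

// Sketch: a Crawlee Log subclass that routes everything through winston.
// The internal(level, message, data, exception) signature is an assumption.
class CrawleeWinstonLogger extends Log {
  constructor(private readonly winston: WinstonLogger) {
    super();
  }

  override internal(
    level: LogLevel,
    message: string,
    data?: unknown,
    exception?: unknown,
  ): void {
    switch (level) {
      case LogLevel.ERROR:
        this.winston.error(message, { data, exception });
        break;
      case LogLevel.WARNING:
        this.winston.warn(message, { data });
        break;
      case LogLevel.DEBUG:
        this.winston.debug(message, { data });
        break;
      default:
        this.winston.info(message, { data });
    }
  }
}
```

An instance would then be passed to the crawler via the `log` option, as in the config code shown earlier in the thread.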