Memory Leak Issue with High Volume Crawling Using Playwright

At a glance
A community member is hitting a memory-management problem in a high-volume web crawling application built with Playwright. After roughly 2,000 URLs, memory usage spikes by about 400% to around 800 MB and Playwright becomes unresponsive. They describe their setup, including the autoscaled pool, browser pool, and request-handling configuration, note that pages are closed after each crawl, and ask for insights on troubleshooting and resolving the leak.
Hello Playwright Community,

I am currently experiencing a challenging issue with memory management in a high-volume web crawling application using Playwright. Our application is designed to scan and process thousands of web pages. However, I've noticed a significant increase in memory usage after processing approximately 2000 URLs.

Here's a brief overview of our Playwright setup:

JavaScript
import { PlaywrightCrawler } from 'crawlee';

new PlaywrightCrawler({
  autoscaledPoolOptions: {
    autoscaleIntervalSecs: 5,
    loggingIntervalSecs: null,
    maxConcurrency: CONFIG.SOURCE_MAX_CONCURRENCY, // here 6
    minConcurrency: CONFIG.SOURCE_MIN_CONCURRENCY, // here 1
  },
  browserPoolOptions: {
    operationTimeoutSecs: 5,
    retireBrowserAfterPageCount: 10, // recycle each browser after 10 pages
    maxOpenPagesPerBrowser: 5,
    closeInactiveBrowserAfterSecs: 3,
  },
  launchContext: {
    launchOptions: {
      chromiumSandbox: false,
      headless: true,
    },
  },
  requestHandlerTimeoutSecs: 60,
  maxRequestRetries: 3,
  keepAlive: true, // keeps the crawler alive even when all requests are handled; useful for long-running crawls
  retryOnBlocked: false, // do not automatically retry requests identified as blocked (e.g., by bot detection)
  requestHandler: this.requestHandler.bind(this), // handles each request
  failedRequestHandler: this.failedRequestHandler.bind(this), // handles each failed request
})
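
For context on how such a configuration is typically driven: because of keepAlive: true, Crawlee's run() promise resolves only after teardown() is called. The sketch below is illustrative, not the original application's code; it assumes the instance above is stored in a variable named crawler and that urls holds the seed list.

JavaScript
// Illustrative usage only -- `crawler` and `urls` are assumed names.
await crawler.addRequests(urls);   // seed the request queue
const finished = crawler.run();    // with keepAlive: true this resolves only after teardown()
// ...more URLs can be enqueued over time via crawler.addRequests(...)...
await crawler.teardown();          // stop the long-running crawl when done
await finished;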


Despite ensuring that pages are closed after each crawl, memory usage spikes by around 400% (to roughly 800 MB) and Playwright then becomes unresponsive. This behavior is puzzling, as we've taken care to manage resources efficiently.
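
For illustration, a request handler of the shape below avoids the most common source of such leaks: references retained after the handler finishes. This is a minimal sketch, not the application's actual requestHandler; PlaywrightCrawler closes the page itself once the handler resolves, so the key point is persisting results rather than accumulating them in memory.

JavaScript
import { Dataset } from 'crawlee';

// Minimal sketch of a leak-conscious handler (not the original code).
async function requestHandler({ request, page }) {
  const title = await page.title();                    // extract only what is needed
  await Dataset.pushData({ url: request.url, title }); // persist instead of accumulating in memory
  // Nothing page-related escapes this scope, so GC can reclaim it;
  // Crawlee closes `page` automatically after the handler resolves.
}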

I am looking for insights or suggestions on how to troubleshoot and resolve this memory leak issue.
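
One way to start narrowing this down (an illustrative sketch, not from the original post) is to log heap usage on an interval using only Node's built-in process.memoryUsage() and correlate the growth with crawl progress; the 2,000-URL mark should line up with a visible jump in the log.

JavaScript
// Periodic memory logging to correlate heap growth with crawl progress.
const started = Date.now();
setInterval(() => {
  const { heapUsed, rss } = process.memoryUsage();
  console.log(
    `[mem] +${Math.round((Date.now() - started) / 1000)}s ` +
    `heapUsed=${(heapUsed / 1024 / 1024).toFixed(1)}MB ` +
    `rss=${(rss / 1024 / 1024).toFixed(1)}MB`
  );
}, 30_000).unref(); // unref() so the timer doesn't keep the process alive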