rubber-blue

Memory Leak Issue with High Volume Crawling Using Playwright

Hello Playwright Community,

I am running into a memory management issue in a high-volume web crawling application built on Playwright. The application scans and processes thousands of web pages, but memory usage increases significantly after approximately 2000 URLs have been processed.

Here's a brief overview of our Playwright setup:

new PlaywrightCrawler({
  autoscaledPoolOptions: {
    autoscaleIntervalSecs: 5,
    loggingIntervalSecs: null,
    maxConcurrency: CONFIG.SOURCE_MAX_CONCURRENCY, // 6 in our setup
    minConcurrency: CONFIG.SOURCE_MIN_CONCURRENCY, // 1 in our setup
  },
  browserPoolOptions: {
    operationTimeoutSecs: 5,
    retireBrowserAfterPageCount: 10,
    maxOpenPagesPerBrowser: 5,
    closeInactiveBrowserAfterSecs: 3,
  },
  launchContext: {
    launchOptions: {
      chromiumSandbox: false,
      headless: true,
    },
  },
  requestHandlerTimeoutSecs: 60,
  maxRequestRetries: 3,
  keepAlive: true, // Keeps the crawler alive even if all requests are handled; useful for long-running crawls
  retryOnBlocked: false, // When true, retries a request identified as blocked (e.g., by bot detection); disabled here
  requestHandler: this.requestHandler.bind(this), // Function to handle each request
  failedRequestHandler: this.failedRequestHandler.bind(this), // Function to handle each failed request
})

Although pages are closed after each crawl, memory usage grows by around 400% (to roughly 800 MB), after which Playwright becomes unresponsive. This is puzzling, since we have taken care to manage resources efficiently.
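To put numbers on the growth described above, here is a minimal sketch of how memory could be sampled from inside the crawler process. The helper names (`growthPercent`, `toMb`, `startMemoryLogger`) and the 30-second interval are hypothetical and not part of our actual code; only `process.memoryUsage()`, `setInterval`, and `clearInterval` are Node.js built-ins. Note that `process.memoryUsage()` reports the Node.js process only, so memory held by the Chromium child processes would have to be measured separately.

```javascript
// Hypothetical helpers for illustration; not part of Crawlee or Playwright.

// Percentage growth of `current` relative to `baseline` (both in bytes).
function growthPercent(baseline, current) {
  if (baseline <= 0) throw new RangeError('baseline must be positive');
  return ((current - baseline) / baseline) * 100;
}

// Convert bytes to whole megabytes for log output.
function toMb(bytes) {
  return Math.round(bytes / (1024 * 1024));
}

// Record the RSS at start-up, then log RSS, heap usage, and growth
// relative to that baseline every `intervalMs` milliseconds.
// Returns a function that stops the sampling.
function startMemoryLogger(intervalMs = 30000, log = console.log) {
  const baseline = process.memoryUsage().rss;
  const timer = setInterval(() => {
    const { rss, heapUsed } = process.memoryUsage();
    const growth = growthPercent(baseline, rss).toFixed(0);
    log(`rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB growth=${growth}%`);
  }, intervalMs);
  timer.unref(); // do not keep the process alive just for logging
  return () => clearInterval(timer);
}
```

Calling `const stop = startMemoryLogger();` before the crawler starts and `stop()` on shutdown would show whether the growth is in the Node.js heap (`heapUsed`) or elsewhere in the process (`rss`), which may help narrow down where the leak sits.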

I am looking for insights or suggestions on how to troubleshoot and resolve this memory leak issue. Specifically:

Apify & Crawlee•3y ago
