typical-coral•2y ago

The "teardown" function in node_modules is not being called (it waits indefinitely).

I deployed an AWS Lambda application (a test one in this example). When I run the Lambda, the application works fine: the data is scraped successfully and the scraped data is logged. However, once there are no more jobs (links in the queue), the Lambda doesn't return anything and times out after 30 seconds, even though scraping takes no more than 6 seconds. I debugged the node_modules folder and found that, for some reason, "await this.teardown()" is never invoked in the Lambda. The logs I added inside that function are never printed, and neither is anything that comes after it. When running locally, everything works perfectly.

4 Replies

typical-coralOP•2y ago

My code:

import { PuppeteerCrawler, Configuration } from 'crawlee';
import puppeteer from 'puppeteer-core';
import chromium from '@sparticuz/chromium';

export const testScraper = async (_event: any) => {
  const startUrls = [
    'https://crawlee.dev/docs/introduction/crawling'
  ];
  console.log(' LAUNCH FUNCTION ');
  console.log({ startUrls });
  const crawler = new PuppeteerCrawler(
    {
      requestHandler: async ({ request, page }) => {
        console.log(`Processing ${request.url}...`);
        const name = await page.$eval(
          'header h1',
          (element: any) => {
            return element.textContent;
          },
        );
      
        console.log('Job result', { name });
        console.log(' FINISH HANDLER ');
      },

      launchContext: {
        // useIncognitoPages: true,
        launcher: puppeteer,
        launchOptions: {
          executablePath: await chromium.executablePath(), 
          args: [...chromium.args, '--no-sandbox', '--disable-setuid-sandbox'],
          headless: true,
          defaultViewport: chromium.defaultViewport,
          ignoreHTTPSErrors: true,
        },
      },
    },
    new Configuration({
      persistStorage: false,
    }),
  );

  console.log(' START ');
  await crawler.run(startUrls);
  console.log(' FINISHED ');

  return {
    statusCode: 200,
    body: 'SUCCESS',
  };
};
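
One way to narrow down where the hang occurs is to race the awaited call against a timer, so the handler can log and return before Lambda's own 30-second limit kills it. The sketch below is only a diagnostic aid under that assumption, not a fix for the underlying problem. `withTimeout` is a hypothetical helper (it is not part of Crawlee), and the 20-second budget in the usage comment is an arbitrary value chosen for illustration.

```typescript
// Hypothetical helper, not part of Crawlee: rejects with a labelled error
// if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  // This promise never resolves; it only rejects once the timer fires.
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} did not settle within ${ms} ms`)),
      ms,
    );
  });

  try {
    // Whichever settles first wins: the real work or the timer.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Possible usage inside the handler above (the 20_000 ms budget is an assumption):
//
//   try {
//     await withTimeout(crawler.run(startUrls), 20_000, 'crawler.run');
//     console.log(' FINISHED ');
//   } catch (err) {
//     console.error('Crawler did not finish cleanly', err);
//   }
```

Note that the timeout only stops waiting; it does not cancel the crawler or close the browser, so whatever is blocking the teardown is still running in the background when the handler returns.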

typical-coralOP•2y ago

Part of code in node_modules:

Lukas Krivka•2y ago

Hello, could you please copy/paste this as an issue at https://github.com/apify/crawlee? It will be easier to debug there.

typical-coralOP•2y ago

@Lukas Krivka Hello Lukas! I created the issue several weeks ago. https://github.com/apify/crawlee/issues/2261
