typical-coral, 2y ago

The "teardown" function in node_modules is not being called (it waits indefinitely).

I deployed a test application to AWS Lambda. When I run the Lambda, the scraper itself works: the data is scraped successfully and the scraped data is logged. However, once there are no more jobs (links in the queue), the Lambda returns nothing and times out after 30 seconds, even though scraping takes no more than 6 seconds. I debugged the code in node_modules and found that, for some reason, "await this.teardown()" is never invoked in the Lambda. The logs I added inside that function are never printed, and neither is anything that comes after it. When running locally, everything works perfectly.
4 Replies
typical-coral (OP), 2y ago
My code:
import { PuppeteerCrawler, Configuration } from 'crawlee';
import puppeteer from 'puppeteer-core';
import chromium from '@sparticuz/chromium';

export const testScraper = async (_event: any) => {
    const startUrls = [
        'https://crawlee.dev/docs/introduction/crawling'
    ];
    console.log(' LAUNCH FUNCTION ');
    console.log({ startUrls });
    const crawler = new PuppeteerCrawler(
        {
            requestHandler: async ({ request, page }) => {
                console.log(`Processing ${request.url}...`);
                const name = await page.$eval(
                    'header h1',
                    (element: any) => {
                        return element.textContent;
                    },
                );

                console.log('Job result', { name });
                console.log(' FINISH HANDLER ');
            },

            launchContext: {
                // useIncognitoPages: true,
                launcher: puppeteer,
                launchOptions: {
                    executablePath: await chromium.executablePath(),
                    args: [...chromium.args, '--no-sandbox', '--disable-setuid-sandbox'],
                    headless: true,
                    defaultViewport: chromium.defaultViewport,
                    ignoreHTTPSErrors: true,
                },
            },
        },
        new Configuration({
            persistStorage: false,
        }),
    );

    console.log(' START ');
    await crawler.run(startUrls);
    console.log(' FINISHED ');

    return {
        statusCode: 200,
        body: 'SUCCESS',
    };
};
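Since ' FINISHED ' is never logged, the hang appears to be inside "await crawler.run(startUrls)". A minimal sketch of one way to confirm that and fail fast instead of hitting the 30-second Lambda timeout is to race the call against a timer. The helper name withTimeout and the 20-second budget are hypothetical, not part of Crawlee, and this only diagnoses the hang; it does not fix the teardown itself or close the browser.

```typescript
// Hypothetical helper (not a Crawlee API): rejects if `work` has not
// settled within `ms` milliseconds, otherwise passes its result through.
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`${label} did not settle within ${ms} ms`)),
            ms,
        );
    });
    try {
        // Whichever promise settles first decides the outcome.
        return await Promise.race([work, timeout]);
    } finally {
        // Always clear the timer so it cannot keep the event loop alive.
        clearTimeout(timer);
    }
}

// Assumed usage inside the handler above (budget chosen for illustration):
//   await withTimeout(crawler.run(startUrls), 20_000, 'crawler.run');
```

If the timeout error shows up in the Lambda logs while the handler logs were printed, the crawl itself finished and the stall is in the shutdown path, which matches the observation about teardown.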
typical-coral (OP), 2y ago
Part of the code in node_modules:
[Two image attachments, no description]
Lukas Krivka, 2y ago
Hello, can you please copy/paste this as an issue at https://github.com/apify/crawlee? It will be easier to debug there.
typical-coral (OP), 2y ago
@Lukas Krivka Hello Lukas! I created the issue several weeks ago. https://github.com/apify/crawlee/issues/2261
