rubber-blue
Apify & Crawlee · 3y ago
4 replies

To eliminate duplicate results from request retries, do I need to set a timeout between them?

The issue is that when a job fails, it is retried up to maxRequestRetries times. However, when the retried jobs succeed, I end up with multiple identical results in the output, whereas I only need one.

For example: the first job fails and is retried (which is intended), but because it restarts, say, two times, I receive two identical results when I actually need only one.
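One way this can happen (a plain-Node sketch, no Crawlee; all names here are illustrative): if the request handler pushes its result *before* the step that later fails, every failed attempt still leaves a copy behind, and the final successful attempt adds one more.

```javascript
// Simulated retry loop in the style of maxRequestRetries:
// the handler pushes a result first, then fails (e.g. a timeout),
// so each retry pushes the same result again.
const results = [];

function handleRequest(url, attempt) {
  results.push({ url, title: `Title of ${url}` }); // pushed before the failure point
  if (attempt < 2) throw new Error('simulated timeout'); // fails on the first two attempts
}

function runWithRetries(url, maxRetries = 3) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      handleRequest(url, attempt);
      return; // succeeded on this attempt
    } catch {
      // swallow the error and retry
    }
  }
}

runWithRetries('https://example.com');
console.log(results.length); // 3 — two failed attempts plus the successful one
```

If this is the cause, moving the push to the very end of the handler (after everything that can fail) prevents the failed attempts from leaving results behind.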
```javascript
import { Dataset, PuppeteerCrawler, log } from 'crawlee';

export const puppeteerCrawler = async (cbRouterHandler, links) => {
  const crawler = new PuppeteerCrawler({
    minConcurrency: 4,
    maxConcurrency: 20,
    maxRequestRetries: 3, // a failed request is retried up to 3 times
    requestHandlerTimeoutSecs: 30,
    headless: false,
    requestHandler: cbRouterHandler,
    preNavigationHooks: [
      async (crawlingContext, gotoOptions) => {
        gotoOptions.timeout = 15_000;
        gotoOptions.waitUntil = 'networkidle2';
      },
    ],
    // In Crawlee v3 the error is passed as the second argument,
    // not as part of the context object.
    failedRequestHandler({ request }, error) {
      log.error(`Request ${request.url} failed too many times.`, { message: error.message });
    },
  });

  await crawler.run(links);

  await Dataset.exportToJSON('TEST');
};
```
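Until the root cause is fixed, one workaround is to deduplicate the collected items before exporting, keyed on whatever makes a result unique (here the URL). A minimal sketch in plain JavaScript — `dedupeByUrl` is a hypothetical helper, not part of Crawlee's API; in the crawler above it could be applied to the items returned by `Dataset.getData()` before writing them out:

```javascript
// Keep only the first occurrence of each URL.
function dedupeByUrl(items) {
  const seen = new Set();
  return items.filter((item) => {
    if (seen.has(item.url)) return false;
    seen.add(item.url);
    return true;
  });
}

// Example: two identical results produced by a retried job, plus one distinct result.
const items = [
  { url: 'https://example.com/a', title: 'A' },
  { url: 'https://example.com/a', title: 'A' },
  { url: 'https://example.com/b', title: 'B' },
];

const unique = dedupeByUrl(items);
console.log(unique.length); // 2
```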