Apify Discord Mirror

Updated last year

Prevent Crawler from adding failed request to default RequestQueue

At a glance
The community member is using a PuppeteerCrawler to scrape product URLs and wants to prevent failed requests from being added to the default RequestQueue. They purposely throw an error when a request fails, expecting the failed request to go back to the RequestList, but it is instead added to the default RequestQueue, which is not the desired behavior. The comments suggest not throwing an error if the same request should not be retried, and instead handling retries inside the scraping logic, since retrying on errors is how blocking is resolved.
Is there a way to prevent the crawler from adding a failed request to the default RequestQueue?

Plain Text
const crawler = new PuppeteerCrawler({
    proxyConfiguration,
    requestHandler: router,
    maxRequestRetries: 25,
    requestList: await RequestList.open(null, [initUrl]),
    requestHandlerTimeoutSecs: 2000,
    maxConcurrency: 1,
}, config);

I'm using the default RequestQueue to add productUrls, and they're being handled inside the defaultRequestHandler, but when some of them fail, I purposely throw an Error, expecting the failed request (which is the initUrl) to go back to the RequestList, but it goes to the default RequestQueue too, which is not what I want.
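For context, when a crawler is given both a RequestList and a RequestQueue, requests from the list are enqueued into the queue before processing, so a retry triggered by a thrown error is reclaimed through the queue rather than the list. Below is a minimal sketch of the setup described above; the handler names, label, and selector are assumptions, not the original code.

Plain Text
import { createPuppeteerRouter } from 'crawlee';

const router = createPuppeteerRouter();

// Default handler for the initUrl: collect product links and push them
// into the default RequestQueue (selector and label are assumptions).
router.addDefaultHandler(async ({ enqueueLinks }) => {
    await enqueueLinks({
        selector: 'a.product-link',
        label: 'PRODUCT',
    });
});

// Product handler: throwing here tells the crawler the request failed,
// so it reclaims the request for another attempt through the default
// RequestQueue until maxRequestRetries is exhausted.
router.addHandler('PRODUCT', async ({ page, request }) => {
    const title = await page.title();
    if (!title) throw new Error(`Failed to scrape ${request.url}`);
});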
2 comments
Do not throw an error if you do not want the same request to be retried; the retry-on-error logic in the scraper exists to resolve blocking through retries.
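A minimal sketch of that advice, assuming the router from above and a hypothetical scrapeProduct() helper: handle the error inside the handler (or retry the scraping logic locally) instead of rethrowing, so the crawler marks the request as handled and does not reclaim it into the default RequestQueue. Crawlee also exposes request.noRetry and a NonRetryableError for cases where a failing request should not be retried at all.

Plain Text
router.addHandler('PRODUCT', async ({ page, request, log }) => {
    try {
        // scrapeProduct() is a hypothetical helper standing in for the
        // real extraction logic.
        await scrapeProduct(page);
    } catch (err) {
        // Swallow the error (or retry in place) so the crawler does not
        // re-enqueue this request.
        log.warning(`Scraping failed for ${request.url}: ${err.message}`);
        // If a retry should never happen for this request, an alternative
        // is to disable retries and rethrow:
        // request.noRetry = true;
        // throw err;
    }
});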