xenial-black
xenial-black•3y ago

Retry using the browser

How can I first try to scrape a page with CheerioCrawler and, if the response is 403 or 401, try it again with PuppeteerCrawler?
5 Replies
Lukas Krivka
Lukas Krivka•3y ago
The easiest way is probably to push the failed requests to an array on the side and then run the PuppeteerCrawler afterwards. You can have more than one crawler inside a single script.
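A minimal sketch of that idea, assuming Crawlee 3 and that both crawlers use the default request queue. The names `toBrowserRequests` and `crawlWithBrowserFallback` are hypothetical, as are the `browser:` key prefix and the `maxRequestRetries` value. Note that `failedRequestHandler` collects every request that runs out of retries, not only 401/403 responses.

```javascript
// Hypothetical helper: turn the collected URLs into requests for the browser crawler.
// Assumption: both crawlers share the default request queue, so each request gets a
// distinct uniqueKey. Without it the queue would treat the URL as already handled.
function toBrowserRequests(failedUrls) {
    return [...new Set(failedUrls)].map((url) => ({ url, uniqueKey: `browser:${url}` }));
}

// Hypothetical wrapper: run CheerioCrawler first, then PuppeteerCrawler on the failures.
async function crawlWithBrowserFallback(startUrls) {
    // Loaded lazily so the helper above can be used without crawlee installed.
    const { CheerioCrawler, PuppeteerCrawler, Dataset } = await import('crawlee');
    const failedUrls = [];

    const cheerioCrawler = new CheerioCrawler({
        maxRequestRetries: 1, // assumption: give up quickly and hand over to the browser
        async requestHandler({ request, $ }) {
            await Dataset.pushData({ title: $('h1').text(), url: request.url });
        },
        // Called once a request has used up all of its retries.
        failedRequestHandler({ request }) {
            failedUrls.push(request.url);
        },
    });
    await cheerioCrawler.run(startUrls);

    if (failedUrls.length === 0) return;

    const puppeteerCrawler = new PuppeteerCrawler({
        async requestHandler({ request, parseWithCheerio }) {
            const $ = await parseWithCheerio();
            await Dataset.pushData({ title: $('h1').text(), url: request.url });
        },
    });
    await puppeteerCrawler.run(toBrowserRequests(failedUrls));
}
```

Calling `await crawlWithBrowserFallback(['https://nowsecure.nl'])` would run both stages in one script.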
xenial-black
xenial-blackOP•3y ago
How do I push the failed requests?
xenial-black
xenial-blackOP•3y ago
What do you think of doing it this way?
import { PuppeteerCrawler, Dataset } from 'crawlee';
import * as cheerio from 'cheerio';

const crawler = new PuppeteerCrawler({
    async requestHandler({ request, sendRequest, parseWithCheerio }) {
        if (request.skipNavigation) {
            // First attempt: plain HTTP request, no browser navigation.
            const { statusCode, body } = await sendRequest();
            if (statusCode === 200) {
                const $ = cheerio.load(body);
                const title = $('h1').text();
                await Dataset.pushData({ title, url: request.url });
            } else {
                // Re-enqueue the same URL for a browser attempt. useExtendedUniqueKey adds the
                // HTTP method and payload to the uniqueKey, so it is not dropped as a duplicate.
                // Maybe there is a keepDuplicateUrls option 🤔
                await crawler.addRequests([{ url: request.url, useExtendedUniqueKey: true }]);
            }
        } else {
            // Second attempt: the browser has loaded the page.
            const $ = await parseWithCheerio();
            const title = $('h1').text();
            await Dataset.pushData({ title, url: request.url });
        }
    },
});

await crawler.run([{ url: 'https://nowsecure.nl', skipNavigation: true }]);
Lukas Krivka
Lukas Krivka•3y ago
The best practice is to just throw an error and the crawler will retry the whole request
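A rough sketch of what throwing could look like, assuming Crawlee 3. The helper `assertNotBlocked`, the wrapper `crawlWithRetries`, the set of blocked status codes, and the `maxRequestRetries` value are all hypothetical choices for illustration. Any error thrown from `requestHandler` marks the attempt as failed, and the crawler retries the request until `maxRequestRetries` is used up.

```javascript
// Assumption for this sketch: these status codes mean the request was blocked.
const BLOCKED_STATUS_CODES = new Set([401, 403]);

// Hypothetical helper: throw so that the crawler treats the attempt as failed.
function assertNotBlocked(statusCode, url) {
    if (BLOCKED_STATUS_CODES.has(statusCode)) {
        throw new Error(`Blocked with status ${statusCode}: ${url}`);
    }
}

// Hypothetical wrapper around a single PuppeteerCrawler.
async function crawlWithRetries(startUrls) {
    // Loaded lazily so the helper above can be used without crawlee installed.
    const { PuppeteerCrawler, Dataset } = await import('crawlee');

    const crawler = new PuppeteerCrawler({
        maxRequestRetries: 3, // assumption: how many times a failed request is retried
        async requestHandler({ request, response, parseWithCheerio }) {
            // response can be missing, for example when navigation was skipped.
            assertNotBlocked(response?.status(), request.url);
            const $ = await parseWithCheerio();
            await Dataset.pushData({ title: $('h1').text(), url: request.url });
        },
    });
    await crawler.run(startUrls);
}
```

Depending on the session pool settings, the crawler may already throw on blocked status codes by itself, in which case the explicit check is only a safeguard.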
