aydin · 3mo ago

Crawlee Hybrid Crawler?

I've noticed that I often end up writing the exact same type of crawler: it first uses CheerioCrawler and then falls back to PlaywrightCrawler for failed requests. The only annoying part is the different syntax between Cheerio and Playwright ($ and load for Cheerio vs. page for Playwright). To reuse code, I end up writing a lot of code that looks like this:
...(crawlerType === 'playwright' ? { launchContext: getLaunchContext() } : {}),
Or like:
if (crawlerType === 'cheerio') {
    request.headers = headers;
} else { // playwright crawler
    // Set headers in Playwright context
    await page.setExtraHTTPHeaders(headers);
}
And it got me thinking: why doesn't Crawlee have a generalized crawler for this exact purpose? Something similar to your adaptive crawler, but less opaque. I can't tell why or when the adaptive crawler will use Cheerio. I want ALL requests to start on Cheerio, and only the failed ones (failed according to my own crawling logic, i.e. content I expect to be present in the page is missing) to go to Playwright. Thanks!
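For illustration, the routing rule described above (everything starts on Cheerio, and only requests that fail a page-specific check move on to Playwright) could be sketched roughly like this. This is a minimal sketch and not part of Crawlee: `Attempt`, `Extractor`, `attemptWithCheerio`, and `urlsNeedingPlaywright` are hypothetical names, and the HTML strings stand in for whatever the Cheerio pass actually downloaded.

```typescript
// Hypothetical helpers for a "Cheerio first, Playwright on failure" flow.
// None of these names come from Crawlee; they only model the routing rule.
type CrawlerType = 'cheerio' | 'playwright';

interface Attempt<T> {
    url: string;
    crawlerType: CrawlerType;
    // null means the caller's check did not find the expected content.
    data: T | null;
}

// Caller-supplied check: returns the extracted data, or null when the
// content expected on the page is missing from the static HTML.
type Extractor<T> = (html: string) => T | null;

// Record the outcome of the first (static HTML) pass for one URL.
function attemptWithCheerio<T>(url: string, html: string, extract: Extractor<T>): Attempt<T> {
    return { url, crawlerType: 'cheerio', data: extract(html) };
}

// Only Cheerio attempts whose check failed are handed to the browser pass.
function urlsNeedingPlaywright<T>(attempts: Attempt<T>[]): string[] {
    return attempts
        .filter((a) => a.crawlerType === 'cheerio' && a.data === null)
        .map((a) => a.url);
}

// Example check (an assumption for this sketch): the page must contain an <h1>.
const extractTitle: Extractor<string> = (html) => {
    const match = /<h1>(.*?)<\/h1>/.exec(html);
    return match ? match[1] : null;
};

const attempts = [
    attemptWithCheerio('https://example.com/a', '<h1>Hello</h1>', extractTitle),
    attemptWithCheerio('https://example.com/b', '<div id="root"></div>', extractTitle),
];

// URLs that would be queued for the Playwright pass.
const retryUrls = urlsNeedingPlaywright(attempts);
```

In a real setup, the first pass would presumably live inside a CheerioCrawler request handler, and `retryUrls` would be the input to a separate PlaywrightCrawler run. The point is that the failure condition is explicit and owned by the caller rather than decided internally.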
2 Replies
aydin (OP) · 3mo ago
I'm also tracking this via a GitHub feature request: https://github.com/apify/crawlee/issues/3155
Oleg V. · 3mo ago
Thanks for the idea. The ticket is already in the repo, so the team should look at it soon.
