
Crawlee Hybrid Crawler?

I notice that I often end up writing the exact same type of crawler: it first uses CheerioCrawler and then falls back to PlaywrightCrawler for failed requests. The only annoying part is the different syntax between the two ($ and load for Cheerio vs. page for Playwright). For code reuse, I end up writing a lot of code that looks like this:

...(crawlerType === 'playwright' ? { launchContext: getLaunchContext() } : {}),

Or like:

if (crawlerType === 'cheerio') {
    request.headers = headers;
} else { // playwright crawler
    // Set headers in Playwright context
    await page.setExtraHTTPHeaders(headers);
}

And it got me thinking: why doesn't Crawlee have a generalized crawler for this exact purpose? Something similar to your adaptive crawler, but less opaque. I can't tell why or when the adaptive crawler will use Cheerio. I want ALL requests to start on Cheerio, and only the failed ones (failed according to my own crawling logic, meaning the content I expect to be present in the page is missing) to go to Playwright. Thanks!
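
To make the routing I have in mind concrete, here is a rough sketch in plain TypeScript. It is not Crawlee API: `Tier`, `HybridResult`, `runHybrid`, and `isValid` are all hypothetical names, and each tier's `extract` just stands in for whatever a Cheerio or Playwright handler would do. The point is only that every URL starts on the first tier and moves to the next one when the tier throws or when my own validation rejects the result.

```typescript
// Hypothetical sketch of "cheap tier first, fall back on failure".
// None of these names come from Crawlee; they are made up for illustration.
type Extracted = Record<string, string>;

interface Tier {
  name: string;
  // Fetch the page and return the extracted fields; may throw on failure.
  extract(url: string): Promise<Extracted>;
}

interface HybridResult {
  url: string;
  tier: string | null; // null means every tier failed for this URL
  data: Extracted | null;
}

async function runHybrid(
  urls: string[],
  tiers: Tier[],
  isValid: (data: Extracted) => boolean,
): Promise<HybridResult[]> {
  const results: HybridResult[] = [];
  for (const url of urls) {
    let result: HybridResult = { url, tier: null, data: null };
    // Tiers are tried in order, so the first one is always attempted first.
    for (const tier of tiers) {
      try {
        const data = await tier.extract(url);
        // "Failed" is decided by the caller's own logic, not by HTTP status.
        if (isValid(data)) {
          result = { url, tier: tier.name, data };
          break;
        }
      } catch {
        // A thrown error counts as a failure; fall through to the next tier.
      }
    }
    results.push(result);
  }
  return results;
}
```

With a Cheerio-backed tier first and a Playwright-backed tier second, `isValid` would be the check for the content I expect on the page. A real version would presumably hand the failed requests to a second crawler's queue instead of looping sequentially like this, but the decision rule would be the same.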
