unwilling-turquoise, 16mo ago

Blocking network requests with crawlee PuppeteerCrawler

I'm trying to block network requests from specific domains within PuppeteerCrawler but can't get it to work. I'd like to run something like this:
page.on('request', (req) => {
    // Abort any request whose URL contains the keyword; let everything else through
    if (req.url().includes('bouncex')) {
        req.abort();
        return;
    }
    req.continue();
});
But the handler has to be registered before page.goto runs. I tried adding it to preNavigationHooks like so:
preNavigationHooks: [
    async ({ page }, goToOptions) => {
        goToOptions!.waitUntil = "networkidle2";
        goToOptions!.timeout = 3600000;
        await blocker.enableBlockingInPage(page);
        page.on('request', (req) => {
            // Abort any request whose URL contains the keyword; let everything else through
            if (req.url().includes('bouncex')) {
                req.abort();
                return;
            }
            req.continue();
        });
        await page.setViewport(viewportConfig);
    },
],
But this throws "Error: Request is already handled!". Is there a way to do this with PuppeteerCrawler?
3 Replies
ondro_k, 16mo ago
Hey, when you're using multiple intercept handlers, each one needs to check whether the request has already been handled: if (interceptedRequest.isInterceptResolutionHandled()) return; Take a look at this: https://pptr.dev/guides/network-interception#multiple-intercept-handlers-and-asynchronous-resolutions
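To make the suggestion above concrete, here is a minimal sketch of a guarded handler. The `InterceptableRequest` interface and the `makeBlockingHandler` helper are hypothetical names introduced only for illustration; the interface just lists the Puppeteer `HTTPRequest` methods the handler relies on. It assumes request interception is already enabled on the page (in the question, presumably by the blocker).

```typescript
// Hypothetical structural type: only the HTTPRequest methods this sketch uses.
interface InterceptableRequest {
  url(): string;
  isInterceptResolutionHandled(): boolean;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical helper: builds a 'request' listener that aborts any request
// whose URL contains one of the given keywords and continues all others.
function makeBlockingHandler(keywords: string[]) {
  return async (req: InterceptableRequest): Promise<void> => {
    // Another handler may already have resolved this request; touching it
    // again is what triggers "Request is already handled!".
    if (req.isInterceptResolutionHandled()) return;

    if (keywords.some((keyword) => req.url().includes(keyword))) {
      await req.abort();
      return;
    }
    await req.continue();
  };
}

// Usage inside a preNavigationHook (assumption: interception is enabled):
// page.on('request', makeBlockingHandler(['bouncex']));
```

The guard only protects this handler. If another listener on the same page resolves requests without a similar check, it can still throw when it runs after this one, so the linked guide's notes on cooperative intercept mode are worth reading too.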
Lukas Krivka, 15mo ago
Just be aware that request interception disables the browser cache, which makes large crawls perform much worse.
Pepa J, 15mo ago
@kennysmithnanic You can also use the blockRequests method from PuppeteerCrawlingContext:
preNavigationHooks: [
    async ({ blockRequests }) => {
        // Block every request whose URL matches one of these patterns
        await blockRequests({
            urlPatterns: [
                'yandex.ru',
                'google-analytics.com',
            ],
        });
    },
]
