magic-amber • 2y ago

prevent downloading unneeded resources

I'd like to block tracking pixels and other unneeded files in the Playwright crawler. It seems I can use page.route for that, but in the context of PlaywrightCrawler, would I just put those calls in requestHandler? (Won't that be too late?)
23 Replies
magic-amber (OP) • 2y ago
no one? 😦
Lukas Krivka • 2y ago
Keep in mind that blocking requests with page.route disables the browser cache, which can be counterproductive. There is a better way: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
magic-amber (OP) • 2y ago
How would you use that in the context of PlaywrightCrawler? I assume calling blockRequests inside the requestHandler would be too late.
robust-apricot • 2y ago
It seems to work when I try this inside the requestHandler, but only on the initial load - images loaded after a click don't seem to be blocked?
Lukas Krivka • 2y ago
In preNavigationHooks like @cdslash mentioned.
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { blockRequests } = crawlingContext;
        await blockRequests();
    },
]
Lukas Krivka • 2y ago
GitHub
crawlee/packages/playwright-crawler/src/internals/utils/playwright-...
Crawlee: A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. - apify/crawlee
Lukas Krivka • 2y ago
Chrome DevTools Protocol
Chrome DevTools Protocol - version tot - Network domain
robust-apricot • 2y ago
It takes true FAANG engineers to write docs that tell you Network.setBlockedURLs "Blocks URLs from loading" 😂 I was actually mistaken: it is still blocking requests for e.g. .jpg after the mouse click. The images still appear because they are loaded as a CSS background image whose URL doesn't contain the .jpg extension, so it doesn't match the glob pattern.
Lukas Krivka • 2y ago
Makes sense. That's the limitation of globs.
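To illustrate the limitation just described, here is a minimal sketch of wildcard matching against URLs. It is not Crawlee's or Chromium's actual matcher; `wildcardToRegExp`, `isBlocked`, the patterns, and both URLs are hypothetical names and values chosen for illustration, and it assumes `*` is the only wildcard.

```javascript
// Convert a wildcard pattern such as "*.jpg*" into a RegExp.
// Assumption: "*" matches any run of characters; everything else is literal.
function wildcardToRegExp(pattern) {
  const escaped = pattern
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp(`^${escaped}$`);
}

// Returns true if at least one pattern matches the whole URL.
function isBlocked(url, patterns) {
  return patterns.some((pattern) => wildcardToRegExp(pattern).test(url));
}

const patterns = ['*.jpg*', '*.png*'];

// Hypothetical URLs: one with a file extension, one served without it.
const plainImage = 'https://example.com/photos/cat.jpg';
const cssBackground = 'https://example.com/media/render?id=12345';

console.log(isBlocked(plainImage, patterns));    // true
console.log(isBlocked(cssBackground, patterns)); // false
```

The second URL may well return JPEG data, but a URL-based pattern never sees the response type, so it cannot match.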
magic-amber (OP) • 2y ago
Getting blockRequests from crawlingContext always fails with ERROR PlaywrightCrawler: Request failed and reached maximum retries. ArgumentError: Cannot convert object to primitive value in object options, regardless of which options are used. Getting blockRequests from the playwrightUtils exported by crawlee does work.
robust-apricot • 2y ago
Could you share the code you're using?
magic-amber (OP) • 2y ago
preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { blockRequests, page, addCookies } = crawlingContext;

        if (cookies) {
            await addCookies(cookies);
        }

        await playwrightUtils.blockRequests(page, {
            urlPatterns,
        });

        await page.setViewportSize({
            width: 1280,
            height: 720,
        });
    },
],
That is my code now. urlPatterns is an array of strings. If I remove "playwrightUtils" I get the error, no matter what urlPatterns I put in there, but I'm happy to use it from playwrightUtils. It's just a lot of work figuring out where to get what from Crawlee in a way that works.
Lukas Krivka • 2y ago
It works for me; maybe you just have an old version of Crawlee.
const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            const { blockRequests } = crawlingContext;
            await blockRequests();
        },
    ],
    requestHandler: async (context) => {
        console.dir(context, { depth: 0 });
    },
});
magic-amber (OP) • 2y ago
I have 3.5.8. What happens when you actually add page as an argument, since that's required?
Lukas Krivka • 2y ago
Ah, I see. You must not pass page as an argument to the blockRequests taken from crawlingContext; that was your original bug.
magic-amber (OP) • 2y ago
You have to add that as an argument, as stated in the docs: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
magic-amber (OP) • 2y ago
The page parameter is not optional.
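The disagreement above seems to come from two call shapes: a standalone utility that takes `page` explicitly, and a context helper that already has the page bound. The sketch below is a simplified model of that idea, not Crawlee's source. `createContext`, `ALLOWED_OPTIONS`, `EXAMPLE_DEFAULTS`, the option check, and the fake `page` object are all assumptions made for illustration.

```javascript
// Hypothetical option names and defaults, for illustration only.
const ALLOWED_OPTIONS = ['urlPatterns', 'extraUrlPatterns'];
const EXAMPLE_DEFAULTS = ['.jpg', '.png'];

// Standalone utility shape: the page is the first argument.
async function blockRequests(page, options = {}) {
  const unknown = Object.keys(options).filter((key) => !ALLOWED_OPTIONS.includes(key));
  if (unknown.length > 0) {
    throw new TypeError(`Unexpected option(s): ${unknown.join(', ')}`);
  }
  const { urlPatterns = EXAMPLE_DEFAULTS, extraUrlPatterns = [] } = options;
  // Stand-in for the real blocking call: just record the patterns on the fake page.
  page.blockedPatterns = [...urlPatterns, ...extraUrlPatterns];
}

// Context helper shape: the page is already bound, so callers pass options only.
function createContext(page) {
  return {
    page,
    blockRequests: (options) => blockRequests(page, options),
  };
}

async function demo() {
  const page = { url: 'https://example.com', blockedPatterns: [] }; // fake page object
  const context = createContext(page);

  // Correct: options only.
  await context.blockRequests({ urlPatterns: ['.jpg'] });

  // Incorrect: the page lands in the options slot and fails validation.
  let error = null;
  try {
    await context.blockRequests(page, { urlPatterns: ['.gif'] });
  } catch (err) {
    error = err;
  }
  return { blocked: page.blockedPatterns, error };
}

demo().then(({ blocked, error }) => {
  console.log(blocked);
  console.log(error instanceof TypeError);
});
```

Under this model, both statements in the thread can hold at once: `page` is required for the standalone utility and must be omitted for the context helper.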
robust-apricot • 2y ago
I've been playing around with this a bit more and was surprised that Chromium does not load the entire HTML element when an image URL is blocked. I expected to still see the <img> tag for a .jpg even when the request is blocked, but the whole element is missing. I had hoped to block images this way and still read the src so I could optionally fetch them later, but that doesn't work because there is no element to read the src from. Is there a better way to optionally get certain images?
Lukas Krivka • 2y ago
You could get the naked HTML first and grab the src from there.
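As a rough sketch of that suggestion, the snippet below pulls `src` values out of a raw HTML string. It assumes the HTML was already fetched separately (for example with a plain HTTP request); `extractImageSources`, the sample markup, and the URLs are hypothetical. A regular expression is used only to keep the example dependency-free, and a real HTML parser would be more robust.

```javascript
// Collect the src of every <img> tag in a raw HTML string,
// resolving relative paths against the page URL.
function extractImageSources(html, baseUrl) {
  const sources = [];
  // Matches <img ... src="..."> or <img ... src='...'>; "\s" before src skips data-src.
  const imgTag = /<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi;
  for (const match of html.matchAll(imgTag)) {
    sources.push(new URL(match[2], baseUrl).href);
  }
  return sources;
}

// Hypothetical markup standing in for the fetched page.
const html = `
  <div>
    <img class="hero" src="/images/cat.jpg" alt="Cat">
    <img src='https://cdn.example.com/dog.png'>
  </div>`;

const sources = extractImageSources(html, 'https://example.com/gallery');
console.log(sources);
// Contains 'https://example.com/images/cat.jpg' and 'https://cdn.example.com/dog.png'
```

Since the raw HTML is read without a browser, blocked requests have no effect on it, and the collected URLs can be fetched later only when needed.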
stuck-chocolate • 2y ago
Hello, has anyone managed to block requests to "optimizationguide-pa.googleapis.com"? Blocking "%google%" requests works fine, but not "optimizationguide-pa.googleapis.com". You cannot even see "optimizationguide-pa.googleapis.com" in Chrome DevTools. Any ideas? Thanks, Laurent
