magic-amber•2y ago

prevent downloading unneeded resources

I'd like to block tracking pixels and other unneeded files in the Playwright crawler. It seems I can use page.route for that, but in the context of PlaywrightCrawler, would I just put those calls in requestHandler? (Won't that be too late?)

23 Replies

magic-amberOP•2y ago

no one? 😦

robust-apricot•2y ago

It seems like preNavigationHooks is what you're looking for: https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests


Lukas Krivka•2y ago

Just keep in mind that blocking requests with page.route disables the cache and can be counterproductive. There is a better way: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests


magic-amberOP•2y ago

how would you use that in the context of PlaywrightCrawler? I assume you'd be too late calling blockRequests inside the requestHandler

robust-apricot•2y ago

It seems to work when I try this inside the requestHandler, but only on the initial load - images loaded after a click don't seem to be blocked?

Lukas Krivka•2y ago

In preNavigationHooks like @cdslash mentioned.

preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
        const { blockRequests } = crawlingContext;
        await blockRequests();
    },
]
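The context-bound `blockRequests` also accepts an options object. A minimal sketch, assuming the `extraUrlPatterns` option from the linked playwrightUtils docs, which adds patterns on top of the defaults instead of replacing them. The pattern strings and the hook name `blockUnneededResources` are made up for illustration:

```javascript
// Hypothetical tracker patterns; adjust them to the site being crawled.
const EXTRA_PATTERNS = ['google-analytics.com', 'facebook.com/tr'];

// Pre-navigation hook: it runs before the page navigates, so the
// blocking is already in place when the first request goes out.
async function blockUnneededResources(crawlingContext) {
    const { blockRequests } = crawlingContext;
    // No page argument here: the context version already knows its page.
    await blockRequests({ extraUrlPatterns: EXTRA_PATTERNS });
}

// Usage, assuming crawlee is installed:
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [blockUnneededResources],
//     requestHandler: async ({ page }) => { /* ... */ },
// });
```

Keeping the hook as a named function makes it easy to reuse across crawlers and to exercise with a fake context.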


Lukas Krivka•2y ago

Hmm, that's interesting. It might be the limitation of it, I will check with the team - https://github.com/apify/crawlee/blob/435d44eaa77bf419ba15baf6f51ddd963d7bbc46/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L284


Lukas Krivka•2y ago

Google doesn't like extensive docs 😅 https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-setBlockedURLs


robust-apricot•2y ago

It takes true FAANG engineers to write docs that tell you Network.setBlockedURLs "Blocks URLs from loading" 😂 I was actually mistaken: requests for e.g. .jpg are still being blocked after the mouse click. The images still load because they come in as CSS background images whose URLs don't show the .jpg extension, so they don't match the glob pattern.
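To see why an extension-less URL slips through, here is a rough stand-in for wildcard matching. It is not Chrome's actual matcher, only an approximation that assumes `*` is the sole wildcard and is implied at both ends of a pattern. `matchesBlockPattern` and the sample URLs are hypothetical:

```javascript
// Approximate a blocked-URL pattern: escape regex metacharacters,
// turn each '*' into '.*', and allow anything before and after.
function matchesBlockPattern(url, pattern) {
    const escaped = pattern
        .split('*')
        .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    return new RegExp(`.*${escaped}.*`).test(url);
}

const pattern = '.jpg';
// A URL that shows its extension matches the pattern.
console.log(matchesBlockPattern('https://example.com/photos/cat.jpg', pattern)); // true
// An image served from an extension-less URL does not.
console.log(matchesBlockPattern('https://example.com/image?id=123', pattern)); // false
```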

Lukas Krivka•2y ago

Makes sense. That's the limitation of globs.
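When the URL gives no hint, Playwright's `page.route` can decide by resource type instead, since `request.resourceType()` reports `image` regardless of the extension. A sketch, with the caveat raised earlier in the thread that routing disables the cache. The `BLOCKED_TYPES` set and the handler name are illustrative choices:

```javascript
// Resource types to drop; 'image' also covers CSS background images.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Route handler: abort blocked types and let everything else through.
async function abortByResourceType(route) {
    if (BLOCKED_TYPES.has(route.request().resourceType())) {
        await route.abort();
    } else {
        await route.continue();
    }
}

// Usage inside a pre-navigation hook, assuming a Playwright page:
// preNavigationHooks: [
//     async ({ page }) => {
//         await page.route('**/*', abortByResourceType);
//     },
// ],
```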

magic-amberOP•2y ago

Getting blockRequests from crawlingContext always results in this error:

ERROR PlaywrightCrawler: Request failed and reached maximum retries. ArgumentError: Cannot convert object to primitive value in object `options`

regardless of which options are used.

Getting `blockRequests` from the playwrightUtils exported by crawlee does work.

robust-apricot•2y ago

Could you share the code you're using?


magic-amberOP•2y ago

preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
          const { blockRequests, page, addCookies } = crawlingContext;

          if (cookies) {
            await addCookies(cookies);
          }

          await playwrightUtils.blockRequests(page, {
            urlPatterns,
          });

          await page.setViewportSize({
            width: 1280,
            height: 720,
          });
        },
      ],


That is my code now. urlPatterns is an array of strings. If I remove "playwrightUtils" I get the error, no matter what urlPatterns I put in there, but I'm happy to use it from playwrightUtils. It's just a lot of work figuring out where to get what from crawlee in a way that works.

Lukas Krivka•2y ago

For me it works, maybe you just have an old version of Crawlee.

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    preNavigationHooks: [
        async (crawlingContext, gotoOptions) => {
            const { blockRequests } = crawlingContext;
            await blockRequests();
        },
    ],
    requestHandler: async (context) => {
        console.dir(context, { depth: 0 });
    },
});


magic-amberOP•2y ago

I have 3.5.8. What happens when you actually add page as an argument, since that's required?

Lukas Krivka•2y ago

Ah I see, you must not pass page as an argument to the blockRequests from crawlingContext. That was your original bug.

magic-amberOP•2y ago

you have to add that as an argument, as stated in the docs: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests


magic-amberOP•2y ago

the page parameter is not optional

magic-amberOP•2y ago

ah, because the shape of https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests is different
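The two signatures differ only in who supplies the page. Below is a stand-in to illustrate the relationship. It is not Crawlee's source, and `blockRequestsStub` and `bindToPage` are hypothetical names. It also shows the likely cause of the earlier error: when `page` is passed to the context version, it lands in the `options` slot.

```javascript
// Stand-in for playwrightUtils.blockRequests(page, options).
async function blockRequestsStub(page, options = {}) {
    return { page, options };
}

// Stand-in for how a crawling context could expose the same helper
// with the page already filled in.
function bindToPage(page) {
    return (options) => blockRequestsStub(page, options);
}

const page = { name: 'fake page' }; // placeholder object, not a real Playwright page
const blockRequests = bindToPage(page);

// Correct: options only.
blockRequests({ urlPatterns: ['.jpg'] })
    .then((call) => console.log(call.options));

// Mistake from the thread: the page ends up where options are expected.
blockRequests(page, { urlPatterns: ['.jpg'] })
    .then((call) => console.log(call.options === page)); // true
```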


robust-apricot•2y ago

I've been playing around with this a bit more and was surprised that Chromium does not load the entire HTML element when an image URL is blocked. I would have expected to still see the <img> tag for a .jpg even when it's blocked, but the whole element is missing. I thought I could block images this way and still get the src to optionally grab them later, but this pattern does not seem to work since there is no element to get the src from. Is there a better way to optionally get certain images?

Lukas Krivka•2y ago

You could get the naked HTML first and grab the src from there.
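A sketch of that idea on a raw HTML string, assuming the markup was fetched without a browser (for example with Node's built-in `fetch`). The regex is a simplification that only handles quoted `src` attributes, and `extractImageSrcs` is a hypothetical helper. A real HTML parser would be more robust.

```javascript
// Collect <img> src values from raw HTML and resolve them against the page URL.
function extractImageSrcs(html, baseUrl) {
    // Matches a quoted src attribute preceded by whitespace inside an <img> tag,
    // so attributes such as data-src are skipped.
    const imgSrc = /<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/gi;
    return [...html.matchAll(imgSrc)].map((m) => new URL(m[2], baseUrl).href);
}

const html = '<img data-src="x.gif" src="/a.jpg" alt="A">'
    + '<img class="b" src=\'https://cdn.example.com/b.png\'>';

console.log(extractImageSrcs(html, 'https://example.com/gallery'));
// [ 'https://example.com/a.jpg', 'https://cdn.example.com/b.png' ]
```

The resolved URLs can then be queued or downloaded selectively, independent of what the browser was allowed to load.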

stuck-chocolate•2y ago

Hello, has anyone managed to block requests to "optimizationguide-pa.googleapis.com"? Blocking "%google%" requests works fine, but not "optimizationguide-pa.googleapis.com". You cannot even see "optimizationguide-pa.googleapis.com" in Chrome DevTools. Any idea? Thanks, Laurent
