magic-amber•2y ago
prevent downloading unneeded resources
I'd like to block tracking pixels and other unneeded files in the Playwright crawler. It seems I can use `page.route` for that, but in the context of `PlaywrightCrawler`, would I just put those in `requestHandler`? (Won't that be too late?)
23 Replies
magic-amber OP•2y ago
no one?
robust-apricot•2y ago
It seems like `preNavigationHooks` is what you're looking for: https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests
Just keep in mind blocking requests like this disables cache and can get counterproductive. There is a better way: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
playwrightUtils | API | Crawlee
A namespace that contains various utilities for Playwright - the headless Chrome Node API.
Example usage:
```javascript
import { launchPlaywright, playwrightUtils } from 'crawlee';
// Navigate to https://www.example.com in Playwright with a POST request
const browser = await launchPlaywright();
c...
```
magic-amber OP•2y ago
how would you use that in the context of PlaywrightCrawler? I assume you'd be too late calling blockRequests inside the requestHandler
robust-apricot•2y ago
It seems to work when I try this inside the requestHandler, but only on the initial load - images loaded after a click don't seem to be blocked?
In preNavigationHooks like @cdslash mentioned.
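A minimal sketch of that suggestion, assuming the `blockRequests` helper on the crawling context accepts an options object with `urlPatterns` (as the linked docs describe). The pattern list and the hook name `blockUnneededResources` are hypothetical, and the crawler wiring is left in comments so the snippet does not depend on an installed package:

```javascript
// Hypothetical pattern list; adjust to the resources you want to skip.
const urlPatterns = ['.jpg', '.png', '.gif', 'google-analytics.com'];

// Crawlee calls each pre-navigation hook with the crawling context
// before page.goto(), so the block list is in place for the first request.
const blockUnneededResources = async ({ blockRequests }) => {
    await blockRequests({ urlPatterns });
};

// Wiring it up (requires the crawlee package):
// import { PlaywrightCrawler } from 'crawlee';
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [blockUnneededResources],
//     requestHandler: async ({ page, request }) => { /* scrape here */ },
// });
// await crawler.run(['https://example.com']);
```

Registering the block in a hook rather than in `requestHandler` addresses the "too late" concern, since the handler only runs after navigation has finished.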
Hmm, that's interesting. It might be the limitation of it, I will check with the team - https://github.com/apify/crawlee/blob/435d44eaa77bf419ba15baf6f51ddd963d7bbc46/packages/playwright-crawler/src/internals/utils/playwright-utils.ts#L284
GitHub
crawlee/packages/playwright-crawler/src/internals/utils/playwright-...
Crawlee - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. - apify/crawlee
Google doesn't like extensive docs
https://chromedevtools.github.io/devtools-protocol/tot/Network/#method-setBlockedURLs
Chrome DevTools Protocol
Chrome DevTools Protocol - version tot - Network domain
robust-apricot•2y ago
It takes true FAANG engineers to write docs that tell you `Network.setBlockedURLs` "Blocks URLs from loading"
I was actually mistaken - it is still blocking the requests for e.g. `.jpg` after the mouse click, but the images are still being loaded because they're loaded through a CSS background image where the URL doesn't show the `.jpg` extension, so it doesn't match the glob pattern.
Makes sense. That's the limitation of globs.
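To make the glob limitation concrete, here is a small illustration. It assumes patterns behave like substring matches where only `*` is a wildcard (which is how I read the Crawlee docs for `urlPatterns`). The function `matchesBlockPattern` and both URLs are hypothetical, and this is not Chromium's or Crawlee's actual matching code:

```javascript
// Illustration only: approximates "substring match with * wildcards".
function matchesBlockPattern(url, pattern) {
    // Escape regex metacharacters in each literal part, then join with ".*".
    const escaped = pattern
        .split('*')
        .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    // No anchors: the pattern may match anywhere in the URL,
    // mirroring an implicit leading and trailing wildcard.
    return new RegExp(escaped).test(url);
}

const pattern = '.jpg';
const plainImage = 'https://example.com/photos/cat.jpg';
// Hypothetical CSS background image served without a file extension.
const cssBackground = 'https://example.com/media/98765?format=auto';

console.log(matchesBlockPattern(plainImage, pattern));    // true
console.log(matchesBlockPattern(cssBackground, pattern)); // false
```

Any URL-based pattern has this blind spot. An alternative that does not depend on the URL would be `page.route` with a check on `route.request().resourceType()`, subject to the caching caveat raised earlier in the thread.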
magic-amber OP•2y ago
Getting `blockRequests` from `crawlingContext` results in it always erroring with
ERROR PlaywrightCrawler: Request failed and reached maximum retries. ArgumentError: Cannot convert object to primitive value in object `options`
regardless of which options are used.
Getting `blockRequests` from the `playwrightUtils` exported by crawlee does work.
robust-apricot•2y ago
Could you share the code you're using?
magic-amber OP•2y ago
Is my code now. `urlPatterns` is an array of strings
if I remove "playwrightUtils" I get the error, no matter what urlPatterns I put in there
but I'm happy to use it from playwrightUtils. It's just a lot of work figuring out where to get what from crawlee in a way that works
For me it works, maybe you just have an old version of Crawlee
magic-amber OP•2y ago
i have 3.5.8
what happens when you actually add page as an argument, since that's required?
Ah I see, you must not add that as argument, that was your original bug
magic-amber OP•2y ago
you have to add that as an argument, as stated in the docs: https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
magic-amber OP•2y ago
the page parameter is not optional
magic-amber OP•2y ago
ah, because the shape of https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests is different
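The difference in shape can be sketched side by side. This assumes the two signatures are as the linked docs describe: the context helper takes only options, while the standalone utility takes the page first. The hook names and patterns are hypothetical, and `playwrightUtils` is passed in as a parameter only to keep the snippet free of imports:

```javascript
const urlPatterns = ['.jpg', '.png']; // hypothetical patterns

// Shape 1: the helper on the crawling context is already bound to the page,
// so it takes only the options object. Passing `page` first here would put
// a Page object where the options are expected.
const hookViaContext = async ({ blockRequests }) => {
    await blockRequests({ urlPatterns });
};

// Shape 2: the standalone utility takes the page as its first argument.
// In a real project `playwrightUtils` would come from
// `import { playwrightUtils } from 'crawlee'`.
const makeHookViaUtils = (playwrightUtils) => async ({ page }) => {
    await playwrightUtils.blockRequests(page, { urlPatterns });
};
```

Either hook can go into `preNavigationHooks`; mixing the two argument orders appears to be what produced the `ArgumentError` above.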
robust-apricot•2y ago
I've been playing around with this a bit more and was surprised that Chromium does not load the entire HTML element when an image URL is blocked. I would have expected to still see the `<img>` tag for a `.jpg` even when it's blocked, but the whole element is missing. I thought I could block images this way but still get the `src` to optionally grab them later; this pattern does not seem to work since there is no element to get the `src` from. Is there a better way to optionally get certain images?
You could get the naked HTML first and grab the src from there.
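A rough sketch of the "grab the src from the raw HTML" idea, assuming the raw markup is already available as a string (for example from a plain HTTP request made outside the browser). The function `extractImageSources` and the sample markup are hypothetical, and the regex is only good enough for a demo; a real crawler would more likely use an HTML parser:

```javascript
// Sketch: collect <img> src values from a raw HTML string.
function extractImageSources(html, baseUrl) {
    const sources = [];
    const imgTag = /<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/gi;
    for (const match of html.matchAll(imgTag)) {
        // Resolve relative paths against the page URL.
        sources.push(new URL(match[1], baseUrl).href);
    }
    return sources;
}

// Hypothetical markup standing in for the unrendered page source.
const rawHtml = `
  <div><img src="/photos/cat.jpg" alt="cat"></div>
  <img class="hero" src="https://cdn.example.com/dog.jpg">
`;

console.log(extractImageSources(rawHtml, 'https://example.com/gallery'));
```

With the URLs collected this way, the browser can keep blocking images while the crawler decides later which ones to download.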
stuck-chocolate•2y ago
Hello,
Has anyone managed to block requests to "optimizationguide-pa.googleapis.com"?
Blocking "%google%" requests works fine but not "optimizationguide-pa.googleapis.com".
You cannot even see "optimizationguide-pa.googleapis.com" in Chrome DevTools.
Any idea?
Thanks
Laurent