Casper
Casper•4y ago

Disable images in Playwright

How can I disable downloading images and videos and other media globally for my scraper?
22 Replies
ratty-blush
ratty-blush•4y ago
You can create an array of resourceTypes that you'd like to block.
const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];
Then within your preNavigationHooks of your crawler, add this function:
async ({ page }) => {
    await page.route('**/*', (route) => {
        if (BLOCKED.includes(route.request().resourceType())) return route.abort();
        return route.continue();
    });
};
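If the same hook is needed in more than one crawler, it could be wrapped in a small factory. This is only a sketch: `createBlockingHook` is a hypothetical helper name, not something Playwright or Crawlee provides, and it relies only on `page.route`, `route.request().resourceType()`, `route.abort()` and `route.continue()`.

```javascript
// Hypothetical helper (not part of Playwright or Crawlee): builds a
// pre-navigation hook that aborts requests of the given resource types.
function createBlockingHook(blockedTypes) {
    const blocked = new Set(blockedTypes);
    return async ({ page }) => {
        await page.route('**/*', (route) => {
            // Abort blocked resource types, let everything else through.
            if (blocked.has(route.request().resourceType())) return route.abort();
            return route.continue();
        });
    };
}

// Usage inside a crawler config:
// preNavigationHooks: [createBlockingHook(['image', 'media', 'font'])],
```

The factory returns a function with the same `({ page })` shape as the hook above, so it can be placed directly in the `preNavigationHooks` array.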
Casper
CasperOP•4y ago
Thanks I will try that
ratty-blush
ratty-blush•4y ago
You can also check out this article https://scrapingant.com/blog/block-requests-playwright
Block resources with Playwright | ScrapingAnt Blog
This article will show you how to intercept and block requests with Playwright using the request interception API. Learn how to block images, CSS and Javascript loading.
Casper
CasperOP•4y ago
Thanks
Casper
CasperOP•4y ago
I have this in my main.ts file:
[Image attachment not preserved]
Casper
CasperOP•4y ago
it does not work yet, can you spot an error?
Casper
CasperOP•4y ago
I inject it here:
[Image attachment not preserved]
ratty-blush
ratty-blush•4y ago
Just add the function directly into the crawler
const playwrightCrawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: playwrightRouter,
    requestQueue: playwrightRequestQueue,
    headless: true,
    launchContext: {
        launcher: firefox,
    },
    preNavigationHooks: [
        async ({ page }) => {
            await page.route('**/*', (route) => {
                if (BLOCKED_RESOURCES.includes(route.request().resourceType())) {
                    return route.abort();
                }

                return route.continue();
            });
        },
    ],
    autoscaledPoolOptions: {
        desiredConcurrency: 6,
    },
    navigationTimeoutSecs: 45,
    requestHandlerTimeoutSecs: PLACE_ID_REQUESTS_CHUNK_SIZE * 15,
    maxRequestRetries: 4,
    // ! development only
    // maxRequestsPerCrawl: 1,
});
Here's one of my crawlers using the preNavigationHook
Casper
CasperOP•4y ago
Thanks, it works. However, I don't get why I consume so much bandwidth.
Casper
CasperOP•4y ago
Is it possible to see all the requests made for each URL, e.g. https://dk.trustpilot.com/review/www.diba.dk
Trustpilot
Diba Billån is rated "Excellent" with 4.8 / 5 on Trustpilot
Do you agree with Diba Billån's TrustScore? Share your opinion today and find out what 665 customers have already said.
Casper
CasperOP•4y ago
So I can inspect them in Playwright and see which requests are unnecessary, or do I need to use Chrome DevTools for that?
ratty-blush
ratty-blush•4y ago
The reason is that request interception disables the HTTP cache in Playwright, so you are downloading everything every single time.
ratty-blush
ratty-blush•4y ago
Apify
Cache responses in Puppeteer · Apify
Why and how to cache responses in memory using Puppeteer
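The linked article targets Puppeteer; a rough Playwright equivalent might look like the sketch below. It assumes Playwright 1.29 or newer (for `route.fetch()`), and `createCachingHandler` is a hypothetical helper name. It caches GET responses in memory by URL and deliberately ignores `Cache-Control` headers, so treat it as an illustration rather than a drop-in cache.

```javascript
// Hypothetical helper: a route handler that keeps GET responses in memory.
// Assumes Playwright 1.29+ (route.fetch) and ignores Cache-Control headers.
function createCachingHandler(cache = new Map()) {
    return async (route) => {
        const request = route.request();
        // Only GET requests are cached; everything else goes to the network.
        if (request.method() !== 'GET') return route.continue();

        const url = request.url();
        const cached = cache.get(url);
        if (cached) return route.fulfill(cached);

        // Cache miss: perform the request once and remember the result.
        const response = await route.fetch();
        const entry = {
            status: response.status(),
            headers: response.headers(),
            body: await response.body(),
        };
        cache.set(url, entry);
        return route.fulfill(entry);
    };
}

// Usage: share one handler (and therefore one cache) across pages.
// const cachingHandler = createCachingHandler();
// preNavigationHooks: [async ({ page }) => { await page.route('**/*', cachingHandler); }],
```

Because the cache lives in a plain `Map`, it grows for as long as the process runs; a long crawl would probably want a size limit or a filter on resource type.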
ratty-blush
ratty-blush•4y ago
It is possible to see them all! Just add this function to your preNavigationHooks:
async ({ page }) => {
    page.on('request', (req) => console.log(req));
};
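To spot unnecessary requests, a per-type tally can be easier to read than a raw log. A minimal sketch, assuming only the `request` event and `req.resourceType()` from Playwright; `trackRequests` is a hypothetical helper name.

```javascript
// Hypothetical helper: tallies requests per resource type for one page.
// `page` only needs an `on(event, listener)` method, like a Playwright Page.
function trackRequests(page) {
    const counts = {};
    page.on('request', (req) => {
        const type = req.resourceType();
        counts[type] = (counts[type] || 0) + 1;
    });
    return counts;
}

// Usage in a pre-navigation hook, printing the tally when the page closes:
// async ({ page }) => {
//     const counts = trackRequests(page);
//     page.on('close', () => console.table(counts));
// },
```

The returned object is updated in place as requests are issued, so it can be read at any point after navigation.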
ratty-blush
ratty-blush•4y ago
All of this stuff is covered in our Playwright/Puppeteer course in the academy: https://developers.apify.com/academy/puppeteer-playwright
Apify
Puppeteer & Playwright · Apify Developers
Learn in-depth how to use two of the most popular Node.js libraries for controlling a headless browser - Puppeteer and Playwright.
Casper
CasperOP•4y ago
Thanks. I have this, but I cannot get access to the URL. I presume it is because I need to await it, but I can't use await there:
[Image attachment not preserved]
ratty-blush
ratty-blush•4y ago
req.url() is a function and does not need to be awaited.
page.on('request', (req) => console.log(req.url()));
Casper
CasperOP•4y ago
thanks. I missed the ()
ratty-blush
ratty-blush•4y ago
I agree that it should be a getter instead of a function. req.url makes much more sense than req.url().
Casper
CasperOP•4y ago
Yeah, but it is a small issue. Amazing how much bandwidth is saved by the cache:
98 requests, 1.6 MB without cache
96 requests, 54 KB with cache
Is there a better option to avoid downloading unnecessary files than manually intercepting requests?
ratty-blush
ratty-blush•4y ago
Nope sadly
