ambitious-aqua•3y ago
How can I skip .pdf files in PuppeteerCrawler?
I want to exclude all .pdf and .docx files from crawling.
1 Reply
metropolitan-bronze•3y ago
See playwrightUtils.blockRequests (for PuppeteerCrawler, the equivalent is puppeteerUtils.blockRequests):
Forces the Playwright browser tab to block loading URLs that match a provided pattern. This is useful to speed up crawling of websites, since it reduces the amount of data that needs to be downloaded from the web, but it may break some websites or unexpectedly prevent loading of resources.
By default, the function will block all URLs including the following patterns:
[".css", ".jpg", ".jpeg", ".png", ".svg", ".gif", ".woff", ".pdf", ".zip"]
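Note that ".docx" is not in the default list above, and that blockRequests only stops the browser tab from loading matching URLs; it does not stop links from being enqueued. A minimal sketch of a link filter that could cover both extensions is below. The names SKIPPED_EXTENSIONS, isSkippedDocument and skipDocuments are hypothetical, and the wiring shown in the trailing comments assumes Crawlee v3 (preNavigationHooks, the extraUrlPatterns option of blockRequests, and the transformRequestFunction option of enqueueLinks), so check it against the version you use.

```javascript
// Extensions to skip. '.pdf' is already in the blockRequests defaults; '.docx' is not.
const SKIPPED_EXTENSIONS = ['.pdf', '.docx'];

// Returns true when the URL path ends with one of the skipped extensions.
// Only the path is checked, so 'report.pdf?download=1' still matches.
function isSkippedDocument(url) {
    let pathname;
    try {
        pathname = new URL(url).pathname.toLowerCase();
    } catch {
        return false; // not an absolute URL; leave the decision to the crawler
    }
    return SKIPPED_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}

// Shaped for the transformRequestFunction option of enqueueLinks:
// a falsy return value drops the request, anything else is enqueued.
function skipDocuments(request) {
    return isSkippedDocument(request.url) ? false : request;
}

// Assumed wiring inside a PuppeteerCrawler (not executed here):
//   preNavigationHooks: [async ({ page }) => {
//       await puppeteerUtils.blockRequests(page, { extraUrlPatterns: ['.docx'] });
//   }],
//   requestHandler: async ({ enqueueLinks }) => {
//       await enqueueLinks({ transformRequestFunction: skipDocuments });
//   },
```

Filtering at enqueue time keeps document URLs out of the request queue entirely, while blockRequests only prevents pages that are already open from downloading them as sub-resources.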
https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
playwrightUtils | API | Crawlee
A namespace that contains various utilities for Playwright - the headless Chrome Node API.
Example usage:
```javascript
import { launchPlaywright, playwrightUtils } from 'crawlee';
// Navigate to https://www.example.com in Playwright with a POST request
const browser = await launchPlaywright();
c...
```