sensitive-blue
sensitive-blue2y ago

playwright & pdf + error handling

Hello, Playwright will throw a net::ERR_ABORTED error when navigating to any kind of PDF file. The only way I've figured out how to handle this is in a preNavigationHook, since I can't catch the error in the router handler. Does anyone have better suggestions? I'm wondering if I should run two crawlers: Playwright for normal pages, and then when it comes across a PDF, send it to Cheerio? Thanks!
7 Replies
sensitive-blue
sensitive-blueOP2y ago
specifically I'm doing a page.on('download', ...) in the preNavigationHook -- doesn't seem smart
sensitive-blue
sensitive-blueOP2y ago
https://github.com/microsoft/playwright/issues/7822 I'm also doing this workaround in the preNavigationHook
GitHub
[Feature] Make PDF testing idiomatic · Issue #7822 · microsoft/play...
Customers are confused, when a PDF results in a PDF viewer and when in a download event. We should explain how to workaround it in the relevant browsers. #7830 #6091 #3509 #3365 #6342 #20633 To mak...
sensitive-blue
sensitive-blueOP2y ago
tagging friends for help ❤️ @Pepa J @Lukas Krivka Also, I guess another question is... how do you run two long-running crawlers? I ended up solving this by creating a BasicCrawler that just downloads the PDF and running await Promise.all([crawler.run(), pdf_crawler.run()]). This is my preNavigationHook, which I'm not sure is working so well:
async (crawlingContext, gotoOptions) => {
    gotoOptions.waitUntil = 'networkidle'
    const page = crawlingContext.page
    const request = crawlingContext.request

    // Intercept navigations to .pdf URLs, hand them off to the PDF crawler,
    // and abort the browser request so Playwright does not try to render the file.
    await page.route('**/*.pdf', async route => {
        request.noRetry = true
        console.log('running pdf', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
        await route.abort()
    })

    // Some servers trigger a download instead (e.g. via Content-Disposition);
    // hand those off to the PDF crawler as well.
    page.on('download', async (download: Download) => {
        request.noRetry = true
        console.log('running download', request.url)
        const crawler_request = new CrawleeRequest({ url: request.url, userData: request.userData })
        await pdf_crawler.addRequests([crawler_request])
    })
},
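For reference, here is a minimal sketch of how the pieces described above could fit together, assuming Crawlee v3 and got-scraping. The queue names, the storage key, and the pdf_crawler handler body are illustrative assumptions, not code from this thread:

import { BasicCrawler, KeyValueStore, PlaywrightCrawler, RequestQueue } from 'crawlee'
import { gotScraping } from 'got-scraping'

// Separate named queues so the two crawlers never consume each other's requests.
const pageQueue = await RequestQueue.open('pages')
const pdfQueue = await RequestQueue.open('pdfs')

// Plain HTTP crawler that just downloads whatever PDF URLs it is handed.
const pdf_crawler = new BasicCrawler({
    requestQueue: pdfQueue,
    requestHandler: async ({ request }) => {
        const { body } = await gotScraping({ url: request.url, responseType: 'buffer' })
        // Derive a storage key from the URL (illustrative only) and save the file.
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '_')
        await KeyValueStore.setValue(key, body, { contentType: 'application/pdf' })
    },
})

// Browser crawler for normal pages; the hook shown above goes into preNavigationHooks.
const crawler = new PlaywrightCrawler({
    requestQueue: pageQueue,
    preNavigationHooks: [
        /* the hook shown above */
    ],
    requestHandler: async ({ enqueueLinks }) => {
        await enqueueLinks()
    },
})

// Run both long-running crawlers side by side.
await Promise.all([crawler.run(), pdf_crawler.run()])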
Pepa J
Pepa J2y ago
Hi @bmax, generally I don't think that using a preNavigation hook is a bad idea. You might exclude the URLs of PDF files from crawling and download them directly via got-scraping. Why do you want to open a PDF in Playwright? A PDF has no DOM structure, so you would not be able to use the usual Playwright calls on it anyway. When using two crawlers, please make sure they each use a different RequestQueue, to avoid conflicts where one crawler processes requests meant for the other. I'm not sure what you mean by sending a PDF to Cheerio; again, a PDF is not an HTML page, it has no DOM structure, and Cheerio can only work with XML-based documents.
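As a rough illustration of the "exclude PDF URLs from crawling" part, something like the snippet below could go inside the Playwright requestHandler. It assumes a recent Crawlee version where enqueueLinks supports an exclude option; treat the option and the pattern as assumptions:

// Enqueue discovered links, but skip anything that already looks like a PDF by extension.
await enqueueLinks({
    exclude: ['**/*.pdf'],
})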
sensitive-blue
sensitive-blueOP2y ago
@Pepa J, thanks so much for the response. I ended up doing exactly what you said with the different request queues, and I used the BasicCrawler to handle any PDF. The problem is I can't just exclude PDF URLs, because some URLs don't have the file extension in them: they redirect to a PDF or just use a content type to output it. It took me way too long to figure out the request queue thing. @Pepa J @Lukas Krivka, is there a way to make maxRequestsPerCrawl apply per request queue, and to create a new request queue every time I have a "separate crawl"? Is there a way to open a new queue and set it on a specific crawler?
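For the URLs that only reveal themselves as PDFs through the response headers, one possible (untested, illustrative) variant of the hook is to intercept the navigation, peek at the Content-Type, and only then hand off to pdf_crawler. This assumes the same hook context as above (page, request, pdf_crawler) and a Playwright version that supports route.fetch():

await page.route('**/*', async route => {
    // Only inspect top-level navigations; let sub-resources through untouched.
    if (route.request().resourceType() !== 'document') return route.continue()
    // Fetch the response outside the browser so the headers can be inspected.
    const response = await route.fetch()
    const contentType = response.headers()['content-type'] ?? ''
    if (contentType.includes('application/pdf')) {
        request.noRetry = true
        await pdf_crawler.addRequests([{ url: request.url, userData: request.userData }])
        return route.abort()
    }
    // Not a PDF: serve the already-fetched response to the page as usual.
    return route.fulfill({ response })
})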
Pepa J
Pepa J2y ago
@bmax
is there a way to make maxRequestsPerCrawl per request queue
It is a Crawler option, so it has to be set on the Crawler.
create a new request queue every time I have a "separate crawl"
Yes, you may create a new RequestQueue whenever you want: await Actor.openRequestQueue("my-nw-request-queue-1"). I am not sure if it is allowed to create multiple default (unnamed) RequestQueues in a single run - I know it was an issue in the past.
is there a way to open a new queue and set it on a specific crawler?
You need to pass the RequestQueue to the Crawler options (the requestQueue option).
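A short sketch of that, with a made-up queue name (the handler body is just a placeholder):

import { Actor } from 'apify'
import { PlaywrightCrawler } from 'crawlee'

await Actor.init()

// Open (or create) a named queue and hand it to one specific crawler.
const myQueue = await Actor.openRequestQueue('my-separate-crawl-queue')

const crawler = new PlaywrightCrawler({
    requestQueue: myQueue,
    // maxRequestsPerCrawl is a crawler option, so it effectively limits this queue's crawl.
    maxRequestsPerCrawl: 500,
    requestHandler: async ({ page }) => {
        // ... scrape the page
    },
})

await crawler.run()
await Actor.exit()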
