sensitive-blue•2y ago
playwright & pdf + error handling
Hello,
Playwright throws a net::ERR_ABORTED error when it navigates to any kind of PDF file. The only way I've figured out to handle this is in a preNavigationHook, since I can't catch the error in the router handler.
Does anyone have a better suggestion? I'm wondering if I should run two crawlers: Playwright for normal pages, and then when it comes across a PDF, send it to Cheerio?
Thanks!
sensitive-blueOP•2y ago
Specifically, I'm registering a page.on('download', ...) listener in the preNavigationHook, which doesn't seem smart.
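A minimal sketch of that hook, assuming Crawlee's PlaywrightCrawler (the ./downloads path is illustrative):

```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            // Playwright emits a 'download' event instead of completing the
            // navigation when the response is a PDF, so capture it here.
            page.on('download', async (download) => {
                await download.saveAs(`./downloads/${download.suggestedFilename()}`);
            });
        },
    ],
    requestHandler: async ({ request }) => {
        // Normal HTML pages end up here; PDFs never reach this handler
        // because their navigation is aborted with net::ERR_ABORTED.
    },
});
```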
sensitive-blueOP•2y ago
https://github.com/microsoft/playwright/issues/7822 i'm also doing this workaround in the preNavigationHook
sensitive-blueOP•2y ago
tagging friends for help ❤️ @Pepa J @Lukas Krivka
Also, I guess another question is: how do you run two long-running crawlers? I ended up solving this by creating a BasicCrawler that just downloads the PDF, trying something like this:
await Promise.all([crawler.run(), pdf_crawler.run()])
And then there's my preNavigationHook, which I'm not sure is working so well.
Pepa J•2y ago
Hi @bmax ,
generally I don't think that using a preNavigation hook is a bad idea. You might exclude the URLs for PDF files from crawling and download them directly via got-scraping.
Why do you want to open a PDF in Playwright? A PDF has no DOM structure, so you would not be able to use the usual Playwright calls anyway.
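Downloading a PDF directly with got-scraping could look like this (the URL and output path are illustrative):

```javascript
import { writeFile } from 'node:fs/promises';
import { gotScraping } from 'got-scraping';

// Fetch the PDF as a raw buffer instead of opening it in a browser.
const { body } = await gotScraping({
    url: 'https://example.com/report.pdf', // illustrative URL
    responseType: 'buffer',
});
await writeFile('./report.pdf', body);
```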
When using two crawlers, please make sure they each use a different RequestQueue, to avoid conflicts where one crawler processes requests meant for the other.
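A sketch of the two-crawler setup with separate named queues (queue names and handlers are illustrative):

```javascript
import { PlaywrightCrawler, BasicCrawler, RequestQueue } from 'crawlee';

// Separate named queues so the two crawlers never compete for requests.
const pageQueue = await RequestQueue.open('pages');
const pdfQueue = await RequestQueue.open('pdfs');

const pageCrawler = new PlaywrightCrawler({
    requestQueue: pageQueue,
    requestHandler: async ({ page }) => { /* handle HTML pages */ },
});

const pdfCrawler = new BasicCrawler({
    requestQueue: pdfQueue,
    requestHandler: async ({ request, sendRequest }) => {
        // Download the PDF body without a browser.
        const { body } = await sendRequest({ responseType: 'buffer' });
    },
});

await Promise.all([pageCrawler.run(), pdfCrawler.run()]);
```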
Not sure what you mean by sending a PDF to Cheerio. Again, a PDF is not an HTML page, it has no DOM structure, and Cheerio can only work with XML-based documents.
sensitive-blueOP•2y ago
@Pepa J, thanks so much for the response. I ended up doing exactly what you said with the different request queues, and I used the BasicCrawler to send in any PDF.
The problem is I can't just exclude PDF URLs, because some URLs don't have the file extension in them: they redirect to a PDF or just use a content type to output it.
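Since the extension alone can't be trusted, one way to classify responses is to check the Content-Type header first and fall back to the URL. This isPdf helper is a hypothetical sketch, not part of Crawlee:

```javascript
// Hypothetical helper: an application/pdf Content-Type OR a URL path ending
// in .pdf marks the resource as a PDF, since extension-less URLs can still
// redirect to or stream a PDF.
function isPdf(url, contentType) {
    const type = (contentType ?? '').split(';')[0].trim().toLowerCase();
    if (type === 'application/pdf') return true;
    try {
        return new URL(url).pathname.toLowerCase().endsWith('.pdf');
    } catch {
        return false; // not an absolute URL; rely on Content-Type only
    }
}
```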
It took me way too long to figure out the request queue thing.
@Pepa J @Lukas Krivka is there a way to make maxRequestsPerCrawl per request queue, and then create a new RequestQueue every time I have a "separate crawl"? Is there a way to open a new queue and set it on a specific crawler?
@bmax
> is there a way to make maxRequestsPerCrawl per request queue
It is a Crawler option, so it has to be set on the Crawler.
> create a new requestqueue every time I have a "separate crawl"
Yes, you may create a new RequestQueue whenever you want:
await Actor.openRequestQueue("my-nw-request-queue-1")
I am not sure if it is allowed to create multiple default (unnamed) RequestQueues in a single run; I know it was an issue in the past.
> is there a way to open a new queue and set it on a specific crawler?
You need to pass the RequestQueue to the Crawler options (the requestQueue option).
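Putting the two answers together, a sketch assuming the Apify SDK (the queue name and crawler type are illustrative):

```javascript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Open a fresh named queue for this "separate crawl"...
const queue = await Actor.openRequestQueue('separate-crawl-1');

// ...and hand it to the crawler via the requestQueue option.
const crawler = new PlaywrightCrawler({
    requestQueue: queue,
    maxRequestsPerCrawl: 100, // applies to this crawler run, not the queue
    requestHandler: async ({ page }) => { /* ... */ },
});

await crawler.run();
await Actor.exit();
```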