Downloading PDFs and other files
When crawling, whenever a download starts on a page (a PDF or similar), Crawlee throws an error. What would be the correct way to catch this error and do the download myself? I have a similar problem with XML files. In other words, I am using a Playwright crawler, but I want to be able to download content (and parse + enqueue links) on my own when my crawler can't. I was thinking of having another request queue for PDFs that I dequeue using another crawler (adding to this queue by inferring from the URL that it is a PDF, for example), but I was hoping there was an easier way, like defining a fallback parser or similar. Thanks in advance!
4 Replies
Hey @Eric
> I was thinking on having another request queue for pdfs that I dequeue using another crawler

I would choose this particular approach. Usually, file links are regular static links that can be processed using HttpCrawler. So you can use context.enqueue_links(selector='[selector for pdf links]', rq_name='pdf_crawler') in PlaywrightCrawler to pass the links to the HttpCrawler queue. The pdf_crawler queue must be created in advance; see the sketch below.
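A minimal sketch of that two-crawler setup, assuming a recent version of Crawlee for Python. The import paths, the request_manager= constructor argument, and the a[href$=".pdf"] selector are illustrative assumptions, not something confirmed in this thread:

```python
import asyncio

from crawlee.crawlers import (
    HttpCrawler,
    HttpCrawlingContext,
    PlaywrightCrawler,
    PlaywrightCrawlingContext,
)
from crawlee.storages import RequestQueue


async def main() -> None:
    # The named queue must exist before enqueue_links() can target it.
    pdf_queue = await RequestQueue.open(name='pdf_crawler')

    browser_crawler = PlaywrightCrawler()
    # Wire the HttpCrawler to the shared named queue (the parameter name
    # may differ between Crawlee versions).
    pdf_crawler = HttpCrawler(request_manager=pdf_queue)

    @browser_crawler.router.default_handler
    async def handle_page(context: PlaywrightCrawlingContext) -> None:
        # Ordinary links stay in the Playwright crawler's own queue.
        await context.enqueue_links()
        # Links that look like PDFs go to the HttpCrawler's queue instead.
        await context.enqueue_links(selector='a[href$=".pdf"]', rq_name='pdf_crawler')

    @pdf_crawler.router.default_handler
    async def handle_pdf(context: HttpCrawlingContext) -> None:
        # Plain HTTP download: read the bytes and store them.
        data = await context.http_response.read()
        kvs = await context.get_key_value_store()
        await kvs.set_value(context.request.id, data, content_type='application/pdf')

    await browser_crawler.run(['https://example.com'])
    await pdf_crawler.run()


if __name__ == '__main__':
    asyncio.run(main())
```

Running the two crawlers sequentially keeps the example simple; they could also run concurrently if the queue is still being filled while the HttpCrawler drains it.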
This means I have to detect whether a document is a PDF when enqueuing, no? So only from the URL (which doesn't always make it obvious that it is a PDF). Is there a way I can detect this after Crawlee detects it requires a download?
Then you can use the following:
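A sketch of one such pattern: register a Playwright download listener in a pre-navigation hook, so the file is captured even when the navigation itself turns into a download. This assumes Crawlee for Python exposes pre_navigation_hook and browser_new_context_options; the handler name and save location are illustrative:

```python
from playwright.async_api import Download

from crawlee.crawlers import PlaywrightCrawler, PlaywrightPreNavCrawlingContext

# accept_downloads tells the browser context to keep downloaded files
# instead of cancelling the download outright.
crawler = PlaywrightCrawler(
    browser_new_context_options={'accept_downloads': True},
)


@crawler.pre_navigation_hook
async def capture_downloads(context: PlaywrightPreNavCrawlingContext) -> None:
    # When a navigation turns into a download (e.g. a PDF), Playwright
    # aborts the navigation and emits a 'download' event on the page.
    async def on_download(download: Download) -> None:
        # Save next to the script; the target path is just an example.
        await download.save_as(download.suggested_filename)

    context.page.on('download', on_download)
```

Note that the navigation which triggered the download still fails (typically with net::ERR_ABORTED), so that request will be retried or end up marked as failed unless you also account for it, for example in a failed_request_handler.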
oh nice! thanks!