Eric · 2w ago

Downloading PDFs and other files

When crawling, whenever a download starts in a webpage (a PDF or similar), crawlee errors out. What would be the correct way to catch this error and do the download myself? I have a similar problem with XMLs. In other words, I am using a Playwright crawler, but I want to be able to download content (and parse + enqueue links) on my own when my crawler can't. I was thinking of having another request queue for PDFs that I dequeue using another crawler (adding to that queue by inferring it is a PDF from the URL, for example), but I was hoping there was an easier way, like defining a fallback parser or similar. Thanks in advance!
Mantisus · 2w ago
Hey @Eric
I was thinking of having another request queue for PDFs that I dequeue using another crawler
I would choose this particular approach. Usually, file links are regular static links that can be processed using HttpCrawler. So you can use context.enqueue_links(selector='[selector for pdf links]', rq_name='pdf_crawler') in PlaywrightCrawler to pass the links to the HttpCrawler queue. The pdf_crawler queue must be created in advance.
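For illustration, here is a minimal sketch of the HttpCrawler side, assuming a recent crawlee for Python release; the storage-key scheme and the exact http_response.read() signature are assumptions rather than verified guarantees:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import KeyValueStore, RequestQueue


async def main() -> None:
    # Open (and thereby create) the named queue that the PlaywrightCrawler
    # fills via enqueue_links(..., rq_name='pdf_crawler').
    pdf_queue = await RequestQueue.open(name='pdf_crawler')

    # A plain HTTP crawler is enough for static file links.
    crawler = HttpCrawler(request_manager=pdf_queue)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Downloading {context.request.url}')
        data = await context.http_response.read()

        kvs = await KeyValueStore.open()
        # Hypothetical key scheme: last path segment of the URL.
        key = context.request.url.rsplit('/', 1)[-1] or 'download.pdf'
        await kvs.set_value(key, data, content_type='application/pdf')

    # No start URLs: the crawler drains whatever is already in the queue.
    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())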
Eric (OP) · 2w ago
This means I have to detect whether a document is a PDF when enqueuing, no? So only from the URL (and it isn't always obvious from the URL that something is a PDF). Is there a way I can detect this after crawlee detects that it requires a download?
Mantisus · 2w ago
Then you can use the following
import asyncio

from playwright.async_api import Error
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_request_retries=1,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        await context.enqueue_links()

    @crawler.error_handler
    async def error_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
        # Playwright raises an Error with this message when a navigation
        # turns into a file download instead of a page load.
        if isinstance(error, Error) and 'Download is starting' in error.message:
            context.log.error(f'Error processing {context.request.url}')

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
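Combining the two suggestions, the error handler can hand the URL off to the pdf_crawler queue instead of only logging it. A minimal sketch, reusing the imports and crawler from the snippet above (the hand-off itself is an assumption, not something shown verbatim in this thread):

from crawlee.storages import RequestQueue


@crawler.error_handler
async def error_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
    if isinstance(error, Error) and 'Download is starting' in error.message:
        # Re-route the URL into the named queue consumed by the HttpCrawler.
        pdf_queue = await RequestQueue.open(name='pdf_crawler')
        await pdf_queue.add_request(context.request.url)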
Eric (OP) · 2w ago
oh nice! thanks!