Eric · 2w ago

Downloading PDFs and other files

When crawling, whenever a download starts in a webpage (a PDF or similar), crawlee errors out. What would be the correct way to catch this error and do the download myself? I have a similar problem with XMLs. In other words, I am using a Playwright crawler, but I want to be able to download content (and parse + enqueue links) on my own when my crawler can't. I was thinking of having another request queue for PDFs that I dequeue using another crawler (adding to that queue by inferring it is a PDF from the URL, for example), but I was hoping there was an easier way, like defining a fallback parser or similar. Thanks in advance!
Mantisus · 2w ago
Hey @Eric
I was thinking of having another request queue for PDFs that I dequeue using another crawler
I would choose this particular approach. Usually, file links are regular static links that can be processed using HttpCrawler. So you can use context.enqueue_links(selector='[selector for pdf links]', rq_name='pdf_crawler') in PlaywrightCrawler to pass the links to the HttpCrawler queue. The pdf_crawler queue must be created in advance.
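For illustration, here is a minimal sketch of the HttpCrawler side, assuming a recent crawlee for Python release; the storage-key scheme and the exact http_response.read() signature are assumptions rather than verified guarantees:

import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storages import KeyValueStore, RequestQueue


async def main() -> None:
    # Open (and thereby create) the named queue that the PlaywrightCrawler
    # fills via enqueue_links(..., rq_name='pdf_crawler').
    pdf_queue = await RequestQueue.open(name='pdf_crawler')

    # A plain HTTP crawler is enough for static file links.
    crawler = HttpCrawler(request_manager=pdf_queue)

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Downloading {context.request.url}')
        data = await context.http_response.read()

        kvs = await KeyValueStore.open()
        # Hypothetical key scheme: last path segment of the URL.
        key = context.request.url.rsplit('/', 1)[-1] or 'download.pdf'
        await kvs.set_value(key, data, content_type='application/pdf')

    # No start URLs: the crawler drains whatever is already in the queue.
    await crawler.run()


if __name__ == '__main__':
    asyncio.run(main())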
Eric (OP) · 2w ago
This means I have to detect whether a document is a PDF when enqueuing, no? So only from the URL (and it isn't always obvious from the URL that something is a PDF). Is there a way I can detect this after crawlee detects that it requires a download?
Mantisus · 2w ago
Then you can use the following
import asyncio

from playwright.async_api import Error
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        max_request_retries=1,
    )

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

        await context.enqueue_links()

    @crawler.error_handler
    async def error_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
        # Playwright raises an Error with this message when a navigation
        # turns into a file download instead of a page load.
        if isinstance(error, Error) and 'Download is starting' in error.message:
            context.log.error(f'Error processing {context.request.url}')

    await crawler.run(['https://example.com'])  # placeholder start URL


if __name__ == '__main__':
    asyncio.run(main())
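Combining the two suggestions, the error handler can hand the URL off to the pdf_crawler queue instead of only logging it. A minimal sketch, reusing the imports and crawler from the snippet above (the hand-off itself is an assumption, not something shown verbatim in this thread):

from crawlee.storages import RequestQueue


@crawler.error_handler
async def error_handler(context: PlaywrightCrawlingContext, error: Exception) -> None:
    if isinstance(error, Error) and 'Download is starting' in error.message:
        # Re-route the URL into the named queue consumed by the HttpCrawler.
        pdf_queue = await RequestQueue.open(name='pdf_crawler')
        await pdf_queue.add_request(context.request.url)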
Eric (OP) · 2w ago
oh nice! thanks!