❓ Help Needed: Downloading Linked PDF Files with Crawlee 🕸📥

Hello everyone,

I need some help with Crawlee. I've been using CheerioCrawler to scrape pages and I've managed to extract links and store page titles and URLs into a dataset. Now I want to add functionality to download linked files, like PDFs, from the scraped pages. However, I'm unsure how to do this natively with Crawlee.

Here's my current code:

import { CheerioCrawler, Dataset } from 'crawlee';

// CheerioCrawler crawls the web using HTTP requests
// and parses HTML using the Cheerio library.
const crawler = new CheerioCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, $, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await Dataset.pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        await enqueueLinks();
    },
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://cloudflare.net/events-and-presentations']);

Could anyone guide me on how to modify this code to download linked files, specifically PDFs, from the scraped pages? Any help would be appreciated, thank you!
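
One possible starting point, offered as a sketch under assumptions rather than a confirmed Crawlee recipe: collect the `href` values inside the requestHandler (for example with `$('a[href]')`), keep only those that resolve to a `.pdf` path, and derive a storage-safe name for each file. The two helpers below, `findPdfLinks` and `pdfKeyFromUrl`, are hypothetical names and not part of Crawlee. They assume the PDFs can be recognized by their URL path, which will miss files served from extensionless URLs.

```typescript
// Hypothetical helpers (not part of Crawlee). Assumption: PDF links can be
// recognized by a ".pdf" suffix on the URL path.

/** Resolve hrefs against the page URL and return unique http(s) links ending in .pdf. */
function findPdfLinks(hrefs: string[], pageUrl: string): string[] {
    const found = new Set<string>();
    for (const href of hrefs) {
        try {
            const url = new URL(href, pageUrl);
            // Skip mailto:, javascript:, and other non-HTTP schemes.
            if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
            if (url.pathname.toLowerCase().endsWith('.pdf')) {
                url.hash = ''; // drop fragments such as #page=2 so duplicates collapse
                found.add(url.href);
            }
        } catch {
            // Ignore hrefs that cannot be parsed as URLs.
        }
    }
    return Array.from(found);
}

/** Turn a PDF URL into a name containing only letters, digits, '.', '_' and '-'. */
function pdfKeyFromUrl(pdfUrl: string): string {
    const segments = new URL(pdfUrl).pathname.split('/').filter(Boolean);
    const last = segments.length > 0 ? segments[segments.length - 1] : 'file.pdf';
    let decoded = last;
    try {
        decoded = decodeURIComponent(last);
    } catch {
        // Keep the raw segment if it is not valid percent-encoding.
    }
    return decoded.replace(/[^a-zA-Z0-9._-]/g, '-');
}
```

Inside the requestHandler, the hrefs could then be passed through `findPdfLinks(hrefs, request.loadedUrl)`, each resulting URL fetched as binary data, and the bytes stored under `pdfKeyFromUrl(url)`, for instance with `KeyValueStore.setValue(key, buffer, { contentType: 'application/pdf' })`. That wiring is untested against the site in the question.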
