Trying out Crawlee, Etsy not working

Hi Apify,

Thank you for this fine auto-scraping tool, Crawlee! I wanted to follow along with the tutorial using a different URL, e.g. https://www.etsy.com/search?q=wooden%20box, but it failed with PlaywrightCrawler.

// For more information, see https://crawlee.dev/
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';


// PlaywrightCrawler crawls the web using a headless
// browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
    },
    maxRequestRetries: 1,
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, log, pushData }) {
        await page.waitForTimeout(5000);
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to ./storage/datasets/default
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page
        // and add them to the crawling queue.
        // await enqueueLinks();
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 1,
    // Show the browser window (set to true to run headless).
    headless: false,
});

// Add first URL to the queue and start the crawl.
await crawler.run(['https://www.etsy.com/search?q=wooden%20box']);
//await crawler.run(['https://www.etsy.com']); //works
//await crawler.run(['https://www.amazon.com']); //works

It seems to fail at the "Checking device" step. I thought Crawlee injected a TLS fingerprint and a browser fingerprint, but Etsy still seems to block it with a 403!
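To tell a block apart from a slow page load, a small check along these lines might help. This is only a sketch: `looksBlocked` is a hypothetical helper, not part of Crawlee, and the status codes and title hints are my own guesses (only "Checking device" comes from what I saw on Etsy).

```javascript
// Hypothetical helper (not a Crawlee API): guesses whether a page is an
// anti-bot interstitial from the HTTP status code and the page title.
// The codes and title hints below are assumptions, adjust them per site.
const BLOCK_STATUS_CODES = new Set([401, 403, 429]);
const BLOCK_TITLE_HINTS = ['checking device', 'access denied', 'captcha'];

function looksBlocked(status, title) {
    // A blocking status code is enough on its own.
    if (BLOCK_STATUS_CODES.has(status)) return true;
    // Otherwise fall back to a case-insensitive search of the title.
    const lowered = (title ?? '').toLowerCase();
    return BLOCK_TITLE_HINTS.some((hint) => lowered.includes(hint));
}

// Possible use inside requestHandler({ page, response, log }), where
// response is the Playwright response for the navigation:
//     const status = response ? response.status() : 0;
//     if (looksBlocked(status, await page.title())) {
//         log.warning(`Looks blocked (status ${status})`);
//     }
```

Logging this next to the title would at least show whether the 403 comes back on the first navigation or only after the device check.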

Thank you!
