wet-aqua
Apify & Crawlee · 4y ago

Getting puppeteer-har and autoconsent to work with puppeteer crawler

Hi guys,
I am totally new to Crawlee, so this may or may not be an easy question.

I want to collect all the cookies and third-party trackers or resources loaded by our website and monitor them for changes. The changes are made by a website agency, and I want to be sure we stay compliant with privacy regulations.

So I thought it would be a good idea to use DuckDuckGo's autoconsent to first consent to all cookies. Then I want to list all connections that are made, e.g. to Google Fonts or CDNs. For this I thought of using puppeteer-har.

I originally did this with vanilla Puppeteer and it worked, but I need a crawler to follow all the links on our website, which is how I stumbled upon Crawlee. I tried to put my original script inside the requestHandler, but the resulting results.har file is empty:
{"log":{"version":"1.2","creator":{"name":"chrome-har","version":"0.11.12","comment":"https://github.com/sitespeedio/chrome-har"},"pages":[],"entries":[]}}

I guess this is because page.goto has already been invoked before puppeteer-har is initialized. So I need to build something like this:

const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Start recording before navigating, so every request is captured.
  const har = new PuppeteerHar(page);
  await har.start({ path: 'results.har' });

  await page.goto('http://example.com');

  // Stop recording; this writes results.har.
  await har.stop();
  await browser.close();
})();

but with PuppeteerCrawler.

If I am totally lost and all of this can be done much more easily, just tell me.
Thanks for your time and your answers!