grumpy-cyan•3y ago
Getting puppeteer-har and autoconsent to work with puppeteer crawler
Hi guys,
I am totally new to Crawlee, so this might or might not be an easy question.
I want to collect all the cookies and third-party trackers or resources from our website and monitor any changes. The changes are made by a web agency, and I want to be sure we stay compliant with privacy regulations.
So I thought it would be a good idea to use DuckDuckGo's autoconsent to first consent to all cookies. Then I want to list all connections that are made, e.g. to Google Fonts or CDNs. For this I thought of using puppeteer-har.
I originally did this with vanilla Puppeteer and it worked, but I need a crawler to get all the links on our website. So I stumbled upon Crawlee. I tried to put my original script inside the requestHandler, but the resulting result.har file is empty:
{"log":{"version":"1.2","creator":{"name":"chrome-har","version":"0.11.12","comment":"https://github.com/sitespeedio/chrome-har"},"pages":[],"entries":[]}}
I guess this is because page.goto has already been invoked before puppeteer-har is initialized. So I need to build something like this:
const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const har = new PuppeteerHar(page);
  await har.start({ path: 'results.har' });
  await page.goto('http://example.com');
  await har.stop();
  await browser.close();
})();
with PuppeteerCrawler.
If I am totally lost and all of this can be done much easier, just tell me.
Thanks for your time and your answers!
1 Reply
You need to start the collection in preNavigationHooks and stop it in requestHandler.
You need to connect these two, so I recommend just having a map object between request.uniqueKey and the initialized har object.
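A minimal sketch of that suggestion, assuming Crawlee v3's PuppeteerCrawler and the puppeteer-har calls from the question (new PuppeteerHar(page), har.start({ path }), har.stop()). The helper names makeHarHooks, harFileName and runCrawler are made up for illustration, and writing one HAR file per request named after the sanitized uniqueKey is an assumption, not something the reply specifies:

```javascript
// Hypothetical helper: turn a request's uniqueKey into a filesystem-safe
// file name (assumed naming scheme, one HAR file per request).
function harFileName(uniqueKey) {
  return uniqueKey.replace(/[^a-z0-9]+/gi, '_').slice(0, 100) + '.har';
}

// Hypothetical helper: builds the two callbacks and the map that links them.
// HarClass is expected to behave like puppeteer-har's PuppeteerHar.
function makeHarHooks(HarClass) {
  // request.uniqueKey -> initialized har object
  const harByRequest = new Map();

  // Pre-navigation hook: runs before the crawler navigates,
  // so the recording starts early enough to see every request.
  async function startHar({ page, request }) {
    const har = new HarClass(page);
    harByRequest.set(request.uniqueKey, har);
    await har.start({ path: harFileName(request.uniqueKey) });
  }

  // Called from requestHandler: look up the har for this request and stop it.
  async function stopHar({ request }) {
    const har = harByRequest.get(request.uniqueKey);
    if (!har) return;
    harByRequest.delete(request.uniqueKey);
    await har.stop();
  }

  return { startHar, stopHar, harByRequest };
}

// Wires the hooks into a crawler. The packages are required lazily so the
// helpers above can be used without them being installed.
async function runCrawler(startUrls) {
  const { PuppeteerCrawler } = require('crawlee');
  const PuppeteerHar = require('puppeteer-har');
  const { startHar, stopHar } = makeHarHooks(PuppeteerHar);

  const crawler = new PuppeteerCrawler({
    preNavigationHooks: [startHar],
    async requestHandler(context) {
      // Consent handling or extra waiting would go here, before stopping.
      await stopHar(context);
      await context.enqueueLinks();
    },
  });

  await crawler.run(startUrls);
}
```

Calling runCrawler(['http://example.com']) should then write one .har file per crawled page. If a navigation fails, requestHandler never runs for that attempt, so the entry stays in the map until a retry overwrites it; cleaning that up in a failedRequestHandler is probably worthwhile.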