grumpy-cyan•3y ago

Getting puppeteer-har and autoconsent to work with puppeteer crawler

Hi guys, I am totally new to Crawlee, so this might or might not be an easy question. I want to collect all the cookies and third-party trackers or resources from our website and monitor any changes. The changes are made by a web agency, and I want to be sure we stay compliant with privacy regulations. So I thought it would be a good idea to use DuckDuckGo's autoconsent to first consent to all cookies. Then I want to list all connections that are made, e.g. to Google Fonts or CDNs. For this I thought of using puppeteer-har. I originally did this with vanilla Puppeteer and it worked, but I need a crawler to get all the links on our website, which is how I stumbled upon Crawlee. I tried to put my original script inside the requestHandler, but the result.har file is empty:

{"log":{"version":"1.2","creator":{"name":"chrome-har","version":"0.11.12","comment":"https://github.com/sitespeedio/chrome-har"},"pages":[],"entries":[]}}

I guess this is because page.goto has already been invoked before puppeteer-har is initialized. So I need to build something like this:

const puppeteer = require('puppeteer');
const PuppeteerHar = require('puppeteer-har');
 
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
 
  const har = new PuppeteerHar(page);
  await har.start({ path: 'results.har' });
 
  await page.goto('http://example.com');
 
  await har.stop();
  await browser.close();
})();

with PuppeteerCrawler. If I am totally lost and all of this can be done much more easily, just tell me. Thanks for your time and your answers!

1 Reply

Lukas Krivka•3y ago

You need to start the collection in preNavigationHooks and stop it in requestHandler. You need to connect these two, so I recommend keeping a map object from request.uniqueKey to the initialized har object.
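
A minimal sketch of that suggestion, assuming Crawlee's PuppeteerCrawler and the same puppeteer-har calls used in the question. The names createHarHooks, harPathFor and main are hypothetical helpers made up for this example, and the autoconsent step is left as a comment because its setup is not shown in this thread:

```javascript
// Builds a matching preNavigationHook / requestHandler pair.
// createHar and harPathFor are injected so the pairing logic stays
// independent of the concrete HAR library (hypothetical helper).
function createHarHooks(createHar, harPathFor) {
  // request.uniqueKey -> HAR recorder that has already been started
  const recorders = new Map();

  // Runs before Crawlee calls page.goto(), so the recorder sees the navigation.
  async function preNavigationHook({ page, request }) {
    const har = createHar(page);
    await har.start({ path: harPathFor(request) });
    recorders.set(request.uniqueKey, har);
  }

  // Runs after navigation: do the page work, then stop and write the HAR.
  async function requestHandler({ request, enqueueLinks }) {
    const har = recorders.get(request.uniqueKey);
    try {
      // Assumption: consent handling (e.g. autoconsent) would go here.
      await enqueueLinks();
    } finally {
      recorders.delete(request.uniqueKey);
      if (har) await har.stop();
    }
  }

  return { preNavigationHook, requestHandler, recorders };
}

// Wiring with the real libraries; required lazily so the helper above
// can be loaded without them installed.
async function main() {
  const { PuppeteerCrawler } = require('crawlee');
  const PuppeteerHar = require('puppeteer-har');
  const { createHash } = require('node:crypto');

  const { preNavigationHook, requestHandler } = createHarHooks(
    (page) => new PuppeteerHar(page),
    // One HAR file per request; hashing keeps the file name filesystem-safe.
    (request) =>
      `har-${createHash('sha1').update(request.uniqueKey).digest('hex')}.har`,
  );

  const crawler = new PuppeteerCrawler({
    preNavigationHooks: [preNavigationHook],
    requestHandler,
  });
  await crawler.run(['http://example.com']);
}

// Call main() to start the crawl.
```

This sketch does not clean up recorders for requests whose navigation fails, since requestHandler never runs for them. One option would be to stop and remove the recorder in the crawler's failedRequestHandler as well.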
