optimistic-gold
optimistic-gold•2y ago

Add certificates to Playwright crawler using Chromium

Hey folks, we are trying to integrate a proxy into our crawlers, and the issue is that the proxy needs a certificate to be present before it will let us authenticate. I couldn't find any option for this in the documentation. Is there a way I can add those certs in Crawlee/Playwright? Or, if Crawlee exposes agentOptions from Playwright anywhere (I couldn't find it in the docs), that would also work, as per https://github.com/microsoft/playwright/issues/1799#issuecomment-959011162
GitHub
[Feature] allow client certificate selection and settings from Java...
Similarly to puppeteer/puppeteer#540 Currently when navigating to a page that requires client certificates and client certificates are available a popup is shown in Firefox and Chrome which asks to...
13 Replies
optimistic-gold
optimistic-goldOP•2y ago
P.S. I have added that certificate on my server and curl is working fine, but Crawlee is not, so I'm assuming Crawlee is not picking it up. The error I'm getting is page.goto: net::ERR_PROXY_CONNECTION_FAILED at
Pepa J
Pepa J•2y ago
Hi @AltairSama2, Crawlee uses Playwright under the hood, so you should be able to intercept requests in the usual way. I found an example for Playwright itself ( https://github.com/microsoft/playwright/issues/1799#issuecomment-959011162 ). Can you put together a minimal working example using only Playwright (without Crawlee) to confirm that the issue is in Crawlee and not in Playwright itself? I found a lot of issues regarding the use of certificates in Playwright.
optimistic-gold
optimistic-goldOP•2y ago
Hey, when I use Playwright, it gives me a cert invalid error, which I can bypass by using
launchOptions: {
    args: ['--ignore-certificate-errors'],
},
But with Crawlee it's not working.
I think I got it wrong: I don't need to use the proxy's certificate with Playwright/Crawlee, it's just a proxy config issue.
page.goto: net::ERR_PROXY_CONNECTION_FAILED
Here's the full error.
We bypassed it by avoiding the cert route, and it's working fine for us.
Pepa J
Pepa J•2y ago
Does the same proxy configuration work for other websites?
optimistic-gold
optimistic-goldOP•2y ago
Not with Crawlee, but with Playwright, yeah. It worked with Crawlee once we were fully onboarded with the proxy provider, and we didn't need to use their cert.
Pepa J
Pepa J•2y ago
@AltairSama2 Can you please provide a code snippet with your current configuration for Crawlee?
optimistic-gold
optimistic-goldOP•2y ago
chromium.use(stealthPlugin());
let queue = await RequestQueue.open('crawler');
await queue.drop();
queue = await RequestQueue.open('crawler');
const startUrls = [`url`];
const router = await initRouter({ resume, numPages, initialPage });
const crawler = new PlaywrightCrawler({
    requestHandler: router,
    maxRequestsPerMinute: 100,
    log: new Log({
        logger: new CrawlerLogger(log.getOptions(), 'CRAWLER_1'), // please ignore, custom logger imp
        level: log.LEVELS.DEBUG,
    }),
    requestQueue: queue,
    launchContext: {
        launcher: chromium,
        launchOptions: {
            args: ['--ignore-certificate-errors'],
        },
    },
    ...(useProxy && {
        proxyConfiguration: new ProxyConfiguration({
            proxyUrls: [
                proxy,
            ],
        }),
        useSessionPool: true,
        persistCookiesPerSession: true,
    }),
});
await crawler.run(startUrls, {});
Pepa J
Pepa J•2y ago
@AltairSama2 Thank you for your feedback, I am currently investigating this with the Crawlee developer team. Would it be possible to also provide us with the pure Playwright solution code that is currently working for you? Is the certificate taken from the system, or are you importing it at the application level?
optimistic-gold
optimistic-goldOP•2y ago
It's taken from the system.
optimistic-gold
optimistic-goldOP•2y ago
I was trying to figure out how to do it at the app level but couldn't make it work; in the end, the system level worked fine. Here's the pure Playwright code:
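// ES module, so top-level await is available
import { chromium } from 'playwright';

const browser = await chromium.launch({
    proxy: {
        server: 'proxy_url',
        username: 'username',
        password: 'pwd',
    },
    // Skips certificate validation for the whole browser
    args: ['--ignore-certificate-errors'],
});

const page = await browser.newPage();

await page.goto('https://google.com');
const html = await page.innerHTML('body');
console.log(html);

await browser.close();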
I think it was an issue on our end, because after full account activation with the proxy provider it worked just fine. The only issue we are currently facing is that a lot of our requests are failing with the proxy, but that's unrelated to this and is probably a config issue.
Pepa J
Pepa J•2y ago
@AltairSama2 You should be able to replicate this setup in Crawlee:
const crawler = new PlaywrightCrawler({
    // ... ,
    launchContext: {
        launchOptions: {
            proxy: {
                server: 'http://proxy_url',
                username: 'username',
                password: 'password',
            },
            args: ['--ignore-certificate-errors'],
        },
    },
});
and drop the proxyConfiguration attribute. Please let me know if it helped 🙂
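If the proxy is currently stored as a single URL with embedded credentials (as a proxyUrls entry would be), it has to be split into the server / username / password fields used above. A minimal sketch of that, assuming a hypothetical helper named toPlaywrightProxy (not part of Crawlee or Playwright) and a placeholder proxy host:

```javascript
// Hypothetical helper, not a Crawlee or Playwright API: splits a proxy URL with
// embedded credentials into the { server, username, password } shape used in
// launchOptions.proxy.
function toPlaywrightProxy(proxyUrl) {
    const url = new URL(proxyUrl);
    const proxy = { server: `${url.protocol}//${url.host}` };
    // URL keeps credentials percent-encoded, so decode them before passing them on.
    if (url.username) proxy.username = decodeURIComponent(url.username);
    if (url.password) proxy.password = decodeURIComponent(url.password);
    return proxy;
}

// Placeholder proxy URL; substitute the one from your provider.
const launchOptions = {
    proxy: toPlaywrightProxy('http://username:password@proxy.example.com:8000'),
    args: ['--ignore-certificate-errors'],
};

console.log(launchOptions.proxy);
```

The resulting launchOptions object can then be passed as launchContext.launchOptions in the PlaywrightCrawler options. Keep in mind that --ignore-certificate-errors turns off certificate validation for every site the browser visits, so it is probably best limited to debugging the proxy setup.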
optimistic-gold
optimistic-goldOP•2y ago
Hey, thanks! Really appreciate it. I can't repro the original issue because we are not relying on the certs anymore, but this method is also working for us.
