variable-lime (11mo ago)

Which browser is best for crawling?

As the title says: I'm currently using Chromium, but it is CPU-heavy. Killing the browser does not kill the process, so it's easy to hit 100% CPU usage pretty quickly. (I'm crawling thousands of websites, looking for different data on each.) I already tried loading pure HTML without CSS, images and other assets. That helped a lot, but the issue is still there.
4 Replies
Lukas Celnar (11mo ago)
Hi @Wojciech, I also recommend blocking unnecessary network requests with blockRequests. Make sure you are running the browser in headless mode. You could also try Cheerio if the use case allows it. Regarding your question about the browser: Firefox tends to be lighter on CPU usage.
Using Firefox browser with Playwright crawler | Crawlee · Build rel...
PlaywrightCrawlingContext | API | Crawlee · Build reliable crawlers...
variable-lime (OP, 11mo ago)
Yes, I already do that:
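    import { firefox } from 'playwright'
    import { playwrightUtils, type PlaywrightLaunchContext } from 'crawlee'

    const launchContext: PlaywrightLaunchContext = {
        launcher: firefox,
        launchOptions: {
            headless: false, // false opens a visible browser window; true would run headless
            // Chromium-style command-line switches, kept as posted
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                '--disable-dev-shm-usage',
            ],
        },
        useChrome: false, // Use Chromium instead of Chrome for better performance
        // Pick a random user agent from the userAgents array (defined elsewhere)
        userAgent: userAgents[Math.floor(Math.random() * userAgents.length)],
    }
    ...
    launchContext,
    preNavigationHooks: [
        async ({ page }) => {
            // Block images, fonts and the ads script before each navigation
            await playwrightUtils.blockRequests(page, {
                urlPatterns: [
                    '.png',
                    '.jpg',
                    '.jpeg',
                    '.gif',
                    '.svg',
                    '.ico',
                    '.woff',
                    '.woff2',
                    'adsbygoogle.js',
                ],
                extraUrlPatterns: ['adsbygoogle.js'],
            })

            // Dismiss cookie consent dialogs
            await playwrightUtils.closeCookieModals(page)
        },
    ],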
Unfortunately I receive: WARN Playwright Utils: blockRequests() helper is incompatible with non-Chromium browsers. I didn't know that 😄
Oleg V. (10mo ago)
You can block requests manually (I mean, without using the util function). Example:
    const BLOCKED = ['image', 'stylesheet', 'media', 'font', 'other'];

Then, within the preNavigationHooks of your crawler, add this function:

    async ({ page }) => {
        // Abort every request whose resource type is in BLOCKED; let the rest through
        await page.route('**/*', (route) => {
            if (BLOCKED.includes(route.request().resourceType())) return route.abort();
            return route.continue();
        });
    },
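
For anyone who wants to keep the URL patterns from the original post as well, here is a rough sketch that combines them with the resource-type check above. Everything named here (`shouldBlock`, `handleRoute`, `blockHeavyRequests`, `RouteLike`, `PageLike`) is made up for illustration and is not part of Crawlee or Playwright. The two interfaces only mirror the handful of Playwright methods the snippet touches, and the plain case-insensitive substring match is a simplification that may not behave exactly like `blockRequests()` does on Chromium.

```typescript
// Resource types to abort, taken from the reply above.
const BLOCKED_TYPES: readonly string[] = ['image', 'stylesheet', 'media', 'font', 'other'];

// URL fragments to abort, taken from the urlPatterns list in the original post.
const BLOCKED_URL_PARTS: readonly string[] = [
    '.png', '.jpg', '.jpeg', '.gif', '.svg', '.ico', '.woff', '.woff2', 'adsbygoogle.js',
];

// Hypothetical helper: true when a request should be aborted.
// Assumption: a case-insensitive substring match is good enough for these patterns.
function shouldBlock(url: string, resourceType: string): boolean {
    if (BLOCKED_TYPES.includes(resourceType)) return true;
    const lowered = url.toLowerCase();
    return BLOCKED_URL_PARTS.some((part) => lowered.includes(part));
}

// Structural stand-ins for the parts of Playwright's Route and Page used below
// (hypothetical names, so the sketch compiles without importing Playwright).
interface RouteLike {
    request(): { url(): string; resourceType(): string };
    abort(): Promise<void>;
    continue(): Promise<void>;
}
interface PageLike {
    route(pattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Abort blocked requests, let everything else through.
async function handleRoute(route: RouteLike): Promise<void> {
    const request = route.request();
    if (shouldBlock(request.url(), request.resourceType())) {
        await route.abort();
        return;
    }
    await route.continue();
}

// Register the handler for every URL on the given page.
async function blockHeavyRequests(page: PageLike): Promise<void> {
    await page.route('**/*', handleRoute);
}
```

Inside preNavigationHooks this would presumably be called as `async ({ page }) => blockHeavyRequests(page)`. Keeping the decision in a plain function like `shouldBlock` makes it easy to check the block list without starting a browser.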
