I am crawling wellfound.com and see this error:

```
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}
```
My crawler config includes:

```js
useSessionPool: true,
persistCookiesPerSession: true,
sessionPoolOptions: {
    maxPoolSize: 300,
    sessionOptions: {
        maxAgeSecs: 70,
        maxUsageCount: 2,
    },
},
launchContext: {
    ...
    launchOptions: {
        bypassCSP: true,
        acceptDownloads: true,
    },
},
```
I tried it with chromium.use(stealthPlugin()) and without it.

```js
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },
    launchContext: { launcher: firefox },
});
```
I also tried useFingerprints: true, useFingerprintCache: false, launcher: firefox.

And I injected pluginContent (a string) taken from here: https://discord.com/channels/801163717915574323/1059483872271798333

```js
preNavigationHooks: [
    async ({ page, request }) => {
        await page.addInitScript({ content: pluginContent });
    },
],
```
Versions: crawlee/core 3.3.1, playwright 1.33.0, npm 8.19.3, node 16.19.0
```
npm update playwright
npm update crawlee
```
~/.cache/ms-playwright/firefox-1403/
```js
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

firefox.use(stealthPlugin());
```
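Wired into the crawler, that looks roughly like this (a sketch, reusing the firefox object from the snippet above; the handler body is just a placeholder):

```js
import { PlaywrightCrawler } from 'crawlee';
// `firefox` here is the playwright-extra firefox with the stealth plugin
// registered, as in the snippet above

const crawler = new PlaywrightCrawler({
    browserPoolOptions: { useFingerprints: true },
    launchContext: { launcher: firefox },
    async requestHandler({ page, request, log }) {
        log.info(`Loaded ${request.url}`);
    },
});
```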
With useFingerprints: true and launcher: firefox in the code, I get:

```
INFO PlaywrightCrawler: Starting the crawler.
An error occured while executing "onPageCreated" in plugin "stealth/evasions/user-agent-override": TypeError: Cannot read properties of undefined (reading 'userAgent')
    at Proxy.<anonymous> (.../node_modules/playwright-extra/src/puppeteer-compatiblity-shim/index.ts:217:23)
    at runNextTicks (node:internal/process/task_queues:61:5)
    at processImmediate (node:internal/timers:437:9)
    at process.topLevelDomainCallback (node:domain:161:15)
    at process.callbackTrampoline (node:internal/async_hooks:128:24)
    at async Plugin.onPageCreated (.../node_modules/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js:69:8)
```
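A workaround that is sometimes suggested for exactly this evasion (not verified against this setup) is to drop it from the plugin before registering it; puppeteer-extra-plugin-stealth exposes the set of enabled evasions:

```js
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';

const stealth = stealthPlugin();
// the crash above happens inside the user-agent-override evasion,
// so remove just that one and keep the rest
stealth.enabledEvasions.delete('user-agent-override');
firefox.use(stealth);
```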
retireBrowserAfterPageCount=2 in browserPoolOptions: this gives a unique fingerprint every two requests, which... isn't perfect (and starting a new browser instance so often looks strange).

About retryOnBlocked (there is a small sketch after the list below), the docs say: If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
- Cloudflare Bot Management
- Google Search Rate Limiting
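A minimal sketch of enabling the option quoted above (the maxRequestRetries: 0 value only mirrors the setup being asked about below; it is not a recommendation):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // from the docs quoted above: try to bypass detected bot protection automatically
    retryOnBlocked: true,
    // kept at 0 to mirror the question below; whether retryOnBlocked can still
    // retry a blocked request in that case is exactly the open question
    maxRequestRetries: 0,
    async requestHandler({ page, request, log }) {
        log.info(`Processing ${request.url}`);
    },
});
```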
I use maxRequestRetries=0 - is it OK to use retryOnBlocked in such case?

```js
content = await page.content();
```

fails with:

```
page.content: Target page, context or browser has been closed
    at (<somewhere-in-my-code>.js:170:54)
    at PlaywrightCrawler.requestHandler (<somewhere-in-my-code>.js:596:15)
    at async wrap (.../node_modules/@apify/timeout/index.js:52:21)
```

Is it OK to call page.content() here?
I also tried:
- useSessionPool: false and persistCookiesPerSession: false
- the same browser in launcher and in fingerprintGeneratorOptions browsers
- fingerprintGeneratorOptions with devices: ['desktop']
- launchContext: { useIncognitoPages: true }
- preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333

```
INFO Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,
```
```js
... new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: null,
```
https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato
```html
<head> ... <meta name="captcha-challenge" content="1"> ...
```
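If the goal is to keep those third-party requests out entirely, a sketch using the blockRequests helper from the crawling context (the patterns are just the hosts listed above; the Firefox/incognito limitations discussed further down still apply):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // block the trackers/fonts listed above on top of the default patterns
            await blockRequests({
                extraUrlPatterns: [
                    'googletagmanager.com',
                    'connect.facebook.net',
                    'google-analytics.com',
                    'fonts.googleapis.com',
                ],
            });
        },
    ],
    async requestHandler({ page }) { /* ... */ },
});
```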
```js
const crawler = new PlaywrightCrawler({
    ...
    async failedRequestHandler({ request, response, page, log }, error) {
        ...
    },
});
```

```
ERROR failedRequestHandler: Request failed and reached maximum retries. page.goto: SSL_ERROR_BAD_CERT_DOMAIN
```
I tried to dump the error argument of the failedRequestHandler with JSON.stringify(error), but it gives only:

```
{"name":"Error"}
```

How can I get more details from the error argument?
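message and stack are non-enumerable on Error objects, which is why JSON.stringify(error) collapses like that; a sketch of logging them explicitly instead:

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page }) { /* ... */ },
    async failedRequestHandler({ request, log }, error) {
        // message and stack are non-enumerable, so JSON.stringify(error) drops them;
        // log them directly instead
        log.error(`${request.url} failed: ${error.message}`);
        if (error.stack) log.error(error.stack);
    },
});
```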
In the requestHandler below I also get:

```
mouse.move: Target page, context or browser has been closed
```
```js
async requestHandler({ request, response, page, enqueueLinks, log, proxyInfo }) {
    ...
    await sleep(interval);
    await page.mouse.move(rnd(100, 400), rnd(40, 300));
    await sleep(interval);
    ...
    content = await page.content();
}
```
```
page.content: Target page, context or browser has been closed
ERROR requestHandler: Request failed and reached maximum retries. page.goto: Navigation failed because page was closed!
```
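This doesn't explain why the page gets closed under the handler (the short maxAgeSecs / maxUsageCount / retireBrowserAfterPageCount settings above are possible suspects), but a guard like the following sketch at least avoids the throw from mouse.move / page.content:

```js
import { PlaywrightCrawler, sleep } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, log }) {
        await sleep(1000);
        // the pool may have closed the page underneath us; bail out cleanly
        if (page.isClosed()) {
            log.warning(`Page already closed for ${request.url}`);
            return;
        }
        const content = await page.content();
        log.info(`Got ${content.length} characters from ${request.url}`);
    },
});
```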
preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks

blockRequests, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests

blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:

There is also the experimentalContainers thing - in theory it should allow both cache and incognito.
```js
const crawler = new PlaywrightCrawler({
    ...
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            locales: [ ... ],
            ...
```
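For concreteness, a filled-in version of those generator options could look like this (the locale values are placeholders, not the ones actually used above):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
                devices: ['desktop'],
                // placeholder values - the real list is elided above
                locales: ['en-US', 'en'],
            },
        },
    },
    async requestHandler({ page }) { /* ... */ },
});
```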