ThePhantom

Playwright newContext() in incognito mode

Hey,

I'm facing this issue:
Error: Function newContext() is not available in incognito mode
at PlaywrightBrowser.newContext (xxxxx\node_modules\@crawlee\browser-pool\playwright\playwright-browser.js:69:15)

Here's the code that triggers it:

const browser = await launchPlaywright();

try {
    const context = await browser.newContext({
    ...

As per Playwright docs:

Playwright allows creating "incognito" browser contexts with browser.newContext() method. "Incognito" browser contexts don't write any browsing data to disk.

ref: https://playwright.dev/docs/api/class-browsercontext

Why would it not be allowed in Crawlee if Playwright supports it?

1 comment

ThePhantom

Catch and solve captchas

Hey,

I'm running into captchas, reCaptcha V2 in this case. Once I solve the captcha, the site 'marks' me as 'safe' and I can continue scraping. What I'm wondering is how to detect the captcha programmatically, solve it, and send back the required response. That way I could run on a server and 'whitelist' its IP, or do the same for proxies (it keeps throwing captchas on proxies too!).

Just not sure how to do all this in code.
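For what it's worth, here is a rough sketch of the first half of that flow: finding the reCAPTCHA v2 site key in the page HTML and handing it to a solver. Both `extractSiteKey` and `getCaptchaToken` are made-up helper names, and `solve` is a placeholder for whatever solving service or library ends up being used; none of this is a Crawlee API.

```javascript
// Hypothetical helper: pull the reCAPTCHA v2 site key out of raw page HTML.
// reCAPTCHA v2 widgets normally carry it in a `data-sitekey` attribute.
function extractSiteKey(html) {
    const match = html.match(/data-sitekey=["']([^"']+)["']/);
    return match ? match[1] : null;
}

// Hypothetical helper: `solve` is an assumed callback that takes
// { siteKey, pageUrl } and resolves to a response token.
async function getCaptchaToken(html, pageUrl, solve) {
    const siteKey = extractSiteKey(html);
    if (!siteKey) return null; // no captcha widget found on this page
    return solve({ siteKey, pageUrl });
}
```

The token returned by the solver is typically written into the hidden `g-recaptcha-response` field before the form is submitted; how the site validates it afterwards differs per site, so that part is left out here.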

7 comments

ThePhantom

Failed requests - Session closed 'without receiving a SETTINGS frame' or 'NGHTTP2_REFUSED_STREAM'

Hey,

I'm playing around with a CheerioCrawler and I've noticed requests failing with the errors in the title. I'm wondering whether it has something to do with my setup (pretty straightforward and tested before without issues), the source (I had no issues with it before either), or something else that I'm missing.
Has anyone faced one or both of these errors before?

2 comments

ThePhantom

enqueueLinks with a selector doesn't work?

I'm trying to grab the next page link from: https://www.haskovo.net/news with:

await enqueueLinks({
    selector: '.pagination li:last-child > a',
    label: 'LIST',
});

But it doesn't work. I've checked this selector (and others) in DevTools and it matches the element fine.

What am I missing?

PS: I'm just messing around, trying to get a grasp of things. I'm aware that I can grab the whole thing with Cheerio, but I want a 'proof of concept' with PlaywrightCrawler.
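One thing that might be worth checking, purely as a guess: by default enqueueLinks only keeps links whose hostname matches the current page (the 'same-hostname' strategy), so a selector can match in DevTools while the link still gets filtered out. The snippet below is not Crawlee's code, just a hypothetical `sameHostname` helper showing how such a check treats `www.haskovo.net` versus `haskovo.net`; the `?page=2` hrefs are invented for illustration.

```javascript
// Hypothetical helper (not part of Crawlee): a strict same-hostname check.
function sameHostname(pageUrl, href) {
    // Resolve relative hrefs against the page URL, as a browser would.
    const target = new URL(href, pageUrl);
    return target.hostname === new URL(pageUrl).hostname;
}

const pageUrl = 'https://www.haskovo.net/news';

// Relative link: resolves to the same host.
console.log(sameHostname(pageUrl, '/news?page=2')); // true

// Absolute link without "www": a different hostname.
console.log(sameHostname(pageUrl, 'https://haskovo.net/news?page=2')); // false
```

If the pagination hrefs do point at a different hostname, passing `strategy: 'same-domain'` to enqueueLinks may be enough; if they don't, the cause is somewhere else.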

9 comments

ThePhantom

Proxy fails on SSL secured(httpS) websites

Hey!

I'm trying different proxy providers and I've noticed the issue in the title.

I'm setting the proxy in proxyUrls in the following format:

http://user:pass@host:port

as I usually do. But with the providers I'm currently testing, the request fails with either a 407 (Proxy Authentication Required) or a 422 response.

Strangely enough, if I try it with

curl -x 'proxy string from the same providers, in the same format' https://example.com

it works.

Any idea what could be causing it?
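One possible cause, and this is only an assumption: if the username or password contains characters such as `@`, `:` or `/`, a URL parser may split the credentials differently than curl does, which can surface as a 407. A minimal sketch of percent-encoding the credentials before building the proxy URL follows; `buildProxyUrl` is a made-up helper and the host, port, and password are placeholders.

```javascript
// Hypothetical helper: build a proxy URL with percent-encoded credentials.
function buildProxyUrl({ username, password, host, port }) {
    const user = encodeURIComponent(username);
    const pass = encodeURIComponent(password);
    return `http://${user}:${pass}@${host}:${port}`;
}

// Placeholder values, chosen so the password contains reserved characters.
const proxyUrl = buildProxyUrl({
    username: 'user',
    password: 'p@ss:word',
    host: 'proxy.example.com',
    port: 8000,
});
console.log(proxyUrl); // http://user:p%40ss%3Aword@proxy.example.com:8000

// Parsing it back shows that the credentials survive intact.
const parsed = new URL(proxyUrl);
console.log(decodeURIComponent(parsed.password)); // p@ss:word
```

The resulting string can go into proxyUrls as usual. If the credentials contain no special characters, this changes nothing and the problem lies elsewhere.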

10 comments

ThePhantom

Resume after crash

Hey,

I've had a Cheerio crawler running for a couple of hours, but it crashed. I'm wondering if it's possible to resume the crawl from the place it stopped at. I can see there are some files left in the key_value_stores dir:

3 comments

Apify Discord Mirror
