Apify Discord Mirror

Home
Members
new_in_town
n
new_in_town
Offline, last seen 3 months ago
Joined August 30, 2024
I migrated my scraping code from Crawlee to Hero (see https://github.com/ulixee/hero). It works. Everything that worked with Crawlee works with Hero.

Why I migrated: I could no longer deal with the over-engineered Crawlee API (and the bugs related to it).
It was simply too many APIs (different APIs!) for my simple use case.
Hero's API is about five times simpler.

In both cases (Crawlee and Hero) I am using only the scraping library itself: no additional (cloud) services, no Docker containers.

I am not manipulating the DOM, not doing retries, not doing anything complex in TypeScript. I just access the URL (in some cases URL1 and then URL2, to pretend I'm a normal user), grab the rendered HTML, and that's it. All the HTML manipulation (extracting data from the HTML) is done in a completely different program (written in a different programming language, not TypeScript).
Retry logic: again, this is implemented in that other program.

I use the beanstalkd message queue (see https://github.com/beanstalkd/beanstalkd/) between that "different program" and the scraper. So I just replaced the Crawlee-based scraper with a Hero-based scraper without touching other parts of the system. Usage of beanstalkd has already been discussed in this forum: use search to find those discussions.

Goodbye Crawlee.
3 comments
J
S
M
I immediately get a captcha on every URL.
Accessing it in a normal GUI browser by typing the site's homepage URL: captcha.
Searching for this site on Google and clicking the link in the results: the browser shows the site address and... captcha.

(By the way, they changed it; a few months ago this site was not that restrictive.)

Well... what is our best solution for sites that always show Cloudflare captchas?
2 comments
P
O
I am using PlaywrightCrawler with Firefox. When accessing wellfound.com I see this error:

Plain Text
DEBUG PlaywrightCrawler:SessionPool:Session: Could not set cookies. {"errorMessages":["Cookie not in this host's domain. Cookie:prod.website-files.com Request:wellfound.com"]}

It might be that this cookie is important: I navigate to another page on this site and get HTTP 403 and a captcha...
How can I fix this error?

I have these settings in the code:
Plain Text
useSessionPool: true,
persistCookiesPerSession: true,
sessionPoolOptions: {
    maxPoolSize: 300,
    sessionOptions: {
        maxAgeSecs: 70,
        maxUsageCount: 2,
    },
},

launchContext: {
    ...
    launchOptions: {
        bypassCSP: true,
        acceptDownloads: true,
1 comment
P
I have a program (Playwright + Crawlee + Firefox + rotating proxies) used to scrape jobs from wellfound.com. In May 2024 (and earlier) it worked quite well for many months, despite the captcha protection on the site.

Today I get HTTP 403 and a captcha (from ct.captcha-delivery.com). My code has not changed!

Proxies: iproyal.com "residential proxies", session time 1 min ("sticky session"). What I did: in the same session, accessed URL1 and then URL2. URL1 has no captcha; URL2 contains the info I need and is/was protected with a captcha. In the past the trick with "URL1 and then URL2 in the same session" worked well. Today I get a captcha when accessing URL2.

What I tried: switched between Chrome and Firefox in my code. For Chrome tried with chromium.use(stealthPlugin()) and without it.

I still see that captcha. I tried accessing the site with a normal GUI browser (Firefox) through an iproyal.com "sticky session", accessing URL1 and then URL2: no captcha.
It means the proxies are still OK; they are not detected!

Bottom line: something changed; bot detection has improved.
What is our answer?

Is it something similar to this: https://discord.com/channels/801163717915574323/1293244368249032895/1293244368249032895
@Jeno, what solution did you find?
1 comment
n
In PlaywrightCrawler's requestHandler() I can access log because it is an argument of requestHandler().
How can I access log (or something similar) in other places?

Example:
I want to log something before crawler.run();

(Well, console.log works, but I would like to control the log level in one place...)
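A minimal sketch of one way to do this: Crawlee exports a global log singleton and a LogLevel enum (documented in the Crawlee logging docs), so the logger can be imported anywhere and the level configured in one place, before crawler.run():

```typescript
import { log, LogLevel } from 'crawlee';

// Configure the level once, in one place, before crawler.run():
log.setLevel(LogLevel.DEBUG);

// Then use the same logger anywhere outside requestHandler:
log.info('About to start the crawler...');
log.debug('Detail that only appears at DEBUG level.');
```

The log argument passed into requestHandler is derived from this same logging setup, so the level set here applies there as well.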
2 comments
A
A
I'm running this simple program on a server in a German datacenter with IP 167.235...
The program uses US residential proxies (rotating every 1 min).

And I see that pixelscan.net is able to detect my original IP: 167.235...
On the attached screenshot you can find it under "WebRTC address".

So how do I avoid this?

P.S.:
Another problem I see is "Plugins Length"; it is discussed here: https://discord.com/channels/801163717915574323/1059483872271798333
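The leak happens because WebRTC's ICE negotiation can expose the machine's real interface address, bypassing the proxy entirely. A common mitigation with Firefox is to disable WebRTC via a user preference; a sketch (firefoxUserPrefs is Playwright's standard Firefox launch option, and Crawlee passes launchOptions through to the launcher):

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            firefoxUserPrefs: {
                // Disabling WebRTC prevents ICE candidates from leaking
                // the server's real IP past the proxy.
                'media.peerconnection.enabled': false,
            },
        },
    },
    // ...requestHandler etc.
});
```

Note the trade-off: a fully disabled WebRTC is itself a detectable signal on some fingerprinting sites, so test against pixelscan.net after the change.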
19 comments
n
L
P
p
L
I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases these sites complain about "0 plugins" or "Plugins length".

If I open these sites with the browser I use every day (Firefox on Linux, by the way - the same as
used in the PlaywrightCrawler settings), these sites say "5 plugins" and the field is green.

Is it something in my code?
Can Crawlee emulate these plugin attributes?

[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

And here is part of the PlaywrightCrawler config:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});


Screenshots:
37 comments
1
P
n
L
A
L
I checked my program (PlaywrightCrawler) against this thing: https://amiunique.org/fingerprint
Used a US residential proxy, took 3 screenshots, see below.
It seems there are some areas where Crawlee could do better (be less unique, less detectable)!

Here is the list (these items are red on the screenshots):
  • User Agent (I used fingerprint generator for this!)
  • Canvas
  • Navigator properties
  • List of fonts
  • List of plugins
  • Permissions
Some settings in my PlaywrightCrawler:
useFingerprints: true, useFingerprintCache: false, launcher: firefox

Regarding the list of plugins: I use some JS code (a pluginContent string) taken from here: https://discord.com/channels/801163717915574323/1059483872271798333
and inject it into the page this way:
Plain Text
    preNavigationHooks: [
        async ({ page, request }) => {
            await page.addInitScript({ content: pluginContent });
        },


Well, this code/hack simulates the presence of some PDF plugins... but I have the impression there are better solutions for plugins/fonts/permissions...
10 comments
L
n
A
A
P
6 months ago I created a Crawlee + Playwright + node-beanstalk (a JS wrapper for the beanstalkd message queue) project. I followed the Crawlee documentation, created a template of sorts, and started adding things to it (no Docker image was used; I just installed everything on an Ubuntu machine).
And somehow it works (which still amazes me)))

These are the versions in use at the moment (my package.json is below; feel free to take a look and criticize, I know it is not perfect):
Plain Text
   crawlee/core 3.3.1
   playwright 1.33.0

   npm:  8.19.3
   node: 16.19.0


Now I see that the latest Crawlee version is 3.5, the latest Playwright is 1.39, and maybe some other packages have been updated too. It is time to update.

So, what is the proper way to update Crawlee and Playwright in such a project?
Is it just this:
Plain Text
   npm update playwright
   npm update crawlee

Or something else?

I use headless Firefox; it is installed here:
~/.cache/ms-playwright/firefox-1403/
How do I update it?

Disclaimer: I am not a JS developer; I am a Java developer who somehow writes JS code (lots of copy/paste, yes). I know that dependency management is not that easy, so I think it is better to ask in this forum than to create a mess in my project...
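A hedged sketch of the usual flow. Note that npm update respects the semver ranges in package.json, so it may not jump from 3.3 to 3.5 on its own; the @latest specifiers below make the bump explicit, and npx playwright install fetches the browser build matching the new Playwright version:

```shell
# Bump both packages to their latest published versions
# (this also rewrites package.json and package-lock.json).
npm install crawlee@latest playwright@latest

# Download the Firefox build matching the new Playwright version;
# it lands in ~/.cache/ms-playwright/ next to the old firefox-1403.
npx playwright install firefox
```

Old browser builds in ~/.cache/ms-playwright/ are not removed automatically; they can be deleted by hand once the updated project runs.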
4 comments
L
n
W
Imagine the request queue of Crawlee (PlaywrightCrawler) containing URLs of two (or more) sites:

example.com/url1
another-site.com/url2
example.com/url3
another-site.com/url4
...

I would like to configure Crawlee with a per-site interval between requests. For the above example it means:

example.com: 20 sec (or more) between requests
another-site.com: 60 sec (or more) between requests

How can I do this with Crawlee?
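As far as I know there is no built-in per-domain interval (newer Crawlee versions have a sameDomainDelaySecs option, but I believe it is a single value applied to all domains; check the docs). A sketch of a per-host scheduler one could call at the top of requestHandler (all names and the delay table here are hypothetical):

```typescript
// Earliest allowed start time per hostname.
const lastScheduled = new Map<string, number>();

// Per-host minimum interval in milliseconds (hypothetical config).
const MIN_DELAY_MS: Record<string, number> = {
    'example.com': 20_000,
    'another-site.com': 60_000,
};

// Returns how many ms to sleep before hitting this URL, and books the slot
// so the next request to the same host is pushed out by the host's interval.
function waitMsFor(url: string, now: number = Date.now()): number {
    const host = new URL(url).hostname;
    const minDelay = MIN_DELAY_MS[host] ?? 0;
    const prev = lastScheduled.get(host);
    const start = prev === undefined ? now : Math.max(now, prev + minDelay);
    lastScheduled.set(host, start);
    return start - now;
}
```

Inside requestHandler one would then do something like `await sleep(waitMsFor(request.url))` before the real work; with maxConcurrency kept low this enforces the per-site spacing while requests to different sites interleave freely.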
1 comment
H
I have some code using PlaywrightCrawler. I added playwright-extra with stealthPlugin to this code, exactly as in the documentation [1].

I added only this to my code:
Plain Text
import { firefox } from 'playwright-extra';
import stealthPlugin from 'puppeteer-extra-plugin-stealth';
firefox.use(stealthPlugin());

The rest of the program remains the same as before. And I have useFingerprints: true and launcher: firefox in the code.

Well, the code works. Bot detection sites report that my crawler has 3 plugins and supports 4 MIME types, so something changed.
But! I get this on stdout:
Plain Text
INFO  PlaywrightCrawler: Starting the crawler.
An error occured while executing "onPageCreated" in plugin "stealth/evasions/user-agent-override": TypeError: Cannot read properties of undefined (reading 'userAgent')
    at Proxy.<anonymous> (.../node_modules/playwright-extra/src/puppeteer-compatiblity-shim/index.ts:217:23)
    at runNextTicks (node:internal/process/task_queues:61:5)
    at processImmediate (node:internal/timers:437:9)
    at process.topLevelDomainCallback (node:domain:161:15)
    at process.callbackTrampoline (node:internal/async_hooks:128:24)
    at async Plugin.onPageCreated (.../node_modules/puppeteer-extra-plugin-stealth/evasions/user-agent-override/index.js:69:8)

How bad is this?


[1] https://crawlee.dev/docs/examples/crawler-plugins
8 comments
d
D
H
L
n
Hi all,
what I want to achieve:

  • every request should have unique fingerprint - this is important!
  • cookies, etc. not shared between requests
  • PlaywrightCrawler
  • no sessions - every request is independent, (no login or similar)
  • Firefox
  • performance/throughput is not a number one prio
At the moment I almost have this with the hack retireBrowserAfterPageCount=2 in browserPoolOptions: it gives a unique fingerprint every two requests, which... isn't perfect (and starting a new browser instance so often looks strange).

In this thread: https://discord.com/channels/801163717915574323/1060467542616965150
a solution for a browser pool (without a crawler) was suggested.

I would like to have both: a new fingerprint per request and PlaywrightCrawler.
Is it possible?
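One obvious variant of the hack described above (hedged: I have not verified the fingerprint rotation end to end, and this trades throughput for isolation) is to retire the browser after every single page, combined with incognito pages so cookies and storage are not shared either:

```typescript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        useFingerprints: true,
        // A fingerprint is tied to a browser instance, so retiring the
        // browser after every page should yield a fresh fingerprint
        // per request (at the cost of a browser launch per request).
        retireBrowserAfterPageCount: 1,
    },
    launchContext: {
        // Each page gets its own incognito context: no shared cookies/storage.
        useIncognitoPages: true,
    },
    // ...requestHandler etc.
});
```

Since throughput is not your number one priority, the per-request browser launch may be an acceptable cost.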
1 comment
L
Actually this should be a great thing!

https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#retryOnBlocked

If set to true, the crawler will automatically try to bypass any detected bot protection.
Currently supports:
Cloudflare Bot Management
Google Search Rate Limiting

Can we have some information about how to use this thing?
Any prerequisites? Side effects?
Does it need some special settings in PlaywrightCrawler?
Example: I have maxRequestRetries=0 - is it OK to use retryOnBlocked in such a case?
1 comment
L
This exception happens in about 15-20% of all requests... quite often!

This line in the code:
Plain Text
content = await page.content();


Throws this exception
Plain Text
page.content: Target page, context or browser has been closed
   at (<somewhere-in-my-code>.js:170:54)
   at PlaywrightCrawler.requestHandler (<somewhere-in-my-code>.js:596:15)
   at async wrap (.../node_modules/@apify/timeout/index.js:52:21)


Is this a well-known issue?

Should I check (or wait for) something before calling page.content()?
I already check that response.status() is less than 400 (it is actually 200, I see it in the logs).
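A defensive sketch (hedged: it does not address the root cause, which is often the request handler outliving requestHandlerTimeoutSecs so that Crawlee closes the page underneath it): check Playwright's page.isClosed() before calling content(), and treat a close that happens mid-call as a signal to bail out. The structural interface below just mirrors the two Page members used, so the helper is self-contained; a real Playwright Page satisfies it too:

```typescript
// Structural stand-in for the two Playwright Page members used here.
interface ContentSource {
    isClosed(): boolean;
    content(): Promise<string>;
}

// Returns the page HTML, or null when the page/context/browser is
// already gone (so the caller can skip this request or let Crawlee retry).
async function safeContent(page: ContentSource): Promise<string | null> {
    if (page.isClosed()) return null; // closed before we even asked
    try {
        return await page.content();
    } catch {
        return null; // closed between the check and the call
    }
}
```

If the errors correlate with long sleeps or mouse-movement delays in the handler, raising requestHandlerTimeoutSecs is likely the more direct fix than any guard.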
7 comments
L
n
Got captcha and HTTP 403 when accessing wellfound.com

I get a captcha every time I access links like these (basically, when accessing any job ad on wellfound):
https://wellfound.com/company/kalepa/jobs/2651640-tech-lead-manager-full-stack-europe
https://wellfound.com/company/pinatacloud/jobs/2655889-principal-software-engineer
https://wellfound.com/company/wingspanapp/jobs/2629420-senior-software-engineer

Screenshot attached.


And this is not Cloudflare protection; it's some other anti-bot system.

I am using:
  • US residential proxies from smartproxy.com
  • PlaywrightCrawler with useSessionPool: false and persistCookiesPerSession: false
  • headless Firefox, both as launcher and in fingerprintGeneratorOptions browsers
  • my locale is en-US, timezone in America/New_York (to match US proxies)
  • in fingerprintGeneratorOptions devices: ['desktop']
  • in launchContext: { useIncognitoPages: true }
  • I set pluginContent in preNavigationHooks to fix the "plugin length" problem, as described here: https://discord.com/channels/801163717915574323/1059483872271798333
And still this site detects me as a robot!
Any ideas how to overcome this?

UPDATE1: the IP on the screenshot is somewhere in US/Texas...

UPDATE2: when I open these links in my desktop browser in incognito mode, I get this captcha too...
11 comments
n
M
O
I see these messages on the console
Plain Text
 INFO  Statistics: PlaywrightCrawler request statistics: {"requestAvgFailedDurationMillis":null,


How can I disable them?

P.S.

I already have this:
Plain Text
... new PlaywrightCrawler({
    autoscaledPoolOptions: {
        loggingIntervalSecs: null,
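A blunt approach that should silence these lines (hedged: it also hides all other INFO output such as "Starting the crawler", and I have not checked whether newer Crawlee versions expose a dedicated option for the statistics logger) is to raise the global log level above INFO:

```typescript
import { log, LogLevel } from 'crawlee';

// The periodic "Statistics: PlaywrightCrawler request statistics" lines
// are logged at INFO level, so raising the global level suppresses them.
log.setLevel(LogLevel.WARNING);
```

Call this once before crawler.run(); warnings and errors still come through.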
3 comments
n
L
I already block images as described in [1] and this helps save some bandwidth.
Next step: looking at the statistics in my proxy service, I see a significant number of requests like these:

Plain Text
https://www.googletagmanager.com/gtag/js?id=...
https://connect.facebook.net/en_US/fbevents.js
https://www.google-analytics.com/analytics.js
https://fonts.googleapis.com/css?family=Lato


Can somebody show me example code for blocking these domains? (Better: blocking all domains from a given list.)

I assume it should be something in PlaywrightCrawler.preNavigationHooks, right?
Prerequisites: PlaywrightCrawler, Firefox as launcher (Chrome-specific hacks probably would not work).

(I'm not good at writing JavaScript from scratch, so I need some help.)

[1] https://discord.com/channels/801163717915574323/1060986956961546320
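A sketch along those lines: a small host-matching helper plus Playwright's page.route (which works with Firefox as well as Chromium) wired into preNavigationHooks. The helper is self-contained; the hook wiring at the end is shown as a comment because it assumes your existing PlaywrightCrawler options object:

```typescript
// Third-party hosts to block (extend this list as needed).
const BLOCKED_HOSTS = [
    'googletagmanager.com',
    'connect.facebook.net',
    'google-analytics.com',
    'fonts.googleapis.com',
];

// True when the URL's hostname is one of the blocked hosts
// or a subdomain of one (e.g. www.google-analytics.com).
function isBlockedHost(url: string, blocked: string[] = BLOCKED_HOSTS): boolean {
    const host = new URL(url).hostname;
    return blocked.some((b) => host === b || host.endsWith('.' + b));
}

// Wiring it into PlaywrightCrawler (sketch):
//
// preNavigationHooks: [
//     async ({ page }) => {
//         await page.route('**/*', (route) =>
//             isBlockedHost(route.request().url()) ? route.abort() : route.continue());
//     },
// ],
```

Aborted requests never reach the proxy, so this should show up directly as reduced traffic in your proxy statistics.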
1 comment
L
How to detect captcha?

I see this in the response HTML:
Plain Text
<head>
  ...
  <meta name="captcha-challenge" content="1">
  ... 

but I would prefer to use some function in Playwright/Crawlee.
I mean, some generic way to detect captchas: who knows which captcha variant I will get in the future...

I can not use the HTTP status: this page returns status 200 but shows a captcha!
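I am not aware of a fully generic built-in (Crawlee's retryOnBlocked does something similar internally for the protections it supports), so a practical fallback is a marker-based heuristic over the rendered HTML from page.content(). A sketch; the meta tag is the one from this site, the other patterns are examples of common vendor fingerprints and may need tuning:

```typescript
// Known captcha markers (examples; extend for vendors you encounter).
const CAPTCHA_MARKERS: RegExp[] = [
    /<meta[^>]+name=["']captcha-challenge["']/i,
    /captcha-delivery\.com/i,         // DataDome
    /challenge-platform|cf-chl/i,     // Cloudflare
];

// Heuristic check on the rendered HTML (the result of page.content()).
function looksLikeCaptcha(html: string): boolean {
    return CAPTCHA_MARKERS.some((re) => re.test(html));
}
```

In requestHandler one could then throw (or call session.retire()) when looksLikeCaptcha returns true, so the request is retried through a different session.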
1 comment
L
I am using PlaywrightCrawler and the failedRequestHandler to handle errors.
Something like this:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    async failedRequestHandler({request, response, page, log}, error) {

    ...

And sometimes I see errors in the log:
Plain Text
ERROR failedRequestHandler: Request failed and reached maximum retries. page.goto: SSL_ERROR_BAD_CERT_DOMAIN


But! When I look inside the error argument of failedRequestHandler with JSON.stringify(error),
I see only this: {"name":"Error"}

It seems the detailed error message I see in the log is not accessible via the error argument.

So, how do I access the detailed error message in code?
8 comments
n
L
A
One of the pages I want to scrape with PlaywrightCrawler returns the SSL_ERROR_BAD_CERT_DOMAIN error.
I can reproduce this error when I open the URL in Firefox/Chrome: the browser shows a prompt with
the warning and asks "...do you want to proceed?"

So the error comes from the browser, not from Crawlee/Playwright...

But... Firefox has so many flags/settings... maybe I can somehow set an
"accept all certificates" flag?
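Playwright has a standard context option for exactly this: ignoreHTTPSErrors. In Crawlee I believe it can be passed through launchContext.launchOptions (hedged: with the default persistent context, launch and context options are merged; verify against your Crawlee version):

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            // Playwright context option: ignore TLS errors such as
            // SSL_ERROR_BAD_CERT_DOMAIN instead of failing the navigation.
            ignoreHTTPSErrors: true,
        },
    },
    // ...requestHandler etc.
});
```

Security note: this accepts any certificate for every page the crawler visits, so it is best reserved for crawlers that do not handle credentials.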
2 comments
n
In PlaywrightCrawler's requestHandler I call page.mouse.move and
sometimes I get this error: mouse.move: Target page, context or browser has been closed

Here is the sequence of calls:
Plain Text
async requestHandler( {request, response, page, enqueueLinks, log, proxyInfo} )
{
    ...
    await sleep( interval );
    await page.mouse.move( rnd(100,400), rnd(40,300) );
    await sleep( interval );
    ...
    content = await page.content();
}


If I catch the exception thrown in page.mouse.move and continue, then I get
almost the same thing when calling page.content():
page.content: Target page, context or browser has been closed

I would like to move the mouse randomly: I think it makes my scraper more "human-like".
But something is going wrong here and I can not figure out what.

Sometimes this code works and sometimes I see these errors!
Please help!

UPDATE:
and sometimes the error message is:
ERROR requestHandler: Request failed and reached maximum retries. page.goto: Navigation failed because page was closed!
I think these error messages are somehow related...
4 comments
L
n
A

It can be done either with preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks

or with blockRequests, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests

As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:
https://discord.com/channels/801163717915574323/1039557325784105002
https://discord.com/channels/801163717915574323/1019949012415160370


As far as I understand, you can not have both: cache AND incognito mode.
Well, there is the experimentalContainers thing; in theory it should allow both cache and incognito.
I tried it, see https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
It looks like it's not really "incognito" when fingerprint.com recognizes you even though your IP is different.
(You can disprove me; maybe my test was wrong, who knows?)


Please suggest...


So one of the ideas is to use "Datacenter" proxies instead of "Residential"...
I see datacenter proxies for about $0.7 per GB, much cheaper than residential.
Does it make sense to try?
What is your experience with datacenter proxies?
12 comments
L
A
n
A
Are browser fingerprints changed
  • every request?
  • every 1 min?
  • every... I do not know what else ))
And how is changing browser fingerprints related
to using or not using PlaywrightCrawler.launchContext.useIncognitoPages?

I am asking this because I saw a situation where two attempts to open the bot detection site
https://fingerprint.com/demo/ resulted in the same "ID"; in other words, they were
able to identify me! Screenshots attached.

Interval between requests: 3 min.
Different IPs (from a pool of "rotating" IPs).
Without incognito.
8 comments
n
L
A
For developers building scrapers/crawlers with the Crawlee library: which proxy services are you using?

  • Is it possible to use "US residential proxies"?
  • What do you think about quality of service?
  • What about price?
18 comments
1
L
w
A
H
n
OK, I know in which country my proxies/IPs are, so I can set the locale:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            locales: [ ... ],
    ...

BUT! How do I set the timezone corresponding to that country?

This is not a theoretical question: this site, https://pixelscan.net,
checks the timezone, detects "Africa/Abidjan", compares it with my IP in a German datacenter,
and says "Looks like you're spoofing your location". (Attached: two parts of a huge screenshot made in headless mode with PlaywrightCrawler.)

So how do I set/control the timezone?
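Playwright exposes the timezone as the context option timezoneId. In Crawlee I believe it can be passed via launchContext.launchOptions when the default persistent context is used (hedged: verify against your Crawlee version); the key point is to keep timezone, locale, and proxy exit country consistent:

```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            // Playwright context options: make locale and timezone agree
            // with the proxy exit country (values here are examples).
            locale: 'de-DE',
            timezoneId: 'Europe/Berlin',
        },
    },
    // ...requestHandler etc.
});
```

With rotating proxies across multiple countries this gets harder, since one fixed timezoneId cannot match every exit IP; per-country sessions are one way around that.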
12 comments
L
A
n
L