Apify Discord Mirror

Updated 5 months ago

Crawlee vs bot detection systems - Plugins length is not OK

At a glance

The community member tested PlaywrightCrawler on three bot detection sites and found that the sites complained about "0 plugins" or "Plugins length", even though the same sites displayed "5 plugins" when opened with the community member's regular browser (Firefox on Linux). The community member asked if there was an issue with their code and whether Crawlee could emulate the plugin attributes.

In the comments, another community member shared their code that fixed the "Plugins length" error by using preNavigationHooks to inject custom plugin and mime type information. This approach was also shared by other community members, who provided example code and discussed how to handle other bot detection checks like WebGL and hairline feature tests.

The community members worked together to find solutions to bypass the bot detection on the test sites, with some noting that the provided code only fixes the "Plugin length" and "Mime types" issues, and that more comprehensive solutions may be needed to pass all the bot checks.

There is no explicitly marked answer, but the community members collectively provided solutions and suggestions to address the bot detection issues encountered with PlaywrightCrawler.

Useful resources
I tested PlaywrightCrawler on three bot detection sites (see [1], [2], [3] and the attached screenshots).
In all cases these sites complain about "0 plugins" or "Plugins length".

If I open these sites with the browser I use every day (Firefox on Linux, the same as
in the PlaywrightCrawler settings), they report "5 plugins" and the field is green.

Is it something in my code?
Can Crawlee emulate these plugin attributes?

[1] - https://infosimples.github.io/detect-headless/
[2] - https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html
[3] - https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html

And here is the relevant part of the PlaywrightCrawler setup:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    browserPoolOptions: {
        useFingerprints: true,

        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        launcher: firefox
    },

});


Screenshots:
Attachments
03-webscraping.pro-f1fceabcc55af4353c0da1cddf3e72d7.png
01-infosimples.github.io-19b9a46843518680ccc72bada5fe8b69.png
02-intoli.com-44d20f5d8ce2747086171e4aeecca746.png
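
For context, the "0 plugins" flag presumably comes from a check along these lines. This is a minimal sketch under assumptions: `checkPluginSignals` is a hypothetical helper, and the real test pages may use different logic or thresholds.

```javascript
// Hypothetical re-creation of a "Plugins length" style check.
// `nav` is a navigator-like object; the test pages may implement this differently.
function checkPluginSignals(nav) {
    const plugins = nav.plugins ? nav.plugins.length : 0;
    const mimeTypes = nav.mimeTypes ? nav.mimeTypes.length : 0;
    return {
        plugins,
        mimeTypes,
        // An empty plugin or MIME type list is treated as a headless hint.
        suspicious: plugins === 0 || mimeTypes === 0,
    };
}
```

Running `checkPluginSignals(navigator)` in the browser console shows which values a page script would see.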
37 comments
On https://bot.sannysoft.com/
It's OK with my code (different from yours). I get this:
Which URL do you use to test?
Attachment
image.png
Well, here is the code I used to get the "Plugins length" error:
Plain Text
import { firefox, webkit } from 'playwright';
import { PlaywrightCrawler, Dataset, ProxyConfiguration, Request, log, sleep } from 'crawlee';
import { launchPlaywright, playwrightUtils } from 'crawlee';
import * as crypt from 'crypto';

const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 4,
        loggingIntervalSecs: null,

    },

    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,

    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
    },

    launchContext: {
        useIncognitoPages: true,
        launcher: firefox
    },

    async requestHandler( {request, response, page, enqueueLinks, log, proxyInfo} )
    {
        const uniqueKey = crypt.randomBytes(16).toString("hex");
        let url = new URL(request.url);
        let host = url.host;
        let scrFile = `${host}-${uniqueKey}.png`;

        log.info(`GET ${request.url}  Wait1 ...`);
        await sleep(40*1000);

        log.info(`GET ${request.url}  Wait2, Pressing Enter ...`);
        await page.keyboard.press('Enter');
        await sleep(40*1000);

        log.info(`GET ${request.url}  Writing into ${scrFile} ...`);
        await page.screenshot( {path:scrFile, fullPage:true} );
        log.info(`GET ${request.url}  DONE`);
    },
});

await crawler.run([
    "https://infosimples.github.io/detect-headless/",
    "https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html",
    "https://webscraping.pro/wp-content/uploads/2021/02/testresult2.html"
]);
What I want to achieve: a scraper that raises no "red flags" on bot detection systems like the three sites above AND passes this check: https://nowsecure.nl/ (as far as I understand, nowsecure.nl implements a variant of Cloudflare protection).

I'm using Firefox as the launcher; it seems I can pass the nowsecure.nl check only with Firefox.
Thanks for this. Can you try with the session pool on? I'm not sure whether anything is bound to that.

please look into this
just changed useSessionPool to:

Plain Text
useSessionPool: true,


same thing - "Plugins Length: 0"
With Chromium instead of Firefox as the launcher, there is no "Plugins length" error.
Attachment
image.png
I use this hook with Firefox as the launcher, following the fingerprint-injector guide for Playwright [1].

With it, there are no more "Plugins length" errors.

[1] https://github.com/apify/fingerprint-suite/blob/master/docs/guides/fingerprint-injector.md#usage-with-playwright
Attachment
image.png
Great, so this can be fixed!

But for somebody who is new to JS/TS (like me), it would be better to have some example code starting with
Plain Text
 crawler = new PlaywrightCrawler({
 ...
 });

it is possible, isn't it?
Yes, it's up to you to do the job 😉
- many thanks for the code!!!

It works, it really works!!!
Even with my ugly JS code (please suggest how to improve it) -- it works!!!

I put the JS code that creates the plugins in preNavigationHooks; I'm not sure this is the optimal solution...
Attachments
01-infosimples.github.io-584ec122ec20527d880ef7ec3805d68c.png
02-webscraping.pro-311c7989ef9ed30b6407d4498b811594.png
Thanks for the debug. and will eventually check this and see how it can be implemented to Crawlee best
by the way - when fixing "plugin length" - please also fix "0 mime types".
Several sites are checking "mime types length":

https://infosimples.github.io/detect-headless/
under "Mime"

https://browserleaks.com/javascript
search for "mimeTypes"

attached - screenshot from https://browserleaks.com/javascript - made with code above, you can see "mimeTypes: 0"
Attachment
browserleaks.com-mime-types.png
You can do it with this:
Plain Text
    const pluginContent = `
    Object.defineProperty(navigator, 'plugins', {
        get: () => {
            const PDFPlugin = Object.create(Plugin.prototype, {
                description: { value: 'Portable Document Format', enumerable: false },
                filename: { value: 'internal-pdf-viewer', enumerable: false },
                name: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(PluginArray.prototype, {
                length: { value: 1 },
                0: { value: PDFPlugin },
            });
        },
    });
    Object.defineProperty(navigator, 'mimeTypes', {
        get: () => {
            const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
                type: { value: 'text/pdf', enumerable: false },
                suffixes: { value: 'pdf', enumerable: false },
                description: { value: 'Portable Document Format', enumerable: false },
                enabledPlugin: { value: 'PDF Plugin', enumerable: false },
            });
            return Object.create(MimeTypeArray.prototype, {
                length: { value: 1 },
                0: { value: PDFMimeTypeTxt },
            });
        },
    });
    `

attached - screenshot from https://browserleaks.com/javascript - made with code above, you can see mimeTypes: text/pdf, pdf, Portable Document Format
Attachment
image.png
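
For readers who want a starting point based on `new PlaywrightCrawler({ ... })`, the script above can be registered from a pre-navigation hook. This is a minimal sketch, not an official Crawlee recipe: `createInitScriptHook` is a hypothetical helper name, and the usage part assumes `crawlee` and `playwright` are installed and that `pluginContent` holds the script string shown above.

```javascript
// Hypothetical helper (not part of Crawlee): builds a preNavigationHook that
// registers `content` as an init script, so it runs before any page script.
function createInitScriptHook(content) {
    return async ({ page }) => {
        // page.addInitScript is Playwright's counterpart of
        // Puppeteer's page.evaluateOnNewDocument.
        await page.addInitScript({ content });
    };
}

// Usage sketch (assumption: `pluginContent` is the string defined above):
//
// import { PlaywrightCrawler } from 'crawlee';
// import { firefox } from 'playwright';
//
// const crawler = new PlaywrightCrawler({
//     launchContext: { launcher: firefox },
//     preNavigationHooks: [createInitScriptHook(pluginContent)],
//     async requestHandler({ page, log }) {
//         const count = await page.evaluate(() => navigator.plugins.length);
//         log.info(`navigator.plugins.length = ${count}`);
//     },
// });
```

Keeping the hook in a small factory makes it easy to reuse the same wiring for other init scripts.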
works like a charm!

thanks !!!
Just curious: how are you generating those plugins? I am using Puppeteer but still fail the check in the bot tests. See the screenshot.
Attachment
bottest.png
what is interesting: in some cases this code should be in preLaunchHooks and in some cases - in prePageCreateHooks
do not ask me what happens there, I just played a bit ))))

Anyway, attached is my super-mega-PlaywrightCrawler ))) producing 1km of logs (printf debugging, yes) but demonstrating green results for "plugin length" and "mimeTypes"
Thanks a lot. I was able to make it work using Puppeteer.
code:
Plain Text
preNavigationHooks: [
        async ({ page, request }) => {
            log.info(`preNavigationHook: GET=${request.url} START`);
            const preloadFile = fs.readFileSync('./preload.js', 'utf8');
            await page.evaluateOnNewDocument(preloadFile);
            log.info(`preNavigationHook: GET=${request.url} END`);
        }
    ],

preload.js:
Plain Text
Object.defineProperty(navigator, 'plugins', {
    get: () => {
        const PDFPlugin = Object.create(Plugin.prototype, {
            description: { value: 'Portable Document Format', enumerable: false },
            filename: { value: 'internal-pdf-viewer', enumerable: false },
            name: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(PluginArray.prototype, {
            length: { value: 1 },
            0: { value: PDFPlugin },
        });
    },
});
Object.defineProperty(navigator, 'mimeTypes', {
    get: () => {
        const PDFMimeTypeTxt = Object.create(MimeType.prototype, {
            type: { value: 'text/pdf', enumerable: false },
            suffixes: { value: 'pdf', enumerable: false },
            description: { value: 'Portable Document Format', enumerable: false },
            enabledPlugin: { value: 'PDF Plugin', enumerable: false },
        });
        return Object.create(MimeTypeArray.prototype, {
            length: { value: 1 },
            0: { value: PDFMimeTypeTxt },
        });
    },
});
I ran your script locally with proxy servers, but I still see these red flags. Any idea how you resolved them? I am trying to figure out the same thing.
Attachment
image.png
Attachment
image.png
Well, this JS code:
https://discord.com/channels/801163717915574323/1059483872271798333/1060501044456607774
is fixing only "Plugin length" and "Mime types".


Nothing else.
I was able to resolve all the bot checks using this plugin: https://discord.com/channels/801163717915574323/1051917834290200608/1052147143508500490

Only the webdriver check in the fingerprint tests and the hairline feature test failed; everything else passed.
Well... actually code attached to this message https://discord.com/channels/801163717915574323/1059483872271798333/1060959263641567354
has green "webdriver" flag and many other bot checks are also green

Yes, hairline feature... can we ignore it?
I am not sure about the hairline feature, but in many YouTube videos and a few blogs I have seen, most people ignore it.
With code provided in the following link https://intoli.com/blog/making-chrome-headless-undetectable/, which looks as follows:
Plain Text
    const webGLContent = `
    // Keep a reference to the original method; it lives on the prototype.
    const getParameter = WebGLRenderingContext.prototype.getParameter;
    WebGLRenderingContext.prototype.getParameter = function(parameter) {
      // UNMASKED_VENDOR_WEBGL
      if (parameter === 37445) {
        return 'Intel Open Source Technology Center';
      }
      // UNMASKED_RENDERER_WEBGL
      if (parameter === 37446) {
        return 'Mesa DRI Intel(R) Ivybridge Mobile ';
      }

      // Delegate everything else to the original, bound to this context.
      return getParameter.call(this, parameter);
    };
    `
......
await page.addInitScript({ content: webGLContent });
......

returns the desired values for the renderer and vendor like this
Attachment
image.png
And as indicated in the article, you can also set Retina/HiDPI Hairline Feature.
But as mentioned, "This is another test that doesn’t really make a ton of sense because the majority of people don’t have HiDPI screens and most users’ browsers won’t support this feature. "
const webGLContent = ...
Excellent! what we really need is a list of 100-200 such strings and a piece of JS code randomly returning a "webGL string"... (in other words - this functionality should be in the next version of Crawlee)
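
The idea of randomly returning a "webGL string" could be sketched as follows. This is an illustration under assumptions, not existing Crawlee functionality: `WEBGL_PROFILES`, `pickProfile`, and `buildWebGLScript` are hypothetical names, and the second vendor/renderer pair is an assumed example rather than a value from this thread.

```javascript
// Illustrative vendor/renderer pairs. The first comes from the thread;
// the second is an assumed example. A real list would be much longer.
const WEBGL_PROFILES = [
    { vendor: 'Intel Open Source Technology Center', renderer: 'Mesa DRI Intel(R) Ivybridge Mobile ' },
    { vendor: 'Intel Inc.', renderer: 'Intel Iris OpenGL Engine' },
];

// Pick one profile at random; `random` is injectable for testing.
function pickProfile(profiles, random = Math.random) {
    return profiles[Math.floor(random() * profiles.length)];
}

// Build an init script that makes getParameter report the chosen profile.
function buildWebGLScript({ vendor, renderer }) {
    return `
    (() => {
      const getParameter = WebGLRenderingContext.prototype.getParameter;
      WebGLRenderingContext.prototype.getParameter = function (parameter) {
        // UNMASKED_VENDOR_WEBGL
        if (parameter === 37445) return ${JSON.stringify(vendor)};
        // UNMASKED_RENDERER_WEBGL
        if (parameter === 37446) return ${JSON.stringify(renderer)};
        return getParameter.call(this, parameter);
      };
    })();
    `;
}

// Usage sketch inside a Playwright hook:
// await page.addInitScript({ content: buildWebGLScript(pickProfile(WEBGL_PROFILES)) });
```

Picking the profile once per session rather than per page would probably look more consistent to a detector, though that is a design guess.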
Thanks a lot for sharing 🙂
Great research, guys. Once our team gets more time, we will make sure all of this is implemented by default in Crawlee.
any news about this plugin problem?
Hi, there is currently a PR for this.
Sorry, wrong thread; that one is for https://discord.com/channels/801163717915574323/1059916802446073957