Apify Discord Mirror

Updated 2 years ago

Crawlee - how to set timezone?

At a glance
A community member using PlaywrightCrawler cannot get the browser to report the correct timezone for their location. The website https://pixelscan.net detects the timezone as "Africa/Abidjan" even though the IP belongs to a German data center, and flags the location as spoofed. The comments suggest Playwright's timezoneId option, as accepted by browser.newContext(). One community member provides a working code example that sets locale and timezoneId through the preLaunchHooks option in browserPoolOptions. Another community member suggests setting launchOptions in launchContext instead. Both approaches are reported to require useIncognitoPages: false. There is no explicitly marked answer.
OK, I know which country my proxies/IPs are in, so I can set the locale:
Plain Text
const crawler = new PlaywrightCrawler({
    ...
    fingerprintOptions: {
        fingerprintGeneratorOptions: {
            locales: [ ... ],
    ...

But how do I set the timezone that corresponds to the country?

This is not a theoretical question. The site https://pixelscan.net
checks the timezone, detects "Africa/Abidjan", compares it with my IP in a German data center,
and says "Look like you spoofing your location". (Attached: two parts of a huge screenshot taken in headless mode with PlaywrightCrawler.)

So how do I set or control the timezone?
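To see why a timezone such as "Africa/Abidjan" looks inconsistent with a German IP, the UTC offsets of the two zones can be compared in plain Node.js. This is only a sketch: `utcOffsetMinutes` is a hypothetical helper name, and how pixelscan.net actually performs its check is not known from this thread.

```javascript
// Hypothetical helper: UTC offset (in minutes) of an IANA timezone at a given instant.
// It formats the instant in the target zone, reads the wall-clock fields back,
// and compares them with the same instant expressed in UTC.
function utcOffsetMinutes(timeZone, date = new Date()) {
    const parts = new Intl.DateTimeFormat('en-US', {
        timeZone,
        hourCycle: 'h23',
        year: 'numeric',
        month: 'numeric',
        day: 'numeric',
        hour: 'numeric',
        minute: 'numeric',
        second: 'numeric',
    }).formatToParts(date);

    const get = (type) => Number(parts.find((p) => p.type === type).value);
    const wallClockAsUtc = Date.UTC(
        get('year'), get('month') - 1, get('day'),
        get('hour'), get('minute'), get('second'),
    );
    return Math.round((wallClockAsUtc - date.getTime()) / 60000);
}

// A fixed instant keeps the result independent of when the script runs.
const winter = new Date('2024-01-15T12:00:00Z');
console.log(utcOffsetMinutes('Africa/Abidjan', winter)); // 0
console.log(utcOffsetMinutes('Europe/Berlin', winter));  // 60
```

A browser reporting a zone at UTC+0 while its IP geolocates to a country at UTC+1 or UTC+2 is the kind of mismatch a fingerprinting site could plausibly flag.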
Attachment
01-pixelscan.net.png
12 comments
part of screenshot from https://pixelscan.net
Attachment
02-pixelscan.net.png
Would you please help me use timezoneId in Crawlee? (I have difficulty connecting the Playwright API with Crawlee.)

This is the configuration of my PlaywrightCrawler:

Plain Text
const crawler = new PlaywrightCrawler({
    autoscaledPoolOptions: {
        minConcurrency: 2,
        maxConcurrency: 4,
        loggingIntervalSecs: null,

    },

    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    headless: true,

    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
                locales: ['de-DE', 'de'],
            },
        },
    },

    launchContext: {
        useIncognitoPages: true,
        launcher: firefox,
    },
    // ... (remaining options not shown in the original post)
});

I think timezoneId should go somewhere in launchContext, but where?
Here is a working example:
Plain Text
import {
    PlaywrightCrawler,  // https://crawlee.dev/docs/examples/playwright-crawler
    sleep,
} from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    headless: false,
    maxConcurrency: 4,
    minConcurrency: 2,
    maxRequestRetries: 0,
    navigationTimeoutSecs: 130,
    requestHandlerTimeoutSecs: 110,
    useSessionPool: false,
    persistCookiesPerSession: false,
    browserPoolOptions: {
        useFingerprints: true,
        operationTimeoutSecs: 40,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
            },
        },
        preLaunchHooks: [
            // Runs before each browser launch; sets locale and timezone on the launch options.
            async (pageId, launchContext) => {
                launchContext.launchOptions.locale = 'en-AU';
                launchContext.launchOptions.timezoneId = 'Australia/Brisbane';
            },
        ],
    },
    launchContext: {
        useIncognitoPages: false,
        launcher: firefox
    },

    async requestHandler({ request, page, log }) {
        log.info(`GET ${request.url}  DONE`);

        // Read the IANA timezone that the page sees (https://en.wikipedia.org/wiki/List_of_tz_database_time_zones)
        const timezoneFromJavascript = await page.evaluate('Intl.DateTimeFormat().resolvedOptions().timeZone');
        log.info(`Timezone from JavaScript: ${timezoneFromJavascript}`);

        if (request.userData.site === 'pixelscan') {
            await sleep(10000);
        }

        const url = new URL(request.url);
        await page.screenshot({ path: `${url.host}.png`, fullPage: true });
    },
});

await crawler.run([
    { url: 'https://pixelscan.net/', userData: { site: 'pixelscan' } },
]);
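A mistyped timezoneId is easy to overlook, so it may help to validate the IANA name before the crawler starts. The sketch below uses only the standard Intl API; `isValidTimeZone` is a hypothetical helper name and is not part of Crawlee or Playwright.

```javascript
// Hypothetical helper: returns true if the runtime recognizes the IANA timezone name.
// Intl.DateTimeFormat throws a RangeError for unknown timezone identifiers.
function isValidTimeZone(timeZone) {
    if (typeof timeZone !== 'string' || timeZone.length === 0) {
        return false;
    }
    try {
        new Intl.DateTimeFormat('en-US', { timeZone });
        return true;
    } catch (err) {
        if (err instanceof RangeError) {
            return false;
        }
        throw err; // anything else is unexpected, so do not hide it
    }
}

console.log(isValidTimeZone('Australia/Brisbane'));  // true
console.log(isValidTimeZone('Australia/Brisbaine')); // false (typo)
```

The check depends on the timezone data bundled with the JavaScript runtime, so it only confirms that the name is known locally, not that the browser will accept it.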
Another option is to set launchOptions in launchContext like this:
Plain Text
    launchContext: {
        launchOptions: {
            locale: 'en-AU',
            timezoneId: 'Australia/Brisbane'
        },
        useIncognitoPages: false,
        launcher: firefox
    },

IMPORTANT: In all cases, useIncognitoPages must be set to false.
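Since the proxy country is known in advance, the locale and timezone can be kept together in one place so they do not drift apart. The following is a minimal sketch, not a Crawlee API: `GEO_SETTINGS` and `buildLaunchContext` are hypothetical names, and the table lists only the two countries that come up in this thread ('Europe/Berlin' is the IANA zone for Germany but is not quoted in the thread itself).

```javascript
// Hypothetical lookup table: proxy country code -> browser locale and IANA timezone.
// Countries with several timezones (Australia, for example) need a more specific choice;
// 'Australia/Brisbane' is simply the value used earlier in this thread.
const GEO_SETTINGS = {
    DE: { locale: 'de-DE', timezoneId: 'Europe/Berlin' },
    AU: { locale: 'en-AU', timezoneId: 'Australia/Brisbane' },
};

// Hypothetical helper: builds the launchContext object shown above for one country.
function buildLaunchContext(countryCode, launcher) {
    if (!Object.prototype.hasOwnProperty.call(GEO_SETTINGS, countryCode)) {
        throw new Error(`No geo settings defined for country "${countryCode}"`);
    }
    return {
        launcher,
        useIncognitoPages: false, // required according to this thread
        launchOptions: { ...GEO_SETTINGS[countryCode] },
    };
}

// A placeholder launcher keeps the sketch self-contained; in a real crawler this
// would be `firefox` imported from 'playwright'.
const launchContext = buildLaunchContext('DE', null);
console.log(launchContext.launchOptions.timezoneId); // Europe/Berlin
```

The result would be passed as `launchContext` in the PlaywrightCrawler options. The same two values could instead be assigned inside a preLaunchHooks callback, as in the working example.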
Awesome! The preLaunchHooks approach, combined with useIncognitoPages: false and these settings in fingerprintGeneratorOptions (I additionally set locales):

Plain Text
            fingerprintGeneratorOptions: {
                browsers: ['firefox'],
                operatingSystems: ['linux'],
                locales: ['de-DE'],
            },

With all of this, it finally works: pixelscan.net believes I'm in Germany (which is true; no proxies were used in this test yet). See the screenshots.
Attachments
pixelscan-3.png
pixelscan-1.png
pixelscan-2.png
I think an example like this should be added to the Crawlee documentation.
Without it, I would never have figured out how this works.
Yes, I tried setting "launchOptions" this way, but it does not work. No idea why.
The new headless Chrome passes all the pixelscan checks as well as CreepJS. Do check it out.
When we use proxies with the new headless mode, pixelscan shows a "location is being spoofed" message. Any idea how to avoid it?
Is that the WebRTC leak, perhaps? As I wrote, it will be fixed soonish.