Apify Discord Mirror

Updated 2 years ago

Ways to minimize traffic (save money) when crawling-scraping?

At a glance
The post discusses ways to minimize traffic and save money when crawling and scraping websites. Community members suggest using preNavigationHooks or blockRequests to block unnecessary requests, but note that blockRequests has limitations (it does not work with Firefox). They also discuss using Datacenter proxies instead of Residential proxies as a cheaper option, and one community member shares an approach to blocking specific resource types to reduce traffic, with code examples. There is no explicitly marked answer, but the discussion provides insights and suggestions from the community.
Useful resources

It can be done either with preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks

or with the blockRequests https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests
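
For reference, a minimal sketch of the blockRequests variant inside a PlaywrightCrawler (the extra pattern is just a placeholder):

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Blocks common static assets by default; extraUrlPatterns
            // adds custom patterns on top of the built-in defaults.
            await blockRequests({ extraUrlPatterns: ['.mp4'] });
        },
    ],
    async requestHandler({ page }) {
        // extraction logic goes here
    },
});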

As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:
https://discord.com/channels/801163717915574323/1039557325784105002
https://discord.com/channels/801163717915574323/1019949012415160370


As far as I understand - you cannot have both: cache AND incognito mode.
Well, there is the experimentalContainers thing - in theory it should allow both cache and incognito.
I tried it, see https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
It looks like it's not really "incognito" when fingerprint.com recognizes you even though your IP is different.
(You can disprove me - maybe my test was wrong, who knows?)
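
If you want to reproduce the test, the relevant flags live on launchContext (a sketch; treat the exact flag names as an assumption tied to the Crawlee version current at the time):

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Fresh browser context per page - no shared cache or cookies.
        useIncognitoPages: true,
        // The experimental attempt to combine per-page isolation with caching.
        experimentalContainers: true,
    },
    // ...
});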


Please suggest...


So one of the ideas is to use "Datacenter" proxies instead of "Residential"...
I see Datacenter proxies for about $0.7 per GB - much cheaper than Residential.
Does it make sense to try?
What is your experience with Datacenter proxies?
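
If you test this on the Apify platform, switching proxy types comes down to the proxy group (a sketch, assuming the Apify SDK v3 Actor API):

Plain Text
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// With no groups specified, Apify Proxy picks datacenter IPs;
// residential must be requested explicitly and costs more per GB.
const proxyConfiguration = await Actor.createProxyConfiguration({
    // groups: ['RESIDENTIAL'], // the expensive option this thread is avoiding
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // ...
});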
12 comments
Why not change the set of datacenter proxies every 5-6 days?
  1. blockRequests doesn't work in Firefox at all, sadly
  2. Generally, you reduce traffic and compute by scraping via the site's API, which is possible on most modern websites (see the sketch after this message). Sites that ship everything in HTML are heavier, but you almost never need a browser for them. https://developers.apify.com/academy/api-scraping
A browser is so much heavier that none of the above really matters if you don't need one.
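
Here is a sketch of what "scraping via API" usually looks like in practice - one plain HTTP request for JSON instead of a full browser load (the endpoint and fields are made up):

Plain Text
import { gotScraping } from 'got-scraping';

// Hypothetical JSON endpoint, the kind you find in the browser's network tab.
const { body } = await gotScraping({
    url: 'https://example.com/api/products?page=1',
    responseType: 'json',
});

console.log(body.items.length, 'items for a few KB of traffic');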
by the way:

I want to avoid downloading unnecessary files (unnecessary requests) so I'm using
the method described here:
https://discord.com/channels/801163717915574323/1039557325784105002

Here https://playwright.dev/docs/api/class-request#request-resource-type (Playwright documentation) is the list of resource types to check:
Plain Text
document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.


BUT!
Looking here
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType
I see many more resource types. Things like
Plain Text
"csp_report", "beacon", "imageset", "ping" 

and many others.

Should I include elements from both lists in my BLOCKED array? (Imagine I'm paranoid and want to block everything except the main document.)
Sure, why not. Just keep in mind that the website might not work properly if you block some stuff.
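
If you really want the paranoid extreme, an allowlist is simpler than merging the two blocklists - let only the main document through (a sketch; most pages will not render without scripts and XHR):

Plain Text
await page.route('**/*', (route) =>
    route.request().resourceType() === 'document'
        ? route.continue()
        : route.abort(),
);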
...and here are the resource types I block:
Plain Text
const BLOCKED_IMG = ['image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt', 'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];

const BLOCKED_IMG_CSS = ['stylesheet', ...BLOCKED_IMG];

const BLOCKED_IMG_CSS_JS = ['websocket', 'xhr', 'xmlhttprequest', 'script', ...BLOCKED_IMG_CSS];


For a new site I try to block everything in BLOCKED_IMG_CSS_JS.
In case the site does not work (pages not rendered properly), I try BLOCKED_IMG_CSS. If it still does not work - BLOCKED_IMG, and then no blocking at all.

Feel free to use/improve this approach.
Share your improvements :-)
and the code that implements blocking:

Plain Text
    preNavigationHooks: [
        async ({ page, request }) => {
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                const mode = request.userData.headLessImg;

                // Pick the blocking tier per request via userData.
                if (mode === 'noimg' && BLOCKED_IMG.includes(type)) {
                    return route.abort();
                }
                if (mode === 'noimgnocss' && BLOCKED_IMG_CSS.includes(type)) {
                    return route.abort();
                }
                if (mode === 'noimgnocssnojs' && BLOCKED_IMG_CSS_JS.includes(type)) {
                    return route.abort();
                }
                return route.continue();
            });
        },
    ],


So I set request.userData.headLessImg per request, and the code in preNavigationHooks just checks the value...
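
For completeness, a hypothetical enqueue showing how headLessImg reaches the hook (the URL is made up):

Plain Text
await crawler.addRequests([
    {
        url: 'https://example.com/listing',
        // The preNavigationHook above reads this value per request.
        userData: { headLessImg: 'noimgnocssnojs' },
    },
]);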
Great, thanks for sharing. If possible, can you please respond via DM?
Write the DM again (I can't see it now).