Apify Discord Mirror

Updated 2 years ago

Ways to minimize traffic (save money) when crawling-scraping?

At a glance
The post discusses ways to minimize traffic and save money when crawling and scraping websites. Community members suggest using preNavigationHooks or blockRequests to block unnecessary requests, but note that blockRequests has limitations (it does not work with Firefox). They also discuss using Datacenter proxies instead of Residential proxies as a cheaper option, and one community member shares an approach to blocking specific resource types to reduce traffic, with code examples. There is no explicitly marked answer, but the discussion provides insights and suggestions from the community.
Useful resources

It can be done either with preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks

or with the blockRequests https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests
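
For reference, a minimal sketch of the blockRequests variant inside a PlaywrightCrawler (the extra pattern is just a placeholder):

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    preNavigationHooks: [
        async ({ blockRequests }) => {
            // Blocks common static assets by default; extraUrlPatterns
            // adds custom patterns on top of the built-in defaults.
            await blockRequests({ extraUrlPatterns: ['.mp4'] });
        },
    ],
    async requestHandler({ page }) {
        // extraction logic goes here
    },
});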

As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:
https://discord.com/channels/801163717915574323/1039557325784105002
https://discord.com/channels/801163717915574323/1019949012415160370


As far as I understand - you cannot have both: cache AND incognito mode.
Well, there is the experimentalContainers thing - in theory it should allow both cache and incognito.
I tried it, see https://discord.com/channels/801163717915574323/1060738415370453032/1060952860868739192
It looks like it's not really "incognito" when fingerprint.com recognizes you even though your IP is different.
(You can disprove me - maybe my test was wrong, who knows?)
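
If you want to reproduce the test, the relevant flags live on launchContext (a sketch; treat the exact flag names as an assumption tied to the Crawlee version current at the time):

Plain Text
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Fresh browser context per page - no shared cache or cookies.
        useIncognitoPages: true,
        // The experimental attempt to combine per-page isolation with caching.
        experimentalContainers: true,
    },
    // ...
});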


Please suggest...


So one of the ideas is to use "Datacenter" proxies instead of "Residential"...
I see Datacenter proxies for about $0.7 per GB - much cheaper than Residential.
Does it make sense to try?
What is your experience with Datacenter proxies?
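
If you test this on the Apify platform, switching proxy types comes down to the proxy group (a sketch, assuming the Apify SDK v3 Actor API):

Plain Text
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// With no groups specified, Apify Proxy picks datacenter IPs;
// residential must be requested explicitly and costs more per GB.
const proxyConfiguration = await Actor.createProxyConfiguration({
    // groups: ['RESIDENTIAL'], // the expensive option this thread is avoiding
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    // ...
});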
12 comments
Why not change the set of datacenter proxies every 5-6 days?
  1. blockRequests doesn't work in Firefox at all, sadly
  2. Generally, you reduce traffic and compute by scraping via the site's API, which is possible on most modern websites (see the sketch after this message). Sites that ship everything in HTML are heavier, but you almost never need a browser for them. https://developers.apify.com/academy/api-scraping
A browser is so much heavier that none of the above really matters if you don't need one.
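
Here is a sketch of what "scraping via API" usually looks like in practice - one plain HTTP request for JSON instead of a full browser load (the endpoint and fields are made up):

Plain Text
import { gotScraping } from 'got-scraping';

// Hypothetical JSON endpoint, the kind you find in the browser's network tab.
const { body } = await gotScraping({
    url: 'https://example.com/api/products?page=1',
    responseType: 'json',
});

console.log(body.items.length, 'items for a few KB of traffic');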
by the way:

I want to avoid downloading unnecessary files (unnecessary requests) so I'm using
the method described here:
https://discord.com/channels/801163717915574323/1039557325784105002

Here https://playwright.dev/docs/api/class-request#request-resource-type (Playwright documentation) is the list of resource types to check:
Plain Text
document, stylesheet, image, media, font, script, texttrack, xhr, fetch, eventsource, websocket, manifest, other.


BUT!
Looking here
https://developer.mozilla.org/en-US/docs/Mozilla/Add-ons/WebExtensions/API/webRequest/ResourceType
I see many more resource types. Things like
Plain Text
"csp_report", "beacon", "imageset", "ping" 

and many others.

Should I include elements from both lists in my BLOCKED array? (Imagine I'm paranoid and want to block everything except the main document.)
Sure, why not. Just keep in mind that the website might not work properly if you block some stuff.
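
If you really want the paranoid extreme, an allowlist is simpler than merging the two blocklists - let only the main document through (a sketch; most pages will not render without scripts and XHR):

Plain Text
await page.route('**/*', (route) =>
    route.request().resourceType() === 'document'
        ? route.continue()
        : route.abort(),
);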
...and here are the resource types I block:
Plain Text
const BLOCKED_IMG = ['image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt', 'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];

const BLOCKED_IMG_CSS = ['stylesheet', ...BLOCKED_IMG];

const BLOCKED_IMG_CSS_JS = ['websocket', 'xhr', 'xmlhttprequest', 'script', ...BLOCKED_IMG_CSS];


For a new site I try to block everything in BLOCKED_IMG_CSS_JS.
In case the site does not work (pages not rendered properly), I try BLOCKED_IMG_CSS. If it still does not work - BLOCKED_IMG, and then no blocking at all.

Feel free to use/improve this approach.
Share your improvements :-)
and the code that implements blocking:

Plain Text
    preNavigationHooks: [
        async ({ page, request }) => {
            await page.route('**/*', (route) => {
                const type = route.request().resourceType();
                const mode = request.userData.headLessImg;

                // Pick the blocking tier per request via userData.
                if (mode === 'noimg' && BLOCKED_IMG.includes(type)) {
                    return route.abort();
                }
                if (mode === 'noimgnocss' && BLOCKED_IMG_CSS.includes(type)) {
                    return route.abort();
                }
                if (mode === 'noimgnocssnojs' && BLOCKED_IMG_CSS_JS.includes(type)) {
                    return route.abort();
                }
                return route.continue();
            });
        },
    ],


So I set request.userData.headLessImg per request, and the code in preNavigationHooks just checks the value...
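
For completeness, a hypothetical enqueue showing how headLessImg reaches the hook (the URL is made up):

Plain Text
await crawler.addRequests([
    {
        url: 'https://example.com/listing',
        // The preNavigationHook above reads this value per request.
        userData: { headLessImg: 'noimgnocssnojs' },
    },
]);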
Great, thanks for sharing. If possible, can you please respond via DM?
Write the DM again (I can't see it now).