sacred-emerald
Apify & Crawlee, 4y ago
10 replies

Ways to minimize traffic (and save money) when crawling/scraping?

1. Block images, media files and similar things
It can be done either with preNavigationHooks, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#preNavigationHooks

or with blockRequests, see https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlingContext#blockRequests

As far as I know, blockRequests has some limitations (does it work in incognito mode with Firefox as the launcher?). This was discussed in this forum, see:
crawlee-js: How to avoid requesting some static resources?
crawlee-js: Disable image in playwright
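
To make the preNavigationHooks option concrete, here is a minimal sketch, assuming Playwright's page.route() and request.resourceType() are used to abort heavy requests. The names BLOCKED_TYPES, shouldBlock and blockHeavyResources are made up for illustration, and the small interfaces only mirror the few Playwright methods used so that the snippet stands alone.

```typescript
// Resource types to drop. The strings follow the values returned by
// Playwright's request.resourceType(); which ones to block is my assumption.
const BLOCKED_TYPES = new Set<string>(["image", "media", "font"]);

// Minimal structural types that mirror only the Playwright methods used below,
// so the sketch type-checks without importing playwright or crawlee.
interface RequestLike {
  resourceType(): string;
}
interface RouteLike {
  request(): RequestLike;
  abort(): Promise<void>;
  continue(): Promise<void>;
}
interface PageLike {
  route(pattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
}

// Pure decision function: true when a request of this type should be aborted.
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical hook: it takes the crawling context, reads `page` from it,
// and registers a route handler before the navigation starts.
async function blockHeavyResources({ page }: { page: PageLike }): Promise<void> {
  await page.route("**/*", async (route) => {
    if (shouldBlock(route.request().resourceType())) {
      await route.abort(); // the blocked resource is never downloaded
    } else {
      await route.continue(); // everything else goes through unchanged
    }
  });
}
```

The intended use is to put blockHeavyResources into the preNavigationHooks array of the PlaywrightCrawler options. Two caveats, as far as I know: the Playwright docs note that enabling routing disables the HTTP cache, so this approach may work against idea 2 below, and blockRequests is documented as Chromium-only, which would explain the Firefox doubts above.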

2. Use cache
As far as I understand, you cannot have both cache AND incognito mode.
Well, there is the experimentalContainers option; in theory it should allow both cache and incognito.
I tried it, see "PlaywrightCrawler - how often browser fingerprints are changed?"
It looks like it's not really "incognito" when fingerprint.com recognizes you even when your IP is different.
(You can disprove me; maybe my test was wrong, who knows?)

3. Something else to reduce traffic?
Please suggest...

4. Actually I care more about money than about traffic...
So one idea is to use a "Datacenter proxy" instead of a "Residential" one.
I see Datacenter proxies for about $0.7 per GB, much cheaper than Residential.
Does it make sense to try?
What is your experience with Datacenter proxies?
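
For comparing options by money rather than by traffic, a rough estimate is enough. This is a minimal sketch: the $0.7 per GB figure is the datacenter price quoted above, while the page count, the average page sizes, the decimal GB billing unit and the name estimateProxyCostUsd are all hypothetical placeholders to be replaced with your own numbers.

```typescript
// Assumption: the provider bills in decimal gigabytes (10^9 bytes).
const BYTES_PER_GB = 1_000_000_000;

// Estimated proxy bill in USD for a crawl of `pages` pages,
// each transferring `avgBytesPerPage` bytes on average.
function estimateProxyCostUsd(
  pages: number,
  avgBytesPerPage: number,
  pricePerGbUsd: number,
): number {
  const totalGb = (pages * avgBytesPerPage) / BYTES_PER_GB;
  return totalGb * pricePerGbUsd;
}

// Datacenter price quoted in the post.
const DATACENTER_PRICE_PER_GB = 0.7;

// Hypothetical crawl: 100,000 pages, ~2 MB each with images, ~0.5 MB without.
const withImages = estimateProxyCostUsd(100_000, 2_000_000, DATACENTER_PRICE_PER_GB);
const withoutImages = estimateProxyCostUsd(100_000, 500_000, DATACENTER_PRICE_PER_GB);

console.log(`with images: $${withImages.toFixed(2)}`);
console.log(`without images: $${withoutImages.toFixed(2)}`);
```

Plugging in a residential price per GB from your provider gives the same comparison for the other proxy type, so you can see whether blocking resources or switching proxy type saves more.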