Throttle on 429 responses
Hi, I'm using a Cheerio crawler and things are generally working well. Occasionally, though, I get a Cloudflare 429 page, which shows up as an error on waitForSelector because the page I receive is the Cloudflare response rather than the expected content. Should Crawlee catch these responses and wait or slow down on its own? So far I've had to catch the error and pause the autoscaled pool manually (for 10 seconds). Should I be tuning other knobs as well, or instead? I haven't configured maxRequestsPerMinute yet because I'm not sure how to find or tune a good value.
hey, check out this article on crawler configuration:
https://crawlee.dev/js/docs/guides/scaling-crawlers
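As a rough starting point for the knobs that guide covers, here is a minimal sketch of a conservative options object. The option names (maxRequestsPerMinute, maxConcurrency, maxRequestRetries) are Crawlee crawler options, but the numbers are placeholder assumptions, not recommended values; the idea would be to start low and raise them while watching for 429s.

```typescript
// Conservative throttling options for a Crawlee crawler.
// All numeric values are illustrative assumptions; tune them against the target site.
const crawlerOptions = {
    // Upper bound on requests started per minute, across all concurrent tasks.
    maxRequestsPerMinute: 60,
    // Upper bound on requests processed in parallel.
    maxConcurrency: 5,
    // How many times a failed request is retried before it is marked as failed.
    maxRequestRetries: 5,
};
```

These could be spread into the crawler constructor, e.g. new CheerioCrawler({ ...crawlerOptions, requestHandler }), where requestHandler is your own handler.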
this might also help you with Cloudflare: https://docs.apify.com/academy/anti-scraping and https://docs.apify.com/academy/anti-scraping/mitigation/cloudflare-challenge.md
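If you keep pausing manually when a 429 comes back, something like the sketch below could replace the fixed 10-second wait with a growing one. The helper names (isRateLimited, backoffDelayMs) and the default delays are hypothetical and not part of Crawlee; it also only understands the seconds form of the Retry-After header, not the HTTP-date form.

```typescript
// Hypothetical helpers for reacting to 429 responses; not part of Crawlee.

/** Returns true when the HTTP status code signals rate limiting. */
function isRateLimited(statusCode: number): boolean {
    return statusCode === 429;
}

/**
 * Picks how long to pause, in milliseconds.
 * Uses a numeric Retry-After header (in seconds) when one is given;
 * otherwise doubles baseMs for each attempt. The result is capped at maxMs.
 */
function backoffDelayMs(
    attempt: number,
    retryAfter?: string,
    baseMs = 10_000, // assumption: mirrors the 10-second manual pause
    maxMs = 120_000, // assumption: arbitrary upper bound
): number {
    if (retryAfter !== undefined && retryAfter.trim() !== '') {
        const seconds = Number(retryAfter);
        if (Number.isFinite(seconds) && seconds >= 0) {
            return Math.min(seconds * 1000, maxMs);
        }
    }
    // No usable header: exponential backoff based on the attempt count.
    return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

The returned delay could then be used as the wait between pausing and resuming the autoscaled pool, with attempt counting consecutive 429s and resetting after a successful page.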