Apify Discord Mirror


Pass the Cloudflare browser check

At a glance

The community member is having trouble passing the Cloudflare browser check when using the Crawlee PlaywrightCrawler to scrape https://www.g2.com/. They have tried various approaches, including residential proxies, different browsers, and headful/headless configurations, but nothing has worked. Another community member suggests using Playwright with Firefox, but the provided run still returns a 403 error. The discussion reveals that Cloudflare has different protection modes, and the solution seems to be using CheerioCrawler together with Playwright Firefox: the Firefox browser is used to solve the JavaScript challenge, and its cookies and headers are saved and reused for subsequent requests. The community members also discuss the importance of header order and of using internal fetch calls from within the browser to bypass the protection. Finally, one community member claims to have solved the issue but does not share the solution in the comments.

Anybody know how to pass the Cloudflare browser check with the Crawlee PlaywrightCrawler?
The site I have a problem with: https://www.g2.com/
I have tried residential proxies, no proxies, the Chrome and Firefox browsers, headful and headless, but nothing works.
My own Chrome browser passes the check both without proxies and with residential proxies, so I guess the proxy is not the problem. The problem is that Cloudflare somehow knows it is an automated browser.
In the Apify Store there is a working scraper for G2, but it is written in Python; at least I know it is possible to do.
Playwright with FF should work, I used it to bypass CF a few months ago. Please share a run or a snapshot, and make sure you await some real content element; on page load the CF checkup will keep running for a while.
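A minimal sketch of that "wait for real content" idea with Crawlee's PlaywrightCrawler and Playwright Firefox (the selector and timeout are assumptions, not from this thread):
```typescript
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    // "Playwright with FF": launch Playwright's Firefox instead of the default Chromium.
    launchContext: { launcher: firefox },
    requestHandler: async ({ page, request, log }) => {
        // The Cloudflare checkup keeps running for a while after the load event,
        // so wait for an element that only exists on the real page.
        await page.waitForSelector('a[href*="/products/"]', { timeout: 90_000 });
        log.info(`Real content reached for ${request.url}`);
    },
});

await crawler.run(['https://www.g2.com/']);
```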
Here is the run: https://console.apify.com/view/runs/UHYyD8JIrtj68ePpW, but there is not much to see, it just returns a 403.
I got through CF many times with Playwright, but now it looks like they have improved the protection.
Looks like some IPs are working and some are not; content was reached under Chrome after two retries: https://console.apify.com/view/runs/MWefTdPk6wfZZ3rz5
I took your config and just changed the URL to https://www.g2.com/products/monday-com-monday-com/reviews and the number of retries to 20, but no luck: https://console.apify.com/view/runs/3RQyInzk9aQ0SEOJS
Any suggestions would be greatly appreciated.
could add it to his repository of Cloudflare sites
We chatted about this in private with , as I encountered CF blocking too...

If I understand it correctly, CF has two modes of bot protection (with kinda confusing names TBH)
  • a) Bot Management – basic
  • b) Super Bot Fight Mode – advanced
The sites I’m scraping seem to use a). The solution to that seems to be pretty easy:
  • using CheerioCrawler with playwright:firefox Dockerfile
  • in createSessionFunction: open Firefox (via Playwright), go to the site, let Firefox solve the JavaScript challenge, and save all the cookies and request headers to the session.
  • in preNavigationHooks: get the stored cookies/headers from the session and set them on gotScraping (see the sketch after this list).
This solution works for me both locally and on the Apify platform, without any proxies. Beware that it probably only works for sites that use the basic bot protection mode.
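A rough sketch of that setup, reconstructed from the description above (not the exact code from this thread; the selector, pool size, and header handling are assumptions):
```typescript
import { CheerioCrawler, Session } from 'crawlee';
import { firefox } from 'playwright';

const TARGET = 'https://www.g2.com/';

const crawler = new CheerioCrawler({
    useSessionPool: true,
    sessionPoolOptions: {
        maxPoolSize: 1,
        // Let a real Firefox solve the Cloudflare JS challenge once per session.
        createSessionFunction: async (sessionPool) => {
            const session = new Session({ sessionPool });
            const browser = await firefox.launch();
            const page = await browser.newPage();

            // Remember the headers Firefox actually sent; their order matters to CF.
            let navigationHeaders = {};
            page.on('request', (req) => {
                if (req.isNavigationRequest()) navigationHeaders = req.headers();
            });

            await page.goto(TARGET);
            // Wait until real content is present, i.e. the challenge has been solved.
            await page.waitForSelector('a[href*="/products/"]', { timeout: 60_000 });

            session.setCookies(await page.context().cookies(), TARGET);
            session.userData.headers = navigationHeaders;

            await browser.close();
            return session;
        },
    },
    // Reuse the captured cookies and headers for the plain got-scraping requests.
    preNavigationHooks: [
        async ({ session, request }, gotOptions) => {
            gotOptions.useHeaderGenerator = false; // keep the browser's headers intact
            gotOptions.headers = {
                ...session.userData.headers,
                cookie: session.getCookieString(request.url),
            };
        },
    ],
    requestHandler: async ({ request, $ }) => {
        console.log(request.url, $('title').text());
    },
});

await crawler.run([TARGET]);
```
On the Apify platform this assumes the actor image ships Playwright Firefox next to CheerioCrawler (the playwright:firefox Dockerfile mentioned above).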
What do you use for debugging with MITM?
Is it mitmproxy?
https://mitmproxy.org/
mitmproxy would probably work too, but I like nice things so I use https://proxyman.io/ πŸ˜„
It was crucial for me in discovering that it's the header order that causes the issue.
Attachments
CleanShot_2022-11-13_at_21.28.02.png
CleanShot_2022-11-13_at_21.28.14.png
Wow, never heard about header order messing things up
I used the same approach but with internal fetch calls from inside the browser; IMHO it might be more reliable, since they should be doing something logically equivalent to a "heartbeat" check to see if the web visitor is still online.
This should also work regardless of their internal protection mode: if the page context is reached, then fetch is expected to work, otherwise they (CF) would not be able to support web apps.
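A minimal sketch of that in-browser fetch variant (an assumed implementation; the selector and URLs are only examples):
```typescript
import { firefox } from 'playwright';

const browser = await firefox.launch();
const page = await browser.newPage();

await page.goto('https://www.g2.com/');
// Make sure the CF challenge is solved before issuing any fetch calls.
await page.waitForSelector('a[href*="/products/"]', { timeout: 60_000 });

// Subsequent requests go through fetch() inside the already-trusted page context,
// so they carry the browser's cookies and fingerprint.
const html = await page.evaluate(async (url) => {
    const res = await fetch(url, { headers: { accept: 'text/html' } });
    return res.text();
}, 'https://www.g2.com/products/monday-com-monday-com/reviews');

console.log(html.length);
await browser.close();
```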
It was the first time for me too, but it's probably not too uncommon, as there's logic exactly for this in header-generator: https://github.com/apify/header-generator/blob/master/src/header-generator.ts#L208
Attachments
CleanShot_2022-11-16_at_07.21.122x.png
CleanShot_2022-11-16_at_07.20.592x.png
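For reference, this is roughly how header-generator produces a browser-consistent header set in a realistic order (based on the library's README; the options below are just an example):
```typescript
import { HeaderGenerator } from 'header-generator';

const headerGenerator = new HeaderGenerator({
    browsers: [{ name: 'firefox', minVersion: 90 }],
    devices: ['desktop'],
    operatingSystems: ['linux'],
});

// Returns a full header set, ordered the way the chosen browser would send it.
const headers = headerGenerator.getHeaders();
console.log(headers);
```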
Thanks, good to keep in mind