extended-salmon
extended-salmon•3y ago

There is a major problem: Crawlee is unable to bypass the Cloudflare protection

@Helper @gahabeen There is a major problem: Crawlee is unable to bypass the Cloudflare protection (the captcha solution was tried 5 times). The useChrome method was tried and failed. Manual login was successful when done in Chrome (outside of Node, and also tried in incognito mode, etc.): https://abrahamjuliot.github.io/creepjs/ Despite Crawlee receiving a higher trust score than the Chrome browser I am currently using, it is unable to pass the Cloudflare page.
Lukas Krivka
Lukas Krivka•3y ago
1. Try it with Playwright + Firefox (see the sketch below).
2. Make sure you have high-quality proxies. But your local IP should also be good if you can open the page normally.
3. Try with Crawler.
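A minimal sketch of suggestion 1, assuming current Crawlee and Playwright APIs (the target URL is just an example):

import { PlaywrightCrawler } from "crawlee";
import { firefox } from "playwright";

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Use Playwright's Firefox instead of the default Chromium
        launcher: firefox,
    },
    async requestHandler({ page, request }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(["https://chat.openai.com/"]);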
HonzaS
HonzaS•3y ago
There is a thread with some suggestions: https://discord.com/channels/801163717915574323/1039611311467810856/1041684802052562974 But as far as I know, for some pages no approach from Crawlee really works and you always get a captcha.
wise-white
wise-white•3y ago
Did you try BasicCrawler (the got-scraping library)? https://crawlee.dev/docs/guides/got-scraping
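A minimal got-scraping sketch for that suggestion; got-scraping generates browser-like headers and a matching TLS fingerprint by default, so this is just the plain request:

import { gotScraping } from "got-scraping";

// Browser-like headers and HTTP/2 are applied automatically
const { statusCode, body } = await gotScraping({
    url: "https://chat.openai.com/",
});
console.log(statusCode);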
extended-salmon
extended-salmonOP•3y ago
Option 3 failed: WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
Chromium worked perfectly with puppeteer-extra's StealthPlugin (it redirected to the main content without needing to solve the Cloudflare captcha):
const puppeteerVanilla = require("puppeteer");
const { addExtra } = require("puppeteer-extra");
const puppeteer = addExtra(puppeteerVanilla);

const StealthPlugin = require("puppeteer-extra-plugin-stealth");
puppeteer.use(StealthPlugin());

// Main function
puppeteer.launch({ headless: false }).then(async (browser) => {
    const page = await browser.newPage();
    await page.goto("https://chat.openai.com/");
});
Also PuppeteerCrawler worked with puppeteer-extra's StealthPlugin:
import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
const puppeteer = addExtra(puppeteerVanilla);

import StealthPlugin from "puppeteer-extra-plugin-stealth";
puppeteer.use(StealthPlugin());

// Main function
const crawler = new PuppeteerCrawler({
    launchContext: {
        launcher: puppeteer.launch({ headless: false }).then(async (browser) => {
            const page = await browser.newPage();
            await page.goto("https://chat.openai.com/");
        }),
    },
});
Lukas Krivka
Lukas Krivka•3y ago
@petrpatek Can you look into this?
HonzaS
HonzaS•3y ago
Thanks, I will try puppeteer-extra's StealthPlugin.
extended-salmon
extended-salmon•3y ago
The stealth plugin works awesome for CF bypassing. Using vanilla Puppeteer is not a good option for scraping & crawling, since it's easy to detect that the browser is driven by a script due to the fingerprinting. creepjs can be used to see the browser's trust score.
Lukas Krivka
Lukas Krivka•3y ago
Vanilla Crawlee should be better than puppeteer-stealth; if it is not, we need to fix it.
stormy-gold
stormy-gold•3y ago
You mean the PuppeteerCrawler from Crawlee? Is useFingerprints set by default, or should it be set explicitly?
const crawler = new PuppeteerCrawler({
    // ......
    browserPoolOptions: {
        useFingerprints: true,
    },
});
Btw, I also tried the StealthPlugin; I didn't feel it improved anything. YMMV.
extended-salmon
extended-salmonOP•3y ago
I also tried the StealthPlugin, I didn't feel it improved anything.
Did you say for Cloudflare? Could your IP or device information be contaminated?
extended-salmon
extended-salmonOP•3y ago
Not for Cloudflare, please test it
Lukas Krivka
Lukas Krivka•3y ago
Yeah, we need to fix it. The goal is to beat the stealth plugin. We are already likely better with Playwright and Firefox (the best combo), but we need to catch up with Puppeteer. And yes, it is on by default.
extended-salmon
extended-salmon•3y ago
I think the same
HonzaS
HonzaS•3y ago
Thank you very much. Your solution with puppeteer-extra's StealthPlugin works like a charm (at least for the URL where Crawlee, even with Playwright + Firefox, always got a 403). I am still not sure how to incorporate it into the PuppeteerCrawler, as in your example you do not use a requestQueue but have the URL in the constructor. Can you give a hint?
Lukas Krivka
Lukas Krivka•3y ago
Just do launcher: puppeteer
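Roughly like this, a sketch assuming current Crawlee APIs: the navigation moves into the requestHandler and the start URLs into run(), while launcher receives the puppeteer-extra module itself:

import { PuppeteerCrawler } from "crawlee";
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

const crawler = new PuppeteerCrawler({
    launchContext: {
        // Crawlee calls launch() on whatever module is passed here
        launcher: puppeteer,
        launchOptions: { headless: false },
    },
    async requestHandler({ page, request }) {
        console.log(`${request.url}: ${await page.title()}`);
    },
});

await crawler.run(["https://chat.openai.com/"]);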
national-gold
national-gold•3y ago
I do not know how current the problem with chat.openai.com is... Actually, a simple program with PlaywrightCrawler configured with Firefox on Linux is able to access this site; I just took a screenshot in headless mode (without a proxy, straight from a machine in a data center).
extended-salmon
extended-salmonOP•3y ago
It may be related to the trust score. I have a VPN on 24/7, but even then Crawlee is at fault, because another tool running on the same connection works.
Lukas Krivka
Lukas Krivka•3y ago
It is always a combination of IP address + browser config. You cannot really forget about one or the other when doing blocking comparisons. Your local home IP is usually as clean as it gets (residential proxies are worse, and datacenter ones even worse).
HonzaS
HonzaS•3y ago
Actually, with the stealth plugin it works even with datacenter proxies. With Crawlee's default config it did not work even with residential ones, so I think it was all about the browser config, at least in my case (G2 review pages).
extended-salmon
extended-salmonOP•3y ago
Just a note here; I will carry out detailed tests if needed: https://stateofscraping.org/ and https://github.com/unblocked-web/double-agent
optimistic-gold
optimistic-gold•3y ago
How can I bypass or avoid this page (press & hold) while scraping Capterra reviews using an Apify Actor?
optimistic-gold
optimistic-gold•3y ago
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Request blocked - received 403 status code.
2023-04-07T03:49:37.993Z {"id":"kmPcFnRhSQM8xHs","url":"https://www.capterra.com/p/107199/Medallia-Enterprise/reviews/","retryCount":3}
2023-04-07T03:49:47.490Z ERROR PuppeteerCrawler: Request failed and reached maximum retries. Error: Request blocked - received 403 status code.
Pepa J
Pepa J•3y ago
This looks like quite a specific captcha; which Actor are you using? If you are a developer, to press and hold a button you may try a solution similar to the one suggested here: https://stackoverflow.com/a/68513568 The 403 looks like your request is being blocked; have you tried using a different proxy group (e.g. RESIDENTIAL)?
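For the press-and-hold itself, a minimal Puppeteer sketch of the idea from that answer; the selector and hold duration are assumptions, not Capterra specifics:

// Hold the mouse button down over the challenge element for a few seconds
const el = await page.waitForSelector("#px-captcha"); // hypothetical selector
const box = await el.boundingBox();
await page.mouse.move(box.x + box.width / 2, box.y + box.height / 2);
await page.mouse.down();
await new Promise((resolve) => setTimeout(resolve, 10_000)); // hold ~10 s
await page.mouse.up();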
optimistic-gold
optimistic-gold•3y ago
@Pepa J Yes, I'm using the proxy below:

const proxyConfiguration = await Actor.createProxyConfiguration({
    // proxyUrls: ['http://groups-RESIDENTIAL:apify_proxy_Y0Tu3p0vZn05IDmvQlTQw9YboSwcJX4sDh56@proxy.apify.com:8000'],
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});
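Note the configuration only takes effect once it is passed to the crawler; a sketch of the wiring, assuming a standard PuppeteerCrawler setup:

const crawler = new PuppeteerCrawler({
    proxyConfiguration, // the RESIDENTIAL configuration created above
    async requestHandler({ page, request }) {
        // scrape the review page here
    },
});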
