adverse-sapphire · 3y ago

Need help bypassing CF 403 Blocked

Hi guys, I'm new to this community and I'm trying to scrape allpeople.com, which is behind Cloudflare protection. After reading the docs I came up with two approaches - puppeteer-extra with the stealth plugin, and Playwright with Firefox. Both are getting 403 Blocked by CF (I'll share code snippets inside the thread). Am I doing something wrong? If not, what else can I try to get past the CF 403?
6 Replies
adverse-sapphire (OP) · 3y ago
Puppeteer-stealth
import puppeteerVanilla from "puppeteer";
import { addExtra } from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";
import { PuppeteerCrawler } from 'crawlee';
import { Actor } from 'apify';

const puppeteer = addExtra(puppeteerVanilla);
puppeteer.use(StealthPlugin());

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration();

const crawler = new PuppeteerCrawler({
    useSessionPool: true,
    persistCookiesPerSession: true,
    proxyConfiguration,
    launchContext: {
        launcher: puppeteer,
    },
    async requestHandler({ request, page }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        console.log('title', title);
    },
    async errorHandler({ session, proxyInfo }) {
        console.log('proxyInfo', proxyInfo);
        await session.retire();
    },
    maxRequestRetries: 5,
});

await crawler.run([
    { url: 'https://allpeople.com/search?ss=peter+michaek&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=' },
]);

await Actor.exit();

Playwright/firefox
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';
import { Actor } from 'apify';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            headless: true,
        },
    },
    proxyConfiguration,
    async requestHandler({ request, page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info('title', title);
    },
});

await crawler.addRequests(['https://allpeople.com/search?ss=Blanca+murillo&ss-e=&ss-p=&ss-i=&where=&industry-auto=&where-auto=']);

await crawler.run();

await Actor.exit();
HonzaS · 3y ago
Firstly, you should find out whether the problem is with the automated browser or with the proxies.
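For example, one quick way to isolate the two (a rough sketch, assuming the same Apify proxy configuration as in the snippets above; got-scraping is used here only as a plain, non-browser HTTP client and the URL is just illustrative):

import { gotScraping } from 'got-scraping';
import { Actor } from 'apify';

await Actor.init();

// Reuse the same proxy settings as the crawler so the comparison is fair.
const proxyConfiguration = await Actor.createProxyConfiguration();
const proxyUrl = await proxyConfiguration.newUrl();

// 1) Plain HTTP request through the proxy.
const viaProxy = await gotScraping({
    url: 'https://allpeople.com/',
    proxyUrl,
    throwHttpErrors: false,
});
console.log('via proxy:', viaProxy.statusCode);

// 2) Plain HTTP request without any proxy.
const direct = await gotScraping({
    url: 'https://allpeople.com/',
    throwHttpErrors: false,
});
console.log('without proxy:', direct.statusCode);

await Actor.exit();

If the request through the proxy already gets a 403 while the direct one passes, the proxy IPs are the more likely culprit; if both pass, it's probably the browser fingerprint that CF is flagging.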
rare-sapphire · 3y ago
Hey there! I briefly checked the site and it does go through CF. But it seems CF started sending a 403 status code for its check page itself. This means the site actually loads, but the crawler thinks it's blocked :/ Adding sessionPoolOptions: { blockedStatusCodes: [] } to the crawler options solves the problem.
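For reference, a minimal sketch of where that option goes, shown on the PlaywrightCrawler from the snippet above (everything else stays the same):

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    launchContext: {
        launcher: firefox,
        launchOptions: { headless: true },
    },
    // Don't treat 403 as a blocked session - CF returns 403 for its own
    // check page even though the target page loads afterwards.
    sessionPoolOptions: {
        blockedStatusCodes: [],
    },
    async requestHandler({ page, log }) {
        const title = await page.$eval('h1', (el) => el.textContent);
        log.info(`title: ${title}`);
    },
});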
adverse-sapphire (OP) · 3y ago
Thanks, @Andrey Bykov
fascinating-indigo · 3y ago
Hey, where do you find your residential proxies?
rare-sapphire · 3y ago
Sorry, what do you mean? Proxies are available in your Apify account.
