suitable-rose (13mo ago)

Confusion around configuring Crawlee through a tor proxy

Here is the code I'm working with currently:
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { firefox } from 'playwright';

const startUrls = ['https://crawlee.dev'];
const BBCNewsOnionStartUrls = ['https://www.bbcnewsd73hkzno2ini43t4gblxvycyac5aw4gnv7t2rccijh7745uqd.onion/'];
const proxyConfiguration = new ProxyConfiguration({ proxyUrls: ['socks5://localhost:9050'] });

const crawler = new PlaywrightCrawler({
    launchContext: {
        launcher: firefox,
        launchOptions: {
            proxy: {
                server: 'socks5://localhost:9050'
            },
            headless: false,
        }
    },
    proxyConfiguration: proxyConfiguration,
    requestHandler: async ({ request, page, log }) => {
        const pageTitle = await page.title();
        log.info(`URL: ${request.loadedUrl} | Page title: ${pageTitle}`);
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 20,
    maxRequestsPerMinute: 10,
    maxConcurrency: 1,
    minConcurrency: 1,
    sameDomainDelaySecs: 1,
});

// await crawler.run(startUrls);
await crawler.run(BBCNewsOnionStartUrls);

When I use proxyConfiguration, I run into the following error:

ERROR PlaywrightCrawler: Request failed and reached maximum retries. page.goto: NS_ERROR_UNKNOWN_PROXY_HOST

However, when I remove it, everything seems to work okay. So my question is: why isn't proxyConfiguration needed in this case? Are all requests still being directed through the Tor proxy I have running locally? Thanks!

(I have verified that the Tor proxy is running via curl --socks5-hostname localhost:9050 https://check.torproject.org/api/ip)
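
To check the second question from inside the crawler rather than with curl, one option would be to crawl https://check.torproject.org/api/ip itself and inspect the response body. Below is a minimal sketch of the parsing step only. It assumes the endpoint returns JSON with an IsTor boolean and an IP string; parseTorCheck is a hypothetical helper name, not part of Crawlee or Playwright.

```javascript
// Hypothetical helper: interprets the body returned by
// https://check.torproject.org/api/ip, which is ASSUMED to look like
// {"IsTor": true, "IP": "203.0.113.7"}.
function parseTorCheck(body) {
    let data;
    try {
        data = JSON.parse(body);
    } catch {
        // Not JSON (for example an error page), so treat it as "not via Tor".
        return { viaTor: false, ip: null };
    }
    if (data === null || typeof data !== 'object') {
        return { viaTor: false, ip: null };
    }
    return {
        // Only an explicit boolean true counts as confirmation.
        viaTor: data.IsTor === true,
        ip: typeof data.IP === 'string' ? data.IP : null,
    };
}
```

Inside requestHandler this could be called as parseTorCheck(await response.text()), with response taken from the handler context next to request and page. A viaTor value of false for that URL would suggest the browser traffic is not leaving through the local Tor proxy.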