Cheerio Crawler works for Amazon.de but gets detected as a bot at amazon.com
At a glance
The community member is experimenting with a Cheerio crawler to scrape Amazon. The crawler works for the German marketplace but gets detected as a bot for the US marketplace. The community member is using a data center proxy for Germany, which works, but the US data center proxy does not. The community member has tried using the same proxy in a real browser, and it works, so the issue seems to be with the Cheerio configuration. Some community members suggest trying other proxies, such as residential proxies, as Amazon is well-protected, and different countries may have different levels of protection. Another community member notes that the Cheerio crawler uses fingerprints, and suggests adjusting the header generator options to include mobile devices, locales, and operating systems. One community member says they have no issues as long as they send the user agent in the headers, but it's unclear if this works for the .com or .de marketplace.
Dear all, I am experimenting with the Cheerio crawler to scrape Amazon. I followed the online tutorial, and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German datacenter proxy and it works, but for the US the US datacenter proxy doesn't work. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging. Below is the configuration.
import { CheerioCrawler } from 'crawlee';

// proxyConfiguration, queue, router, rebirth_requests and dbPool are defined elsewhere in the project.
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Always returning false keeps the crawler waiting for new requests.
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});
Have you tried other proxies (other groups, maybe residential)? Amazon is quite well protected, and it's common for one country to be better protected than another for the "same" website.
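On the fingerprint suggestion from the summary (adjusting the header generator options to cover mobile devices, locales, and operating systems), here is a minimal sketch of how that could look. It assumes the crawler forwards `headerGeneratorOptions` from the request options to got-scraping, so it is set in a pre-navigation hook. The `MARKETPLACE_LOCALES` table and the helper names `headerOptionsFor` and `applyHeaderOptions` are hypothetical, and the chosen browsers, devices, and locales are illustrative guesses, not values known to get past Amazon's detection.

```javascript
// Hypothetical mapping from marketplace host to the locale the generated headers should advertise.
const MARKETPLACE_LOCALES = {
    'amazon.com': ['en-US'],
    'amazon.de': ['de-DE'],
};

// Build header generator options for a given URL (illustrative values only).
function headerOptionsFor(url) {
    const host = new URL(url).hostname.replace(/^www\./, '');
    // Fall back to the US locale for hosts that are not in the table.
    const locales = MARKETPLACE_LOCALES[host] ?? MARKETPLACE_LOCALES['amazon.com'];
    return {
        browsers: ['chrome', 'firefox'],
        devices: ['desktop', 'mobile'],
        operatingSystems: ['windows', 'macos', 'android', 'ios'],
        locales,
    };
}

// Pre-navigation hook: receives the crawling context and the request options.
// Assumption: the HTTP client reads headerGeneratorOptions from these options.
function applyHeaderOptions({ request }, gotOptions) {
    gotOptions.headerGeneratorOptions = headerOptionsFor(request.url);
}

// Wiring it into the crawler from the question (not executed here):
// const crawler = new CheerioCrawler({
//     proxyConfiguration,
//     preNavigationHooks: [applyHeaderOptions],
//     requestHandler: async (context) => router({ ...context, dbPool }),
// });
```

Matching the locale to the marketplace is only one variable. If the US site still blocks the requests, the proxy type is the next thing to change, as suggested above.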