Cheerio Crawler works for Amazon.de but gets detected bot at amazon.com

At a glance

The community member is experimenting with a Cheerio crawler to scrape Amazon. The crawler works for the German marketplace but is detected as a bot on the US marketplace. A German data center proxy works for Amazon.de, but a US data center proxy does not work for Amazon.com. The same US proxy works in a real browser and in Playwright, so the community member suspects the Cheerio configuration rather than the proxy. Another community member suggests trying other proxies, such as residential ones, because Amazon is well protected and protection can differ between countries; they also confirm that CheerioCrawler uses got-scraping, which generates fingerprints. The original poster reports better results after restricting the header generator options to mobile devices, the en-US locale, and the iOS and Android operating systems. One community member says they have no problems on the .com marketplace as long as they send the user agent in the headers.

curioussoul

Dear all, I am experimenting with CheerioCrawler to scrape Amazon. I followed the online tutorial and it works for Germany, but the same crawler gets detected as a bot for the US. For Germany I am using a German data center proxy and it works, but for the US the US data center proxy doesn't work. Below is the configuration. I am building an Amazon scraper for multiple marketplaces, and this inconsistency makes it challenging.

// proxyConfiguration, queue, router, dbPool and rebirth_requests are defined elsewhere.
const crawler = new CheerioCrawler({
    proxyConfiguration,
    requestQueue: queue,
    useSessionPool: true,
    persistCookiesPerSession: true,
    maxRequestRetries: 20,
    maxRequestsPerMinute: 250,
    autoscaledPoolOptions: {
        maxConcurrency: 100,
        minConcurrency: 5,
        isFinishedFunction: async () => {
            // Tell the pool whether it should finish
            // or wait for more tasks to become available.
            // Returning false keeps the pool waiting for new requests.
            return false;
        },
    },
    failedRequestHandler: async (context) => rebirth_requests({ ...context }),
    requestHandler: async (context) => router({ ...context, dbPool }),
    // sessionPoolOptions: { blockedStatusCodes: [] },
});

9 comments

curioussoul

When I use this proxy on my system with a real browser, it works. So I assume the proxy is fine and the only problem is the config in Cheerio.

Andrey Bykov

Have you tried other proxies (other groups, maybe residential)? Amazon is quite well protected, and it's common for one country to be better protected than another for the "same" website.
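The suggestion above could be tried per marketplace by switching the proxy country and, optionally, the proxy group. A minimal sketch, assuming Apify Proxy is in use (the thread does not say which provider is used); `PROXY_COUNTRY` and `proxyOptionsFor` are hypothetical names made up for illustration:

```javascript
// Hypothetical mapping from marketplace domain to proxy country code.
const PROXY_COUNTRY = {
    'amazon.com': 'US',
    'amazon.de': 'DE',
};

// Returns an options object in the shape accepted by the Apify SDK's
// Actor.createProxyConfiguration(). The RESIDENTIAL group only works
// if the account has access to residential proxies (an assumption here).
function proxyOptionsFor(marketplace, useResidential) {
    const countryCode = PROXY_COUNTRY[marketplace];
    if (!countryCode) {
        throw new Error(`No proxy country configured for ${marketplace}`);
    }
    const options = { countryCode };
    if (useResidential) {
        options.groups = ['RESIDENTIAL'];
    }
    return options;
}

console.log(proxyOptionsFor('amazon.com', true));
console.log(proxyOptionsFor('amazon.de', false));
```

Keeping the choice in one function makes it easy to test residential proxies for the US only while leaving the working German setup unchanged.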

curioussoul

But the point is that the same proxy works in the browser. So an HTTP call in a browser with the same proxy works, but in Cheerio it doesn't.

To me it feels like the headers, cookies, etc. sent when a browser is used are different from what is sent by Cheerio.

Are any fingerprints used when we scrape via Cheerio?
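One way to check this suspicion is to copy the request headers from the browser's network tab and compare them with the headers the crawler sends. A minimal sketch in plain JavaScript; `diffHeaders` and both sample header objects are hypothetical and made up for illustration, not captured traffic:

```javascript
// Hypothetical helper: lists headers that are missing or different
// between a browser request and a crawler request.
function diffHeaders(browserHeaders, crawlerHeaders) {
    // Header names are case-insensitive, so lower-case them first.
    const normalise = (headers) =>
        Object.fromEntries(
            Object.entries(headers).map(([name, value]) => [name.toLowerCase(), String(value)]),
        );
    const browser = normalise(browserHeaders);
    const crawler = normalise(crawlerHeaders);

    const differences = [];
    const allNames = new Set([...Object.keys(browser), ...Object.keys(crawler)]);
    for (const name of allNames) {
        if (browser[name] !== crawler[name]) {
            differences.push({ name, browser: browser[name], crawler: crawler[name] });
        }
    }
    return differences;
}

// Made-up sample values for illustration only.
const fromBrowser = { 'Accept-Language': 'en-US,en;q=0.9', 'User-Agent': 'ExampleBrowser/1.0' };
const fromCrawler = { 'accept-language': 'de-DE,de;q=0.9', 'user-agent': 'ExampleBrowser/1.0' };

console.log(diffHeaders(fromBrowser, fromCrawler));
```

A header that appears on only one side shows up with `undefined` for the other, which makes missing headers easy to spot.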

curioussoul

I tried out the same proxy in Playwright and it works. So there must be some settings in Cheerio that are different.

Andrey Bykov

CheerioCrawler uses got-scraping, and yes, it uses fingerprints.

curioussoul

preNavigationHooks: [
    async (crawlingContext, gotOptions) => {
        // ...
        // Restrict the generated headers to US-English mobile browsers.
        gotOptions.headerGeneratorOptions = {
            // browsers: [
            //     {
            //         name: 'chrome',
            //         minVersion: 90,
            //         maxVersion: 100
            //     }
            // ],
            devices: ['mobile'],
            locales: ['en-US'],
            operatingSystems: ['ios', 'android'],
        };
    },
],

With these settings it's better now.
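Since the scraper targets several marketplaces, the hook above could choose the locale from the request URL instead of hard-coding `en-US`. A minimal sketch under that assumption; `MARKETPLACE_LOCALES`, `headerOptionsFor` and `localeHook` are hypothetical names, and the thread does not confirm that a `de-DE` locale is what makes Amazon.de work:

```javascript
// Hypothetical mapping from marketplace domain to the locale
// its visitors would typically send in Accept-Language.
const MARKETPLACE_LOCALES = {
    'amazon.com': 'en-US',
    'amazon.de': 'de-DE',
};

// Builds headerGeneratorOptions for one marketplace, reusing the
// devices and operating systems from the hook above.
function headerOptionsFor(marketplace) {
    const locale = MARKETPLACE_LOCALES[marketplace];
    if (!locale) {
        throw new Error(`No locale configured for ${marketplace}`);
    }
    return {
        devices: ['mobile'],
        locales: [locale],
        operatingSystems: ['ios', 'android'],
    };
}

// Pre-navigation hook that picks the options from the request URL.
const localeHook = async ({ request }, gotOptions) => {
    const host = new URL(request.url).hostname.replace(/^www\./, '');
    gotOptions.headerGeneratorOptions = headerOptionsFor(host);
};
```

`localeHook` could then be listed in `preNavigationHooks` in place of the fixed settings.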

harish

For me, I have no problems as long as I send the user agent in the headers.

curioussoul

OK, for .com or for .de?

harish

.com
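The thread does not show how the user agent is sent. One possible way, sketched here as an assumption, is a pre-navigation hook that adds the header to the request options. `USER_AGENT` and `userAgentHook` are hypothetical names, and the user-agent string is only an example value:

```javascript
// Example user-agent string (an assumption); substitute one copied from a real browser.
const USER_AGENT =
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    + '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36';

// Pre-navigation hook that sets the header while keeping
// any headers that were already present on the request options.
const userAgentHook = async (crawlingContext, gotOptions) => {
    gotOptions.headers = {
        ...gotOptions.headers,
        'user-agent': USER_AGENT,
    };
};
```

It would be registered the same way as the other hook, in the crawler's `preNavigationHooks` array. Whether a fixed user agent stays consistent with the other generated headers is not covered in the thread.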


Apify Discord Mirror

Cheerio Crawler works for Amazon.de but gets detected bot at amazon.com