conducting faster scrapes with pagination and individual product scraping

At a glance

The community member is scraping Amazon and is concerned about how long the scrape takes. They are using a Cheerio crawler: it collects product links from each results page, scrapes each individual product page for information, and paginates through the results pages until there are no more pages left.

The community member is looking for ways to speed up their scrapes, especially since they want to add more scrapers in the future and run them concurrently. They are aiming for a scrape time of 10-15 seconds or lower, but it currently takes upwards of 1 minute.

In the comments, other community members suggest that Cheerio is a fast solution, but the speed is limited by network speed and response time. They also mention that sending subsequent requests for a single product can take extra time. The solution proposed is to increase concurrency by using more available memory and CPU power.

Additionally, the community member asks whether Crawlee (a web scraping library) has methods to scrape different sites, and links within a site, in parallel, as they observe that each results page and product page is scraped one by one. Another community member responds that Crawlee does this automatically: it opens pages in parallel when there is spare memory and CPU capacity.

hharish

Hey, I was curious: when I'm scraping Amazon, what's a reasonable time frame for the scrape, considering that I scrape each product link from the results page, then scrape each individual product page for the information, and also paginate through the results pages until there are no more pages left?
I previously scraped product info straight off the product cards on the results page, but it would sometimes give dummy links that led to an unrelated Amazon page, and the product info would be less accurate.
How can I increase the speed of my scrapes, especially since I want to add more and more scrapers in the future and run them all concurrently to save time? I'm aiming for quite a low scrape time of 10-15 seconds or lower, and it's currently taking upwards of 1 minute.
This is a Cheerio crawler.

3 comments

Andrey Bykov

Cheerio is pretty much the fastest solution (the only thing faster would be Cheerio plus API/XHR links that return structured JSON). So with this you're pretty much limited by network speed and response time. Also, keep in mind that if you send several subsequent requests for one product, it will take some extra time. Otherwise, higher concurrency (more available memory and CPU power) solves the problem.
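To put rough numbers on the latency and concurrency point, here is a back-of-the-envelope sketch rather than a measurement. It assumes every request takes about the same time and that up to `concurrency` requests are in flight at once; the function name `estimateSeconds` and all of the numbers are hypothetical. In Crawlee, the corresponding settings on `CheerioCrawler` are the `minConcurrency` and `maxConcurrency` options.

```typescript
// Rough model (an assumption, not Crawlee's scheduler): requests are sent in
// "waves" of up to `concurrency`, and each wave costs one request latency.
function estimateSeconds(
  resultPages: number,
  productsPerPage: number,
  latencySeconds: number,
  concurrency: number,
): number {
  // One request per results page plus one per product page.
  const totalRequests = resultPages + resultPages * productsPerPage;
  const waves = Math.ceil(totalRequests / concurrency);
  return waves * latencySeconds;
}

// Hypothetical workload: 5 results pages, 20 products each, 1.5 s per request.
console.log(estimateSeconds(5, 20, 1.5, 1));  // 157.5
console.log(estimateSeconds(5, 20, 1.5, 10)); // 16.5
console.log(estimateSeconds(5, 20, 1.5, 20)); // 9
```

Under these made-up numbers, a 10-15 second target is only reachable with fairly high concurrency. The model also ignores that each next results page is only discovered after the previous one has loaded, so real runs will be somewhat slower.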

hharish

Are there any methods in Crawlee to scrape different sites, and links within a specific site, in parallel? I see that each results page on the site is scraped one by one, and so is each product page. Is there any way to do that?

Andrey Bykov

It's done automatically out of the box. When there's spare memory/CPU capacity, the autoscaled pool starts more requests. Every page still has to be opened, but Crawlee opens these pages in parallel.
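The idea of a request queue drained by a pool can be illustrated with a small dependency-free simulation. This is a simplified model and not Crawlee's actual implementation: the site layout, the `LIST`/`DETAIL` labels, and the `discoverLinks` and `simulateCrawl` helpers are all hypothetical, and a fixed `concurrency` stands in for what the autoscaled pool would adjust based on spare memory and CPU.

```typescript
type CrawlRequest = { url: string; label: "LIST" | "DETAIL" };

// Hypothetical site: `pages` results pages, each linking to `perPage` products
// and, except for the last one, to the next results page.
function discoverLinks(req: CrawlRequest, pages: number, perPage: number): CrawlRequest[] {
  if (req.label === "DETAIL") return []; // product pages enqueue nothing further
  const page = Number(req.url.split("=")[1]);
  const found: CrawlRequest[] = [];
  for (let i = 0; i < perPage; i++) {
    found.push({ url: `product/${page}-${i}`, label: "DETAIL" });
  }
  if (page < pages) found.push({ url: `list?page=${page + 1}`, label: "LIST" });
  return found;
}

// Each round stands for one request latency; up to `concurrency` queued
// requests are handled per round, and whatever they discover is enqueued.
function simulateCrawl(
  pages: number,
  perPage: number,
  concurrency: number,
): { rounds: number; handled: number } {
  const queue: CrawlRequest[] = [{ url: "list?page=1", label: "LIST" }];
  let rounds = 0;
  let handled = 0;
  while (queue.length > 0) {
    const batch = queue.splice(0, concurrency);
    for (const req of batch) {
      queue.push(...discoverLinks(req, pages, perPage));
      handled++;
    }
    rounds++;
  }
  return { rounds, handled };
}

console.log(simulateCrawl(3, 4, 1)); // { rounds: 15, handled: 15 }
console.log(simulateCrawl(3, 4, 5)); // { rounds: 4, handled: 15 }
```

Both runs handle the same 15 pages. With a concurrency of 5, the product pages from one results page are fetched in the same round as the next results page, so the crawl finishes in 4 rounds instead of 15.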

