Apify Discord Mirror

Updated 2 years ago

Taking a list of scraped URLs and conducting multiple new scrapes

At a glance

The community member has code that scrapes product URLs from an Amazon results page, but they are unable to take each link and scrape the needed information in another crawler. They ask if they need another Cheerio router, and how they can add the scraped links to a request list and request queue and then scrape the information from those URLs.

The comments point out that the community member is creating a second router, which is not necessary. They should use one router per crawler and differentiate the routes using request labels. The syntax for adding handlers and requests is also incorrect. The comments provide relevant links to the Crawlee documentation for the correct usage of the CheerioCrawler and createCheerioRouter functions.
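In other words, a minimal sketch of the pattern the comments describe (the 'SEARCH_PAGE' and 'PRODUCT' labels are illustrative, not from the thread):

import { CheerioCrawler, createCheerioRouter } from 'crawlee';

const router = createCheerioRouter();

// One handler per route, distinguished by request.label
router.addHandler('SEARCH_PAGE', async ({ $, crawler }) => { /* enqueue product links */ });
router.addHandler('PRODUCT', async ({ $ }) => { /* scrape product details */ });

const crawler = new CheerioCrawler({ requestHandler: router });
// The start request carries the label that selects its handler
await crawler.run([{ url: 'https://www.amazon.com/s?k=computers', label: 'SEARCH_PAGE' }]);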

I have this code that scrapes product URLs from an Amazon results page.
I am able to successfully scrape the product URLs, but I'm unable to take each link and scrape the needed info in another crawler.
Do I need another Cheerio router?
Also, how can I take each link once scraped, add it to a RequestList and RequestQueue, and then scrape the information from the URLs in that queue?
7 comments
Here is the code:
main.js:
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

const searchKeywords = 'computers'; // Replace with desired search keywords
const searchUrl = `https://www.amazon.com/s?k=${searchKeywords}`;
const startUrls = [searchUrl];

const crawler = new CheerioCrawler({
    // Start the crawler right away and ensure there will always be 5 concurrent requests ran at any time
    minConcurrency: 5,
    // Ensure the crawler doesn't exceed 15 concurrent requests ran at any time
    maxConcurrency: 15,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,
    // Define router to run crawl
    requestHandler: router,
});

await crawler.run(startUrls);
routes.js:
import { CheerioCrawler, createCheerioRouter } from 'crawlee';
import fs from 'fs';

export const router = createCheerioRouter();
const linkArray = [];

router.addHandler(async ({ $ }) => {
    // Scrape product links from search results page
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    console.log(`Found ${productLinks.length} product links`);

    // Add each product link to array (this is inside router[01])
    for (const link of productLinks) {
        const router02 = createCheerioRouter();
        router02.addDefaultHandler(async ({ $ }) => {
            const productInfo = {};
            productInfo.storeName = 'Amazon';
            productInfo.productTitle = $('span.a-size-large.product-title-word-break').text().trim();
            productInfo.productDescription = $('div.a-row.a-size-base.a-color-secondary').text().trim();
            productInfo.salePrice = $('span.a-offscreen').text().trim();
            productInfo.originalPrice = $('span.a-price.a-text-price').text().trim();
            productInfo.reviewScore = $('span.a-icon-alt').text().trim();
            productInfo.shippingInfo = $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim();
            // Write product info to JSON file
            if (productInfoList.length > 0) {
                const rawData = JSON.stringify(productInfo, null, 2);
                fs.appendFile('rawData.json', rawData, (err) => {
                    if (err) throw err;
                    console.log(`Product info written to rawData.json for ${link}`);
                });
            }
        });

        // router02.queue.addRequest({ url: link });
        const amazon = new CheerioCrawler({
            // Start the crawler right away and ensure there will always be 1 concurrent request ran at any time
            minConcurrency: 1,
            // Ensure the crawler doesn't exceed 10 concurrent requests ran at any time
            maxConcurrency: 10,
            // ...but also ensure the crawler never exceeds 400 requests per minute
            maxRequestsPerMinute: 400,
            // Define route for crawler to run on
            requestHandler: router02,
        });
        await amazon.run(link);
        console.log('running link');
    }
});
Here is the console output I receive:
INFO CheerioCrawler: Starting the crawl
Found 36 product links
WARN CheerioCrawler: Reclaiming failed request back to the list or queue. Expected requests to be of type array but received type string {"id":"b1h8C8G7WjcTMKd","url":"https://www.amazon.com/s?k=computers","retryCount":1}
INFO CheerioCrawler: Crawl finished. Final request statistics: {"requestsFinished":0,"requestsFailed":1,"retryHistogram":[null,null,null,1],"requestAvgFailedDurationMillis":1880,"requestAvgFinishedDurationMillis":null,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":3,"requestTotalDurationMillis":1880,"requestsTotal":1,"crawlerRuntimeMillis":18054}
Here is the PDF as well with the code, in case the indentation and what goes under each function is unclear.
That's a lot of code, but straight away I see that you're creating a second router. Why? You should use one router per crawler and use different routes, which you can differentiate with request.label. router.addHandler(async ...) is not correct syntax either - you're not providing a label there. It should be either a default handler, or router.addHandler('SEARCH_PAGE', async ...), and the first request, instead of just a URL, becomes { url: searchUrl, label: 'SEARCH_PAGE' }. router02.queue.addRequest is also not correct - it should be crawler.addRequests([...]), where crawler is part of the handler context. Some relevant links:
https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler#router
https://crawlee.dev/api/cheerio-crawler/function/createCheerioRouter
https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext
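Putting that advice together, a rough sketch of a single-router routes.js (the selectors and field names are reused from the original code; the 'PRODUCT' label and the use of Dataset instead of fs.appendFile are illustrative assumptions, not from the thread):

// routes.js - one router for the whole crawler
import { createCheerioRouter, Dataset } from 'crawlee';

export const router = createCheerioRouter();

// Default handler: the search results page (the start URL carries no label).
router.addDefaultHandler(async ({ $, crawler, log }) => {
    const productLinks = $('h2 a')
        .map((_, el) => 'https://www.amazon.com' + $(el).attr('href'))
        .get();
    log.info(`Found ${productLinks.length} product links`);

    // Enqueue the product pages into the same crawler's queue, tagged with a label.
    await crawler.addRequests(productLinks.map((url) => ({ url, label: 'PRODUCT' })));
});

// Labeled handler: one product detail page per request.
router.addHandler('PRODUCT', async ({ $, request, log }) => {
    const productInfo = {
        storeName: 'Amazon',
        productTitle: $('span.a-size-large.product-title-word-break').text().trim(),
        productDescription: $('div.a-row.a-size-base.a-color-secondary').text().trim(),
        salePrice: $('span.a-offscreen').text().trim(),
        originalPrice: $('span.a-price.a-text-price').text().trim(),
        reviewScore: $('span.a-icon-alt').text().trim(),
        shippingInfo: $('div.a-row.a-size-base.a-color-secondary.s-align-children-center').text().trim(),
        url: request.url,
    };
    // Store the result in the default dataset instead of appending to a JSON file by hand.
    await Dataset.pushData(productInfo);
    log.info(`Stored product info for ${request.url}`);
});

main.js can stay as it is, since the unlabeled start URL falls through to the default handler; alternatively, register router.addHandler('SEARCH_PAGE', ...) and start the crawl with await crawler.run([{ url: searchUrl, label: 'SEARCH_PAGE' }]), as the reply suggests.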