sensitive-blue · 3y ago

double import problem

I have my crawler set up to crawl and scrape a couple of sites, but I get an import problem when importing the router (which is the same for both sites, just using a different route per site) from both of the site files. If I only import it from one site, it only runs that site. How do I import it so it runs multiple sites, and so it can scale up to more sites in the near future? It can successfully scrape Amazon and eBay (the eBay tags are kinda inaccurate), but only if I use the router from either eBay or Amazon and remove the other URL from startUrls; otherwise it gives an error for not having the AMAZON or EBAY label anywhere.
sensitive-blue (OP) · 3y ago
main.js:

import { CheerioCrawler, ProxyConfiguration, AutoscaledPool, SessionPool } from 'crawlee';
import { router } from './amazon.js';
import { router } from './ebay.js';

const searchKeywords = 'hydroflasks'; // Replace with desired search keywords

const startUrls = [
    { url: `https://www.amazon.com/s?k=${searchKeywords}`, label: 'AMAZON' },
    { url: `https://www.ebay.com/sch/i.html?_nkw=${searchKeywords}`, label: 'EBAY' },
];

const crawler = new CheerioCrawler({
    useSessionPool: true,
    sessionPoolOptions: { maxPoolSize: 100 },
    // Set to true if you want the crawler to save cookies per session,
    // and set the cookie header to request automatically (default is true).
    persistCookiesPerSession: true,
    // ...but also ensure the crawler never exceeds 250 requests per minute
    maxRequestsPerMinute: 250,

    // Define router to run crawl
    requestHandler: router
});

export { crawler }

await crawler.run(startUrls);

error message when run:
file:///C:/Users/haris/OneDrive/Documents/GitHub/crawleeScraper/my-crawler/src/main.js:3
import { router } from './ebay.js';
^^^^^^

SyntaxError: Identifier 'router' has already been declared
at ESMLoader.moduleStrategy (node:internal/modules/esm/translators:119:18)
at ESMLoader.moduleProvider (node:internal/modules/esm/loader:468:14)
at async link (node:internal/modules/esm/module_job:68:21)
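For context, this SyntaxError is raised by the ES module system, not by crawlee: both import statements try to declare the same top-level identifier router in main.js. Renaming one side with an import alias removes the syntax error, though CheerioCrawler accepts only one requestHandler, so you would still need to dispatch between the two routers yourself. A rough sketch of that workaround, for illustration only (the single shared router suggested in the replies below is the simpler fix):

// Aliases give each router a distinct local name, which avoids the
// "Identifier 'router' has already been declared" error.
import { router as amazonRouter } from './amazon.js';
import { router as ebayRouter } from './ebay.js';

// A crawler takes one requestHandler, so dispatch by request label.
// (A router returned by createCheerioRouter() is itself callable.)
const requestHandler = async (context) => {
    if (context.request.label?.startsWith('AMAZON')) {
        return amazonRouter(context);
    }
    return ebayRouter(context);
};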
amazon.js:
import { createCheerioRouter } from 'crawlee';
import fs, { link } from 'fs';
import { crawler } from './main.js';

export const router = createCheerioRouter();

router.addHandler('AMAZON', async ({ $, crawler }) => {
    console.log('starting link scrape')
    // Scrape product links from search results page
    const productLinks = $('h2 a').map((_, el) => 'https://www.amazon.com' + $(el).attr('href')).get();
    console.log(`Found ${productLinks.length} product links for Amazon`);
    console.log(productLinks)

    // Add each product link to request queue
    for (const link of productLinks) {
        const result = await crawler.addRequests([{ url: link, label: 'AMAZON_PRODUCT' }])
        await result.waitForAllRequestsToBeAdded;
    }

    // Check if there are more pages to scrape
    const nextPageLink = $('a[title="Next"]').attr('href');
    if (nextPageLink) {
        // Construct the URL for the next page
        const nextPageUrl = 'https://www.amazon.com' + nextPageLink;

        // Add the request for the next page
        const result = await crawler.addRequests([{ url: nextPageUrl, label: 'AMAZON' }]);
        await result.waitForAllRequestsToBeAdded;
    }
});
router.addHandler('AMAZON_PRODUCT', async ({ $, request }) => {
    const productInfo = {};
    productInfo.link = request.url;
    productInfo.storeName = 'Amazon';
    productInfo.productTitle = $('span#productTitle').text().trim();
    productInfo.productDescription = $('div#productDescription').text().trim();
    productInfo.salePrice = $('span#priceblock_ourprice').text().trim();
    productInfo.originalPrice = $('span.priceBlockStrikePriceString').text().trim();
    productInfo.reviewScore = $('span#acrPopover').attr('title');
    productInfo.shippingInfo = $('div#ourprice_shippingmessage').text().trim();

    // Write product info to JSON file
    if (Object.keys(productInfo).length > 0) {
        const rawData = JSON.stringify(productInfo, null, 2);
        fs.appendFile('rawData.json', rawData, (err) => {
            if (err) throw err;
        });
    }
    console.log(`Product info written to rawData.json for amazon`);
});
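A side note on the AMAZON_PRODUCT handler above: fs.appendFile writes each JSON object directly after the previous one, so rawData.json ends up as concatenated objects rather than one valid JSON document. Crawlee's built-in Dataset is the usual way to store one record per page; a minimal sketch of the same handler using it (field list shortened for brevity):

import { Dataset } from 'crawlee';

router.addHandler('AMAZON_PRODUCT', async ({ $, request }) => {
    const productInfo = {
        link: request.url,
        storeName: 'Amazon',
        productTitle: $('span#productTitle').text().trim(),
        // ...remaining fields as in the original handler...
    };
    // Each call appends one record to ./storage/datasets/default.
    await Dataset.pushData(productInfo);
});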
ebay.js:
import { createCheerioRouter } from 'crawlee';
import fs, { link } from 'fs';
import { crawler } from './main.js';

export const router = createCheerioRouter();

router.addHandler('EBAY', async ({ $, crawler }) => {
    console.log('starting link scrape')
    // Scrape product links from search results page
    const productLinks = $('a.iteminfo-link').map((_, el) => $(el).attr('href')).get();
    console.log(`Found ${productLinks.length} product links for eBay`);

    // Add each product link to request queue
    for (const link of productLinks) {
        const result = await crawler.addRequests([{ url: link, label: 'EBAY_PRODUCT' }])
        await result.waitForAllRequestsToBeAdded;
    }
});

// Label fixed to 'EBAY_PRODUCT' to match the label used in addRequests above
router.addHandler('EBAY_PRODUCT', async ({ $, request }) => {
    const productInfo = {};
    productInfo.link = request.url;
    productInfo.storeName = 'eBay';
    productInfo.productTitle = $('h3.s-itemtitle').text().trim();
    productInfo.productDescription = $('div.a-section.a-spacing-small.span.a-size-base-plus').text().trim();
    productInfo.salePrice = $('span.s-itemprice').text().trim();
    productInfo.originalPrice = $('span.s-itemprice--original').text().trim();
    productInfo.reviewScore = $('div.s-itemreviews').text().trim();
    productInfo.shippingInfo = $('span.s-itemshipping').text().trim();

    // Write product info to JSON file
    if (Object.keys(productInfo).length > 0) {
        const rawData = JSON.stringify(productInfo, null, 2);
        fs.appendFile('rawData.json', rawData, (err) => {
            if (err) throw err;
        });
    }
});

(haven't added pagination to this scraper yet)
sensitive-blue · 3y ago
I already replied in another thread. You should use one router instance.
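In other words (a minimal sketch, with illustrative handler bodies): a single createCheerioRouter() instance can register any number of labeled handlers, so one router can serve every site:

import { createCheerioRouter } from 'crawlee';

// One router instance for the whole crawler.
export const router = createCheerioRouter();

// Each site registers its handlers on that same instance; the
// label on each request decides which handler runs.
router.addHandler('AMAZON', async ({ $, crawler }) => { /* amazon search page */ });
router.addHandler('AMAZON_PRODUCT', async ({ $, request }) => { /* amazon product */ });
router.addHandler('EBAY', async ({ $, crawler }) => { /* ebay search page */ });
router.addHandler('EBAY_PRODUCT', async ({ $, request }) => { /* ebay product */ });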
sensitive-blue (OP) · 3y ago
I am using the same router instance for all of the sites, just with different route handlers. The problem comes when I import that router instance from different files: each site's scrape lives in its own file, so I need to import the same router instance into main.js, and that is where I have the problem.
sensitive-blue · 3y ago
You're creating two instances: one in amazon.js (export const router = createCheerioRouter();) and another one in ebay.js (export const router = createCheerioRouter();). The fastest thing that comes to my mind is to have the line export const router = createCheerioRouter(); in main.js, and in both the ebay and amazon files do import { router } from './main.js'.
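A sketch of that layout, with one caveat: main.js still has to import './amazon.js' and './ebay.js' so their addHandler calls actually run, and because those files would then import the router back from main.js, the modules become circular; depending on evaluation order this can surface as "Cannot access 'router' before initialization". (The import { crawler } from './main.js' in the site files is also unnecessary, since crawlee already passes crawler into each handler's context.) A variant that sidesteps the cycle keeps the router in its own small module; routes.js here is a hypothetical file name:

// routes.js - the only place a router is created
import { createCheerioRouter } from 'crawlee';
export const router = createCheerioRouter();

// amazon.js - registers its handlers on the shared instance
import { router } from './routes.js';
router.addHandler('AMAZON', async ({ $, crawler }) => { /* ...as in the question... */ });
router.addHandler('AMAZON_PRODUCT', async ({ $, request }) => { /* ... */ });

// ebay.js - same pattern with the EBAY labels
import { router } from './routes.js';
router.addHandler('EBAY', async ({ $, crawler }) => { /* ... */ });
router.addHandler('EBAY_PRODUCT', async ({ $, request }) => { /* ... */ });

// main.js - evaluate the site modules for their side effects, then crawl
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';
import './amazon.js';
import './ebay.js';

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(startUrls); // startUrls as defined in the question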
