vicious-gold · 11mo ago

Anyone have any example scraping multiple different websites?

The structure I'm using doesn't look like the best approach. I'm basically creating several routers and then doing something like this:
import { PlaywrightCrawler, Dataset } from "crawlee";

const crawler = new PlaywrightCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: async (ctx) => {
        if (ctx.request.url.includes("url1")) {
            await url1Router(ctx);
        }

        if (ctx.request.url.includes("url2")) {
            await url2Router(ctx);
        }

        if (ctx.request.url.includes("url3")) {
            await url3Router(ctx);
        }

        await Dataset.exportToJSON("data.json");
    },

    // Comment this option to scrape the full website.
    // maxRequestsPerCrawl: 20,
});
This does not seem correct. Does anyone have a better way?
Marco · 11mo ago
You can use Crawlee's Router: https://crawlee.dev/api/playwright-crawler/function/createPlaywrightRouter. Create a route for each URL, then use labels to identify them.
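For reference, a minimal sketch of that pattern. The URLs and label names below are placeholders, not anything from your project:

import { PlaywrightCrawler, createPlaywrightRouter } from "crawlee";

const router = createPlaywrightRouter();

// One handler per label; a handler only runs for requests carrying its label.
router.addHandler("site1", async ({ page, pushData }) => {
    await pushData({ label: "site1", title: await page.title() });
});

router.addHandler("site2", async ({ page, pushData }) => {
    await pushData({ label: "site2", title: await page.title() });
});

// Requests without a label end up here.
router.addDefaultHandler(async ({ request, log }) => {
    log.info(`Unlabelled request: ${request.url}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

// Attach a label to each start URL so the router can dispatch it.
await crawler.run([
    { url: "https://site1.example.com/page", label: "site1" },
    { url: "https://site2.example.com/page", label: "site2" },
]);

With this setup you can also export the dataset once after crawler.run() finishes rather than on every request, as in your snippet.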
vicious-gold (OP) · 11mo ago
@Marco, how far is that from what I'm already doing there? It seems like somewhere I'll still have to do it. In the example above I did create a router per URL: url1Router and url2Router are defined on a per-URL basis. Am I wrong?
Marco · 11mo ago
It's actually very similar. Routes should be defined depending on your needs, so if you need a route per URL, just do that.
vicious-gold (OP) · 11mo ago
My concern is that I have multiple websites, not just different URLs. Each website might have two URLs that I have to scrape independently. Is that how you would do it, @Marco? Would you have multiple routers?
Marco · 10mo ago
Oh, I see. I think I would still use one router, with labels such as "website1-page2", to keep things simple; a function called at the beginning would assign the correct label to each request based on the URL.
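Something along these lines, as a sketch. The hostnames, paths, and label scheme ("websiteN-pageN") are made-up placeholders, and labelRequest is a hypothetical helper you'd adapt to your real sites:

import { PlaywrightCrawler, createPlaywrightRouter } from "crawlee";
import type { RequestOptions } from "crawlee";

// Map a raw URL to a labelled request before it is enqueued.
function labelRequest(url: string): RequestOptions {
    const { hostname, pathname } = new URL(url);
    if (hostname.endsWith("website1.com")) {
        return { url, label: pathname.startsWith("/products") ? "website1-page1" : "website1-page2" };
    }
    if (hostname.endsWith("website2.com")) {
        return { url, label: "website2-page1" };
    }
    return { url }; // unlabelled requests fall through to the default handler
}

const router = createPlaywrightRouter();

router.addHandler("website1-page1", async ({ page, pushData }) => {
    await pushData({ label: "website1-page1", title: await page.title() });
});

// ...one addHandler per remaining label...

router.addDefaultHandler(async ({ request, log }) => {
    log.warning(`No label matched for ${request.url}`);
});

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run(
    [
        "https://website1.com/products/1",
        "https://website1.com/reviews/1",
        "https://website2.com/",
    ].map(labelRequest),
);

That keeps a single crawler and a single router, and all the per-website branching lives in one small labelling function.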
