automatic-azure (2y ago)

Adding multiple requestHandler as a workflow

I am building a crawler that scrapes social links from websites. Given a list of URLs and names of business websites, the workflow I want is:
- It first goes to the homepage and looks for socials there.
- It then enqueues the links whose path includes something like contact or about, and crawls those.
- It then moves on to the next URL.

What I want to add: after it finishes scraping the homepage and the contact or about pages (if any), I want to run a Google search step. I have another class that takes page as a parameter and has a function getGoogleResults, which returns the top search results for a business name. I want to invoke a requestHandler that calls getGoogleResults(query) and adds the results to the Dataset.

I also want the crawler to return the results together instead of saving each page's results individually. The problem with saving them individually is duplicates. For example, if the socials are in the footer and the site also has an about page, the same socials are saved twice, and I want to remove those duplicates.
2 Replies
automatic-azure (OP, 2y ago)
My current requestHandler code:
router.addDefaultHandler(async ({ request, page, enqueueLinks, log, parseWithCheerio }) => {
    log.info(`Finding socials at: ${request.url}`);
    const $ = await parseWithCheerio();

    // (extraction of `emails` and `socials` from the page is omitted in this snippet)

    const url = new URL(request.url.replace(/\/$/, ''));
    const images = [];

    // Deduplicate what was found on this page only
    const result = {
        emails: [...new Set(emails)],
        socials: [...new Set(socials)],
        images,
    };
    log.info(`Found ${result.emails.length} emails and ${result.socials.length} socials.`);
    if (result.emails.length > 0 || result.socials.length > 0) {
        await Dataset.pushData(result);
    }
    // Use the origin here: interpolating the URL object itself yields the full href,
    // e.g. "https://example.com//contact*" on the homepage
    await enqueueLinks({
        strategy: EnqueueStrategy.SameDomain,
        globs: [`${url.origin}/contact*`, `${url.origin}/about*`],
    });
});
@Helper
Lukas Krivka (2y ago)
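You can always enqueue the next request from the previous one, so the crawl goes one step at a time: home -> about -> contact -> google. You can pass the intermediate data in request.userData and only push to the Dataset in the last step. Because everything for one business then ends up in a single object, you can also deduplicate it before pushing.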
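
A minimal sketch of that chaining idea, under some assumptions: the helper names `mergeUnique` and `buildNextRequest`, the labels `SITE` and `GOOGLE`, the `pending` and `collected` fields in userData, and the search URL format are all hypothetical and not from the thread. The homepage handler would need to collect the contact/about links into `pending` itself instead of calling enqueueLinks. The two helpers are plain functions, and the commented handlers only outline how they might be wired into a Crawlee router.

```javascript
// Merge newly found values into the accumulated ones, dropping duplicates.
function mergeUnique(previous = [], found = []) {
    return [...new Set([...previous, ...found])];
}

// Hypothetical helper: decide which request comes next in the chain.
// `pending` holds the contact/about URLs still to visit for this site,
// `collected` holds everything found so far ({ emails, socials }).
function buildNextRequest({ name, pending = [], collected }) {
    if (pending.length > 0) {
        const [url, ...rest] = pending;
        return { url, label: 'SITE', userData: { name, pending: rest, collected } };
    }
    // Nothing left on the site: finish with the search step.
    // The search URL format is an assumption for illustration.
    const url = `https://www.google.com/search?q=${encodeURIComponent(name)}`;
    return { url, label: 'GOOGLE', userData: { name, pending: [], collected } };
}

// Rough outline of the handlers (not run here; assumes Crawlee's router,
// crawler.addRequests, request.userData and Dataset.pushData):
//
// router.addHandler('SITE', async ({ request, crawler }) => {
//     const { name, pending, collected } = request.userData;
//     // `emails` and `socials` are whatever was extracted from this page
//     const merged = {
//         emails: mergeUnique(collected.emails, emails),
//         socials: mergeUnique(collected.socials, socials),
//     };
//     await crawler.addRequests([buildNextRequest({ name, pending, collected: merged })]);
// });
//
// router.addHandler('GOOGLE', async ({ request, page }) => {
//     const { name, collected } = request.userData;
//     // call getGoogleResults(name) here, however your class exposes it
//     await Dataset.pushData({ name, ...collected, googleResults }); // one push per business
// });
```

Since each business produces exactly one chain of requests and one pushData call at the end, socials that appear both in the footer and on the about page are merged by `mergeUnique` before anything is saved.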
