like-gold•3y ago

How to scrape emails one level deep and return the results from an API

The main question is probably how to send the response correctly without saving the data in the Dataset, for example when the crawler runs behind an Express server.
3 Replies
Lukas Krivka•3y ago
I'm not sure what exactly you are asking about. You can scrape all pages at depth 1 by enqueueing all 'a[href]' elements from the home page and then extracting emails from each page with https://crawlee.dev/api/utils/namespace/social#emailsFromText
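To make the email-extraction step concrete, here is a minimal sketch of what pulling emails out of page text can look like. The `extractEmails` function and its regex are simplified stand-ins written for illustration; they are not the pattern crawlee uses internally, and in a real crawler you would call the utility linked above instead.

```javascript
// Simplified stand-in for an email extractor. The regex is an assumption
// made for this sketch, not the pattern crawlee uses internally.
const EMAIL_REGEX = /[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}/gi;

function extractEmails(text) {
    // match() with a global regex returns null when nothing is found.
    const matches = text.match(EMAIL_REGEX) ?? [];
    // Lowercase and de-duplicate while keeping first-seen order.
    return [...new Set(matches.map((email) => email.toLowerCase()))];
}

const sample = 'Contact sales@example.com or Support@Example.com, again sales@example.com.';
console.log(extractEmails(sample));
```

Lowercasing and de-duplicating are choices made for this sketch only; the library utility may behave differently.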
like-gold (OP)•3y ago
I ended up with this scraper, but I'm not sure how correct it is 🙂
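import { CheerioCrawler, Dataset, social } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ body, enqueueLinks, request }) {
        // Only pages without the label (the start page) enqueue links,
        // so the crawl stops one level below the start URL.
        if (request.label !== 'TWICE') {
            await enqueueLinks({ label: 'TWICE' });
        }
        // body can also be a Buffer, so only string bodies are parsed.
        if (typeof body === 'string') {
            // Extract emails, phone numbers and social profile links from the HTML.
            const handles = social.parseHandlesFromHtml(body);
            await Dataset.pushData({ handles, url: request.url });
        } else {
            throw new Error('Body is not a string');
        }
    },
});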
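To illustrate why the label check limits the crawl to one level, here is a small simulation that needs no network access. The `site` map and the `crawlOneLevel` function are hypothetical and exist only for this sketch; the queue stands in for the crawler's request queue.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
const site = {
    '/': ['/about', '/contact'],
    '/about': ['/team'],
    '/contact': [],
    '/team': ['/careers'],
};

function crawlOneLevel(startUrl) {
    const queue = [{ url: startUrl, label: undefined }];
    const seen = new Set([startUrl]);
    const visited = [];

    while (queue.length > 0) {
        const request = queue.shift();
        visited.push(request.url);

        // Only unlabeled (start) pages enqueue links, mirroring the
        // `request.label !== 'TWICE'` check in the handler above.
        if (request.label !== 'TWICE') {
            for (const link of site[request.url] ?? []) {
                if (!seen.has(link)) {
                    seen.add(link);
                    queue.push({ url: link, label: 'TWICE' });
                }
            }
        }
    }
    return visited;
}

console.log(crawlOneLevel('/')); // [ '/', '/about', '/contact' ]
```

Pages reached through a labeled request are still processed, but they never enqueue their own links, so '/team' is not visited.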
I also need to return the data in the response to the HTTP request, but I don't know how to do it correctly. One idea is to replace await Dataset.pushData({ handles, url: request.url }); with something like res.json({ handles, url: request.url }), that is, inside the requestHandler(). Another idea:
app.get('/', async (req, res) => {
    await crawler.run([req.query.url]);
    const { items } = await Dataset.getData();
    res.json(items);
});
Lukas Krivka•3y ago
The last idea looks good, does it do what you need?
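One thing to watch with the route-handler idea: as far as I can tell, Dataset.getData() reads the shared default dataset, so items pushed while serving earlier requests in the same process may show up in later responses. A possible way around that, sketched below under assumptions, is to collect results in an array that is local to each request. Here `runCrawl` and `fakeCrawl` are hypothetical names: `runCrawl` stands in for creating and running a crawler whose requestHandler calls the callback instead of Dataset.pushData().

```javascript
// `runCrawl` is a hypothetical stand-in for creating and running a crawler.
// It calls `onPage` once per scraped page, where the real requestHandler
// would otherwise call Dataset.pushData().
async function scrapeForResponse(startUrl, runCrawl) {
    const results = []; // local to this call, so requests do not share state
    await runCrawl(startUrl, (pageResult) => results.push(pageResult));
    return results;
}

// Fake crawl used only to make the sketch runnable without network access.
async function fakeCrawl(startUrl, onPage) {
    onPage({ url: startUrl, emails: ['info@example.com'] });
    onPage({ url: `${startUrl}/contact`, emails: [] });
}

scrapeForResponse('https://example.com', fakeCrawl).then((items) => {
    // In an Express route this is where res.json(items) would go.
    console.log(items);
});
```

Sending the response once, after the crawl has finished, also avoids calling res.json() for every page, which would fail after the first call because a response can only be sent once.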
