I ended up with the scraper below, but I'm not sure how correct it is 🙂
import { CheerioCrawler, Dataset, social } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ body, enqueueLinks, request }) {
        // Only follow links from the start page: enqueued pages carry the
        // 'TWICE' label and do not enqueue further links themselves.
        if (request.label !== 'TWICE') {
            await enqueueLinks({ label: 'TWICE' });
        }
        if (typeof body === 'string') {
            // Extract social handles from the raw HTML and store them per URL.
            const handles = social.parseHandlesFromHtml(body);
            await Dataset.pushData({ handles, url: request.url });
        } else {
            throw new Error('Body is not a string');
        }
    },
});
I also need to return the scraped data in the response to the HTTP request, but I don't know how to do that correctly.
One idea is to replace
await Dataset.pushData({ handles, url: request.url });
with something like
res.json({ handles, url: request.url })
that is, to send the response from inside requestHandler().
Another idea is to run the crawler inside the route handler and return the dataset contents:
app.get('/', async (req, res) => {
    await crawler.run([req.query.url]);
    const { items } = await Dataset.getData();
    res.json(items);
});
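
To make the difference between the two ideas concrete: requestHandler() runs once per crawled page, while an HTTP response can only be sent once. The second idea therefore looks closer to workable. Here is a rough sketch of that pattern, not a tested Crawlee setup. The names collectHandles, runCrawl and fakeCrawl are hypothetical. In a real version, runCrawl would presumably build the CheerioCrawler and call onResult from requestHandler() instead of Dataset.pushData().

```javascript
// Sketch: gather results per HTTP request, then respond exactly once.
// `runCrawl` is a hypothetical wrapper, not a Crawlee API: it is assumed to
// crawl from `startUrl` and call `onResult` once for every page it handles.
async function collectHandles(runCrawl, startUrl) {
    const items = []; // local to this call, so separate requests do not share results
    await runCrawl(startUrl, (item) => items.push(item));
    return items;
}

// Stand-in crawl (made-up data) so the sketch runs without a network.
async function fakeCrawl(startUrl, onResult) {
    onResult({ url: startUrl, handles: { emails: ['a@example.com'] } });
    onResult({ url: `${startUrl}/about`, handles: { emails: [] } });
}

// In an Express-style route (assumed), the response is sent after the run:
//   app.get('/', async (req, res) => {
//       res.json(await collectHandles(runCrawl, req.query.url));
//   });
```

Collecting into a local array instead of the shared default Dataset is a design choice in this sketch, so that each request only returns what its own run produced.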