How to scrap emails to one level of nesting and give results to API

At a glance

The main question is how to send the answer correctly without saving the data in the Dataset. A community member suggests using res.json() to return the data instead of saving it to the Dataset. Another community member proposes an approach where the crawler runs on a URL passed in the request, and the response returns the items from the Dataset. However, there is no explicitly marked answer in the comments.

Romja

The main question is probably how to return the result in the response correctly without saving the data in the Dataset, for example using Express.

3 comments

Lukas Krivka

I'm not sure what exactly you are asking about. You can scrape all pages at depth 1 by enqueueing all 'a[href]' elements from the home page and then using https://crawlee.dev/api/utils/namespace/social#emailsFromText
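
To make the "depth 1" idea concrete, here is a minimal dependency-free sketch. It is not Crawlee's implementation: it swaps in plain regular expressions for the link and email extraction, and it restricts the crawl to links on the same hostname, which is an assumption made here. The names extractEmails, extractSameHostLinks, crawlOneLevel and the injected fetchHtml function are all hypothetical.

```javascript
// Illustrative patterns only; real-world HTML and email formats are messier.
const EMAIL_RE = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi;
const HREF_RE = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;

// Return unique, lower-cased email addresses found in a string.
function extractEmails(text) {
  const matches = text.match(EMAIL_RE) ?? [];
  return [...new Set(matches.map((m) => m.toLowerCase()))];
}

// Return unique absolute http(s) links that share the hostname of baseUrl.
function extractSameHostLinks(html, baseUrl) {
  const base = new URL(baseUrl);
  const links = new Set();
  for (const [, href] of html.matchAll(HREF_RE)) {
    let url;
    try {
      url = new URL(href, base); // resolves relative hrefs
    } catch {
      continue; // skip malformed hrefs
    }
    if (!/^https?:$/.test(url.protocol) || url.hostname !== base.hostname) continue;
    url.hash = '';
    links.add(url.href);
  }
  return [...links];
}

// Visit startUrl, then every same-host link found on it, and nothing deeper.
// fetchHtml is a hypothetical injected function: (url) => Promise<string>.
async function crawlOneLevel(startUrl, fetchHtml) {
  const start = new URL(startUrl).href;
  const startHtml = await fetchHtml(startUrl);
  const results = [{ url: start, emails: extractEmails(startHtml) }];
  const children = extractSameHostLinks(startHtml, startUrl).filter((u) => u !== start);
  for (const url of children) {
    const html = await fetchHtml(url);
    results.push({ url, emails: extractEmails(html) });
  }
  return results;
}
```

Links found on the child pages are never followed, which is what keeps the crawl at one level. In a Crawlee project the fetching and queueing would be handled by the crawler itself, with a request label or similar marker playing the role of the depth check.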

Romja

I ended up with a scraper like this. I don't know whether it's right 🙂

Plain Text

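import { CheerioCrawler, Dataset, social } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ body, enqueueLinks, request }) {
    // Only the start page has no label, so links are enqueued from it alone.
    // Pages labelled 'TWICE' are scraped but not followed (one level of nesting).
    if (request.label !== 'TWICE') {
      await enqueueLinks({ label: 'TWICE' });
    }
    if (typeof body === 'string') {
      // Extract contact handles from the raw HTML of the page.
      const handles = social.parseHandlesFromHtml(body);
      await Dataset.pushData({ handles, url: request.url });
    } else {
      throw new Error('Body is not a string');
    }
  }
});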

I also have to return the data in the response to the request, but I don't know how to do it correctly.
One idea is to replace await Dataset.pushData({ handles, url: request.url }); with something like res.json({ handles, url: request.url }), that is, inside the requestHandler().
Another idea:

Plain Text

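import express from 'express';

const app = express();

// `crawler` and `Dataset` come from the scraper above.
app.get('/', async (req, res) => {
  await crawler.run([req.query.url]);
  const { items } = await Dataset.getData();
  res.json(items);
});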

Lukas Krivka

The last idea looks good, does it do what you need?
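
One possible way to address the original "without saving to the Dataset" part is to keep the results in a per-request variable and send them straight back. The following is only a sketch under assumptions, not a confirmed answer from the thread: createEmailsHandler and runCrawl are hypothetical names, and runCrawl stands for whatever function performs the crawl and resolves with the collected items. The handler uses only the Express-style req.query, res.status() and res.json() already seen above.

```javascript
// Build an Express-style route handler that returns crawl results in the response.
// runCrawl is a hypothetical injected function: (url) => Promise<Array<object>>.
function createEmailsHandler(runCrawl) {
  return async function emailsHandler(req, res) {
    // Reject missing or non-http(s) URLs before starting a crawl.
    let parsed = null;
    try {
      parsed = new URL(String(req.query.url));
    } catch {
      parsed = null;
    }
    if (!parsed || !/^https?:$/.test(parsed.protocol)) {
      res.status(400).json({ error: 'Query parameter "url" must be an http(s) URL' });
      return;
    }
    try {
      // The items exist only inside this call, so nothing is written to shared storage
      // and concurrent requests cannot see each other's results.
      const items = await runCrawl(parsed.href);
      res.json(items);
    } catch (err) {
      const message = err && err.message ? err.message : String(err);
      res.status(500).json({ error: message });
    }
  };
}
```

With Express this could be registered as app.get('/', createEmailsHandler(runCrawl)). A Crawlee-based runCrawl might create a crawler for each request whose requestHandler pushes into a local array instead of calling Dataset.pushData(), then return that array once crawler.run() resolves. Whether that suits a given workload is a design choice to verify.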

Apify Discord Mirror

How to scrap emails to one level of nesting and give results to API