Apify Discord Mirror

Updated 5 months ago

How to scrape emails at one level of nesting and return the results via an API

At a glance

The main question is how to send the answer correctly without saving the data in the Dataset. A community member suggests using res.json() to return the data instead of saving it to the Dataset. Another community member proposes an approach where the crawler runs on a URL passed in the request, and the response returns the items from the Dataset. However, there is no explicitly marked answer in the comments.

The main question is probably how to send the response correctly instead of saving the data to the Dataset, for example using the same Express app.
3 comments
I'm not sure exactly what you are asking about. You can scrape all pages at depth 1 by enqueueing all 'a[href]' elements from the home page and then using https://crawlee.dev/api/utils/namespace/social#emailsFromText
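A minimal sketch of what emailsFromText does (the input string here is made up):
Plain Text
import { social } from 'crawlee';

// Returns the e-mail addresses found anywhere in the given text.
const emails = social.emailsFromText('Write to info@example.com or sales@example.com');
// -> ['info@example.com', 'sales@example.com']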
I've put together a scraper like this, though I don't know how correct it is 🙂
Plain Text
import { CheerioCrawler, Dataset, social } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ body, enqueueLinks, request }) {
    // Only the start page enqueues links, so the crawl goes one level deep.
    if (request.label !== 'TWICE') {
      await enqueueLinks({ label: 'TWICE' });
    }
    if (typeof body === 'string') {
      // Extract e-mail addresses and other social handles from the raw HTML.
      const handles = social.parseHandlesFromHtml(body);
      await Dataset.pushData({ handles, url: request.url });
    } else {
      throw new Error('Body is not a string');
    }
  }
});
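For reference, parseHandlesFromHtml returns an object of arrays (emails, phones, twitters, and so on), so if you only need the e-mails you can destructure them directly. A small sketch, not from the original message:
Plain Text
// Keep only the e-mail addresses out of all the parsed handles.
const { emails } = social.parseHandlesFromHtml(body);
await Dataset.pushData({ emails, url: request.url });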


I also need to return the data in the response to the request, but I don't know how to do it correctly.
One idea is to replace await Dataset.pushData({ handles, url: request.url }); with something like res.json({ handles, url: request.url }), that is, inside the requestHandler().
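A minimal sketch of that first idea, assuming the crawler is created inside an Express route. Since requestHandler runs once per crawled page, calling res.json() there would try to respond several times; one way around that is to collect the results in an array and send a single response after the run (the route, port, and label name are assumptions):
Plain Text
import express from 'express';
import { CheerioCrawler, social } from 'crawlee';

const app = express();

app.get('/', async (req, res) => {
  const results = [];
  const crawler = new CheerioCrawler({
    async requestHandler({ body, enqueueLinks, request }) {
      if (request.label !== 'TWICE') {
        await enqueueLinks({ label: 'TWICE' });
      }
      // Collect in memory instead of pushing to the Dataset.
      results.push({ handles: social.parseHandlesFromHtml(body.toString()), url: request.url });
    },
  });
  await crawler.run([String(req.query.url)]);
  // A single response once the whole crawl has finished.
  res.json(results);
});

app.listen(3000);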
Another idea:
Plain Text
app.get('/', async (req, res) => {
  // Crawl the URL passed as a query parameter, e.g. GET /?url=https://example.com
  await crawler.run([req.query.url]);
  // Read everything the crawler pushed to the default Dataset.
  const { items } = await Dataset.getData();
  res.json(items);
});
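One caveat worth checking (my assumption, not something from the thread): the default Dataset keeps its items between runs in the same process, so each request could also return items from earlier crawls. Opening and dropping the dataset after reading is one way to reset it:
Plain Text
app.get('/', async (req, res) => {
  await crawler.run([String(req.query.url)]);
  // Open the default dataset, read its items, then drop it so the
  // next request starts from an empty dataset.
  const dataset = await Dataset.open();
  const { items } = await dataset.getData();
  await dataset.drop();
  res.json(items);
});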
The last idea looks good. Does it do what you need?