Alex I
Alex I•16mo ago

How to prevent scraping if the URL is already in the dataset?

Should I check my dataset myself or is there some configuration setting that I should be aware of? Cheers!
14 Replies
NeoNomade
NeoNomade•16mo ago
Crawlee doesn't allow duplicates by default. Is your crawler visiting the same URLs multiple times?
Alex I
Alex IOP•16mo ago
Yep. But not in the same session.
const config = new Configuration({
    defaultDatasetId: 'AllData',
});

const crawler = new CheerioCrawler(
    {
        requestHandler: router,
        maxConcurrency: 20,
    },
    config
);
If I accidentally pass the same link again in a new session, it is stored in storage again, e.g. storage/000023.json:
{
    "url": "https://my_host/q/70",
    "title": "title1",
    "acceptedAnswer": 1,
    "tags": "email",
    "views": "2856",
    "date": "2010-09-03",
    "complexity": "5",
    "answerCount": "1"
}
and again later, in a new session, as storage/009978.json:
{
    "url": "https://my_host/q/70",
    "title": "title1",
    "acceptedAnswer": 1,
    "tags": "email",
    "views": "2856",
    "date": "2010-09-03",
    "complexity": "5",
    "answerCount": "1"
}
The data is identical; there is no need to obtain it again.
NeoNomade
NeoNomade•16mo ago
Are those obtained from a GET or a POST request?
Alex I
Alex IOP•16mo ago
As far as I know, it's Cheerio, so they are GET requests.
NeoNomade
NeoNomade•16mo ago
What is your definition of a session? Because you are definitely not using Crawlee's session pool.
Alex I
Alex IOP•16mo ago
I run Crawlee from the terminal.
NeoNomade
NeoNomade•16mo ago
And? :)) I don't understand.
Alex I
Alex IOP•16mo ago
Here is my full code:
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes';

const startId = process.argv[2];
const count = process.argv[3];

await main(Number(startId), Number(count) || 1000);

export default async function main(startId: number, count: number) {
    const config = new Configuration({
        defaultDatasetId: 'AllData',
    });

    const crawler = new CheerioCrawler(
        {
            requestHandler: router,
            maxConcurrency: 20,
        },
        config
    );

    if (startId >= 0) {
        const urls = Array.from(Array(count).keys()).map((id) => `https://my_host/q/${startId + id}`);

        await crawler.run(urls);
    }
}
I call it like this:
$> npx tsx ./src/main.ts 0 1000
and this scrapes /q/{id} from 0 to 1000. If I pass the args 200 1000, it scrapes /q/{id} from 200 to 1200. Sometimes I make a mistake and enter the wrong arguments (in the example above, the range from 200 to 1000 overlaps). I expected that the dataset would not re-create records for the same links. Also, thanks for your patience 🙂
NeoNomade
NeoNomade•16mo ago
The dataset doesn't have deduplication, so you need to build that on your own. If I run the same spider 100 times with the same URLs, it will write the same data 100 times. Since you are combining multiple runs, you need to create your own deduplication system.
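One way to build that yourself (not a built-in Crawlee feature; just a sketch using the public Dataset API, where the getAlreadyScrapedUrls helper and the assumption that every stored item has a url field are mine) is to read the existing 'AllData' dataset at startup and filter the generated URLs against it:

import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

import { router } from './routes';

// Hypothetical helper: collect the URLs already stored in the 'AllData' dataset.
async function getAlreadyScrapedUrls(): Promise<Set<string>> {
    const dataset = await Dataset.open('AllData');
    const seen = new Set<string>();
    // forEach iterates over every item stored in the dataset.
    await dataset.forEach(async (item) => {
        if (typeof item.url === 'string') seen.add(item.url);
    });
    return seen;
}

export default async function main(startId: number, count: number) {
    const config = new Configuration({ defaultDatasetId: 'AllData' });
    const crawler = new CheerioCrawler({ requestHandler: router, maxConcurrency: 20 }, config);

    if (startId >= 0) {
        const seen = await getAlreadyScrapedUrls();
        const urls = Array.from(Array(count).keys())
            .map((id) => `https://my_host/q/${startId + id}`)
            // Drop URLs that already have a record from a previous run.
            .filter((url) => !seen.has(url));

        await crawler.run(urls);
    }
}

Reading the whole dataset up front is fine while it is small; for large datasets a separate record of processed URLs, as suggested below, is cheaper.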
MEE6
MEE6•16mo ago
@NeoNomade just advanced to level 12! Thanks for your contributions! 🎉
NeoNomade
NeoNomade•16mo ago
The most basic one would be to keep a record of the URLs that have already been "used" in a JSON file, and every time you start a new run, read that JSON file first so you can deduplicate. More advanced setups would involve a minimal database and reading from and writing to it.
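A minimal sketch of that idea, assuming a seen-urls.json file in the project root (the file name and helper names are made up, not part of Crawlee):

import { readFile, writeFile } from 'node:fs/promises';

// Hypothetical location of the record of already-scraped URLs.
const SEEN_FILE = './seen-urls.json';

// Load the URLs recorded by previous runs (empty set on the first run).
async function loadSeenUrls(): Promise<Set<string>> {
    try {
        const raw = await readFile(SEEN_FILE, 'utf-8');
        return new Set(JSON.parse(raw) as string[]);
    } catch {
        return new Set();
    }
}

// Persist the updated set once the run has finished.
async function saveSeenUrls(seen: Set<string>): Promise<void> {
    await writeFile(SEEN_FILE, JSON.stringify([...seen], null, 2));
}

In main() you would filter the generated URLs with loadSeenUrls() before crawler.run(), add the URLs that were actually crawled to the set, and call saveSeenUrls() afterwards.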
Alex I
Alex IOP•16mo ago
Oh! I understand. Thank you so much!
MEE6
MEE6•16mo ago
@Alex I just advanced to level 1! Thanks for your contributions! 🎉
NeoNomade
NeoNomade•16mo ago
No problem. Anytime
