Alex I
Alex I•16mo ago

How to prevent scraping if the URL is already in the dataset?

Should I check my dataset myself or is there some configuration setting that I should be aware of? Cheers!
14 Replies
NeoNomade
NeoNomade•16mo ago
Crawlee doesn't allow duplicates by default. Is your crawler visiting the same URLs multiple times?
Alex I
Alex IOP•16mo ago
Yep. But not in the same session.
const config = new Configuration({
    defaultDatasetId: 'AllData',
});

const crawler = new CheerioCrawler(
    {
        requestHandler: router,
        maxConcurrency: 20,
    },
    config
);
If I accidentally pass the same link again in a new session, it is stored in storage again, e.g. storage/000023.json:
{
    "url": "https://my_host/q/70",
    "title": "title1",
    "acceptedAnswer": 1,
    "tags": "email",
    "views": "2856",
    "date": "2010-09-03",
    "complexity": "5",
    "answerCount": "1"
}
and again later, in a new session, as storage/009978.json:
{
    "url": "https://my_host/q/70",
    "title": "title1",
    "acceptedAnswer": 1,
    "tags": "email",
    "views": "2856",
    "date": "2010-09-03",
    "complexity": "5",
    "answerCount": "1"
}
The data is identical; there is no need to obtain it again.
NeoNomade
NeoNomade•16mo ago
Are those obtained from a GET or a POST request?
Alex I
Alex IOP•16mo ago
As far as I know, it's Cheerio, so they are GET requests.
NeoNomade
NeoNomade•16mo ago
What is your definition of a session? Because you are definitely not using Crawlee's session pool.
Alex I
Alex IOP•16mo ago
I run Crawlee from the terminal.
NeoNomade
NeoNomade•16mo ago
And? :)) I don't understand.
Alex I
Alex IOP•16mo ago
Here is my full code:
import { CheerioCrawler, Configuration } from 'crawlee';

import { router } from './routes';

const startId = process.argv[2];
const count = process.argv[3];

await main(Number(startId), Number(count) || 1000);

export default async function main(startId: number, count: number) {
    const config = new Configuration({
        defaultDatasetId: 'AllData',
    });

    const crawler = new CheerioCrawler(
        {
            requestHandler: router,
            maxConcurrency: 20,
        },
        config
    );

    if (startId >= 0) {
        const urls = Array.from(Array(count).keys()).map((id) => `https://my_host/q/${startId + id}`);

        await crawler.run(urls);
    }
}
I call it like this:
$> npx tsx ./src/main.ts 0 1000
and this scrapes /q/{id} from 0 to 1000. If I pass the args 200 1000, it scrapes /q/{id} from 200 to 1200. Sometimes I make a mistake and enter the wrong arguments (in the example above, the range from 200 to 1000 overlaps). I expected that the dataset would not re-create records for the same links. Also, thanks for your patience 🙂
NeoNomade
NeoNomade•16mo ago
The dataset doesn't have deduplication, so you need to build that on your own. If I run the same spider 100 times with the same URLs, it will write the same data 100 times. Since you are combining multiple runs, you need to create your own deduplication system.
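One way to build that yourself (not a built-in Crawlee feature; just a sketch using the public Dataset API, where the getAlreadyScrapedUrls helper and the assumption that every stored item has a url field are mine) is to read the existing 'AllData' dataset at startup and filter the generated URLs against it:

import { CheerioCrawler, Configuration, Dataset } from 'crawlee';

import { router } from './routes';

// Hypothetical helper: collect the URLs already stored in the 'AllData' dataset.
async function getAlreadyScrapedUrls(): Promise<Set<string>> {
    const dataset = await Dataset.open('AllData');
    const seen = new Set<string>();
    // forEach iterates over every item stored in the dataset.
    await dataset.forEach(async (item) => {
        if (typeof item.url === 'string') seen.add(item.url);
    });
    return seen;
}

export default async function main(startId: number, count: number) {
    const config = new Configuration({ defaultDatasetId: 'AllData' });
    const crawler = new CheerioCrawler({ requestHandler: router, maxConcurrency: 20 }, config);

    if (startId >= 0) {
        const seen = await getAlreadyScrapedUrls();
        const urls = Array.from(Array(count).keys())
            .map((id) => `https://my_host/q/${startId + id}`)
            // Drop URLs that already have a record from a previous run.
            .filter((url) => !seen.has(url));

        await crawler.run(urls);
    }
}

Reading the whole dataset up front is fine while it is small; for large datasets a separate record of processed URLs, as suggested below, is cheaper.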
MEE6
MEE6•16mo ago
@NeoNomade just advanced to level 12! Thanks for your contributions! 🎉
NeoNomade
NeoNomade•16mo ago
The most basic one would be to keep a record of the URLs that have already been "used" in a JSON file, and every time you start a new run, read that JSON file first so you can deduplicate. More advanced setups would involve a minimal database and reading from and writing to it.
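A minimal sketch of that idea, assuming a seen-urls.json file in the project root (the file name and helper names are made up, not part of Crawlee):

import { readFile, writeFile } from 'node:fs/promises';

// Hypothetical location of the record of already-scraped URLs.
const SEEN_FILE = './seen-urls.json';

// Load the URLs recorded by previous runs (empty set on the first run).
async function loadSeenUrls(): Promise<Set<string>> {
    try {
        const raw = await readFile(SEEN_FILE, 'utf-8');
        return new Set(JSON.parse(raw) as string[]);
    } catch {
        return new Set();
    }
}

// Persist the updated set once the run has finished.
async function saveSeenUrls(seen: Set<string>): Promise<void> {
    await writeFile(SEEN_FILE, JSON.stringify([...seen], null, 2));
}

In main() you would filter the generated URLs with loadSeenUrls() before crawler.run(), add the URLs that were actually crawled to the set, and call saveSeenUrls() afterwards.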
Alex I
Alex IOP•16mo ago
Oh! I understand. Thank you so much!
MEE6
MEE6•16mo ago
@Alex I just advanced to level 1! Thanks for your contributions! 🎉
NeoNomade
NeoNomade•16mo ago
No problem. Anytime
