How to prevent scrape if the URL is already in the dataset?
Should I check my dataset myself or is there some configuration setting that I should be aware of?
Cheers!
Crawlee doesn't allow duplicate requests by default. Is your crawler visiting the same URLs multiple times?
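(A minimal sketch of that within-run behaviour, assuming a CheerioCrawler and placeholder example.com URLs: the request queue keys requests by their uniqueKey, which defaults to the normalized URL, so a URL enqueued twice in the same run is only processed once.)

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`); // runs once per unique URL
    },
});

await crawler.run([
    'https://example.com/q/1',
    'https://example.com/q/1', // dropped: same uniqueKey as the line above
    'https://example.com/q/2',
]);
```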
Yep. But not in the same session.
If I accidentally pass the same link again in a new session, it gets stored in storage again:
storage/000023.json
and again, some time later, in a new session:
storage/009978.json
The data is identical, so there is no need to fetch it again.
Are those obtained from a GET or a POST request?
As far as I know it's Cheerio, and these are GET requests.
What is your definition of a session? Because you are definitely not using the session pool from Crawlee.
I run Crawlee from the terminal.
And? :))
I don't understand
Here is my full code.
I call this as
and this scrapes /q/{id} from 0 to 1000. If I pass the args 200 1000, it scrapes /q/{id} from 200 to 1200.
Sometimes I may make a mistake and enter the wrong arguments (in the example above, from 200 to 1000). I expected that the datastore would not re-create records for the same links.
Also, thanks for your patience 🙂
The data store doesn't have deduplication.
So you need to build this on your own.
If you run the same spider 100 times with the same URLs, it will write the same data 100 times.
Since you are combining multiple runs, you need to create your own deduplication system.
The most basic one would be to keep a record of the URLs that have been "used" in a JSON file, and every time you start a new run, read that JSON file first so you can deduplicate.
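(A minimal sketch of that JSON-file approach, assuming a CheerioCrawler; the seen-urls.json file name, the buildUrls() helper, and the example.com domain are placeholders, not taken from the original code.)

```ts
import { readFile, writeFile } from 'node:fs/promises';
import { CheerioCrawler, Dataset } from 'crawlee';

const SEEN_FILE = 'seen-urls.json'; // hypothetical file name

// Load the set of URLs handled in previous runs (empty on the first run).
async function loadSeen(): Promise<Set<string>> {
    try {
        return new Set(JSON.parse(await readFile(SEEN_FILE, 'utf8')));
    } catch {
        return new Set();
    }
}

const seen = await loadSeen();

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        // ...extract whatever fields you need here...
        await Dataset.pushData({ url: request.url, title: $('title').text() });
        seen.add(request.url);
    },
});

// Hypothetical helper producing the /q/{id} URLs for the given range.
const buildUrls = (from: number, to: number) =>
    Array.from({ length: to - from }, (_, i) => `https://example.com/q/${from + i}`);

// Skip anything already scraped in an earlier run, then crawl the rest.
await crawler.run(buildUrls(0, 1000).filter((url) => !seen.has(url)));

// Persist the updated set so the next run can deduplicate against it.
await writeFile(SEEN_FILE, JSON.stringify([...seen]));
```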
More advanced setups would involve a minimal database, reading and writing the seen URLs from there.
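(A sketch of that database variant, using SQLite through better-sqlite3 purely as an example; the table and function names are made up for illustration.)

```ts
import Database from 'better-sqlite3';

const db = new Database('scraped-urls.db');
db.exec('CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)');

const markSeen = db.prepare('INSERT OR IGNORE INTO seen (url) VALUES (?)');
const isSeen = db.prepare('SELECT 1 FROM seen WHERE url = ?');

// Returns true the first time a URL is passed in and false on repeats,
// so the caller can skip enqueueing (or pushing) duplicates.
export function shouldScrape(url: string): boolean {
    if (isSeen.get(url)) return false;
    markSeen.run(url);
    return true;
}
```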
Oh! I understand. Thank you so much!
No problem. Anytime