wise-white · 2y ago

Can't find info on the URL base of a crawler

How can I handle links in Crawlee (Playwright) that were already visited by the crawler? That is, how do I avoid repeating links it has already handled, based on what is persisted on disk? Is purgeRequestQueue: false sufficient? Do I just avoid purging the data that is already done, and the rest is handled automatically? For example, I'd like to crawl in chunks: the first run handles 50 URLs collected dynamically from the page during the crawl, and the second run skips those 50 and takes the next 50, and so on.
13 Replies
Saurav Jain · 2y ago
Hello @akephalos, someone from the team will answer your query soon. Thanks! :)
wise-white (OP) · 2y ago
thanks
lemurio · 2y ago
hey, try setting purgeOnStart to false in the global config: https://crawlee.dev/api/core/class/Configuration#:%7E:text=CRAWLEE_PURGE_ON_START
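A minimal sketch of what this can look like, assuming Crawlee v3; the start URL and the chunk size of 50 are placeholders matching the question, and the same setting can also be controlled via the CRAWLEE_PURGE_ON_START environment variable:
```ts
// Sketch only: Crawlee v3, hypothetical start URL.
import { Configuration, PlaywrightCrawler } from 'crawlee';

// Keep the request queue (and other default storages) on disk between runs,
// so requests that were already handled are not crawled again.
Configuration.getGlobalConfig().set('purgeOnStart', false);

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50, // roughly one "chunk" of 50 requests per run
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks(); // newly discovered links are deduplicated by uniqueKey
    },
});

await crawler.run(['https://example.com']); // hypothetical entry point
```
With the queue persisted, a second run picks up the next pending requests instead of starting over.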
wise-white (OP) · 2y ago
How can the links that failed and reached failedRequestHandler be revisited in that way? Is this done automatically?
HonzaS · 2y ago
If you do not purge the request queue, the crawler will just continue where it stopped before. Requests that reached failedRequestHandler are considered done, so they will not be revisited.
wise-white (OP) · 2y ago
How can I revisit them?
HonzaS · 2y ago
You can add them to the queue again in the failedRequestHandler, under a different uniqueKey of course.
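For illustration, a minimal sketch of what this re-enqueueing could look like; the `retry:` uniqueKey prefix and the start URL are assumptions for the example, not Crawlee conventions:
```ts
// Sketch only: re-adding exhausted requests under a fresh uniqueKey.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
    // Called after all automatic retries have been used up.
    async failedRequestHandler({ request, log }) {
        // Guard against re-enqueueing forever (hypothetical 'retry:' prefix).
        if (request.uniqueKey.startsWith('retry:')) return;
        log.warning(`Re-enqueueing failed request: ${request.url}`);
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `retry:${request.uniqueKey}`, // new key, so deduplication lets it through
            userData: request.userData,
        }]);
    },
});

await crawler.run(['https://example.com']); // hypothetical start URL
```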
wise-white (OP) · 2y ago
Thank you, I'll try it out.
wise-white (OP) · 2y ago
@HonzaS can I have two different datasets, one for successful and one for failed links, and then go back to the failed dataset and enqueue it again? How are the failed links meant to be handled? Just forget about them? Or are they revisited by some option I'm missing?
HonzaS · 2y ago
You can build almost any logic, so it is up to you what to do with failed links. They failed after all the retries for some reason. For example, you can put those failed requests into another named request queue, fix the crawler code so they will not fail again, and then use that failed-request queue as the input for another run. As far as I know, there is no out-of-the-box option to revisit failed requests. They are considered handled because they already failed many times, so there is nothing the crawler can do but skip them and continue crawling; otherwise there would be an infinite loop and the crawler would never finish.
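A sketch of that named-queue approach (the queue name 'failed-requests' and the start URL are assumptions); named storages are not purged on start, so the collected failures survive between runs and can feed a follow-up crawl once the code is fixed:
```ts
// Sketch only: collect exhausted requests in a named queue for a later run.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const failedQueue = await RequestQueue.open('failed-requests'); // named => persisted on disk

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
    async failedRequestHandler({ request, log }) {
        log.warning(`Giving up on ${request.url}, saving it for a later run`);
        await failedQueue.addRequest({ url: request.url, userData: request.userData });
    },
});

await crawler.run(['https://example.com']); // hypothetical start URL

// Later, after fixing whatever made these requests fail, a second crawler
// can consume the named queue directly:
// const retry = new PlaywrightCrawler({ requestQueue: failedQueue, requestHandler });
// await retry.run();
```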
wise-white (OP) · 2y ago
How do I reuse the request queue? Do I just add a link to it and it will be persisted across different runs? Can I specifically pick those out, for example from a different dataset or something else?
