wise-white•2y ago
Can’t find info on url base of a crawler
how in Crawlee (Playwright) can I handle links that were already visited by the crawler? I mean, how do I avoid repeating the links it already handled, based on what is persisted on disk? Is
purgeRequestQueue: false
sufficient? Do I just not purge the data that is already done, and it is handled automatically?
So for example I could crawl in chunks: the first 50 URLs collected from the page dynamically during the crawl run, and during a second run those 50 URLs would be skipped and the next 50 processed, etc.
Hello @akephalos, someone from the team will answer your query soon. Thanks :)
wise-whiteOP•2y ago
thanks
hey, try setting
purgeOnStart
to false in the global config: https://crawlee.dev/api/core/class/Configuration#:%7E:text=CRAWLEE_PURGE_ON_START
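For example, a minimal sketch of what that could look like (the chunk size of 50 comes from your question; the start URL is just a placeholder):
```js
import { PlaywrightCrawler, Configuration } from 'crawlee';

// Keep the request queue (and its handled/pending state) between runs
// instead of wiping the default storage on startup.
const config = new Configuration({ purgeOnStart: false });

const crawler = new PlaywrightCrawler({
    // Process at most 50 requests per run; depending on how previously
    // handled requests are counted, you may need to adjust this between runs.
    maxRequestsPerCrawl: 50,
    async requestHandler({ enqueueLinks }) {
        // URLs already handled in previous runs stay deduplicated in the queue.
        await enqueueLinks();
    },
}, config);

await crawler.run(['https://example.com']); // placeholder start URL
```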
wise-whiteOP•2y ago
How can the links that failed and reached
failedRequestHandler
be re-visited in that way? Is this done automatically?

If you do not purge the requestQueue, the crawler will just continue where it stopped before. Requests that reached failedRequestHandler are considered done, so they will not be revisited.
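As a small illustration (nothing project-specific, just the default queue), you can inspect what the persisted queue already counts as handled:
```js
import { RequestQueue } from 'crawlee';

// Open the default request queue persisted by the previous run
// (it only survives between runs if purging is disabled).
const queue = await RequestQueue.open();

// Requests that were handled, including the ones that ended up in
// failedRequestHandler, count as done and will not run again.
console.log('already handled:', await queue.handledCount());
```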
wise-whiteOP•2y ago
how to revisit them?
You can put them back into the queue in the failedRequestHandler
under a different uniqueKey, of course.
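Something along these lines (the `-retry` suffix is just an illustrative convention, not an API):
```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks();
    },
    // Runs after a request has exhausted all of its retries.
    async failedRequestHandler({ request, crawler }) {
        // Re-add the same URL under a new uniqueKey so the queue's
        // deduplication does not treat it as already handled. Note that
        // without some cap this will keep re-adding a permanently broken URL.
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.uniqueKey}-retry`,
        }]);
    },
});

await crawler.run(['https://example.com']); // placeholder start URL
```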
wise-whiteOP•2y ago
thank you ill try it out
@akephalos just advanced to level 1! Thanks for your contributions! 🎉
wise-whiteOP•2y ago
@HonzaS can I have two different datasets, for successful and failed links? And then go back to the failed dataset and enqueue it again?
How are the failed links meant to be handled? Just forget about them? Or are they revisited by some option I'm missing?
You can build almost any logic, so it is up to you what to do with failed links. They failed after all the retries for some reason. For example, you can put those failed requests into another named request queue, fix the code of the crawler so they will not fail again, and use that failed request queue for another run. There is no out-of-the-box option for revisiting failed requests that I know of.
They are considered handled because they failed many times, so there is nothing the crawler can do but skip them and continue the crawl. Otherwise there would be an infinite loop and the crawler would never finish.
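A rough sketch of that approach (the queue name `failed-requests` and the start URL are placeholders):
```js
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// A named queue; unlike the default one, named storages are not purged on start.
const failedQueue = await RequestQueue.open('failed-requests');

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks, pushData }) {
        await enqueueLinks();
        await pushData({ url: request.url, status: 'ok' });
    },
    async failedRequestHandler({ request }) {
        // Stash the failed URL for a later run with fixed crawler code.
        await failedQueue.addRequest({ url: request.url });
    },
});

await crawler.run(['https://example.com']); // placeholder start URL

// Later, after fixing whatever made them fail, point a crawler at the failed queue:
// const retryCrawler = new PlaywrightCrawler({ requestQueue: failedQueue, requestHandler: ... });
// await retryCrawler.run();
```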
wise-whiteOP•2y ago
How do I reuse the request queue? Do I just add a link to it and it will be persisted through different runs?
Can I specifically pick those out? Like from a different dataset or something else?
This actor might be handy:
https://apify.com/lukaskrivka/rebirth-failed-requests