wise-white · 2y ago

Can't find info on the URL base of a crawler

How can I handle links in Crawlee (Playwright) that were already visited by the crawler? That is, how do I avoid repeating links it has already handled, based on what is persisted on disk? Is purgeRequestQueue: false sufficient? Do I just avoid purging the data that is already done, and the rest is handled automatically? For example, I'd like to crawl in chunks: the first run handles 50 URLs collected dynamically from the page during the crawl, and the second run skips those 50 and takes the next 50, and so on.
13 Replies
Saurav Jain · 2y ago
Hello @akephalos, someone from the team will answer your query soon. Thanks! :)
wise-white (OP) · 2y ago
thanks
lemurio · 2y ago
hey, try setting purgeOnStart to false in the global config: https://crawlee.dev/api/core/class/Configuration#:%7E:text=CRAWLEE_PURGE_ON_START
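A minimal sketch of what this can look like, assuming Crawlee v3; the start URL and the chunk size of 50 are placeholders matching the question, and the same setting can also be controlled via the CRAWLEE_PURGE_ON_START environment variable:
```ts
// Sketch only: Crawlee v3, hypothetical start URL.
import { Configuration, PlaywrightCrawler } from 'crawlee';

// Keep the request queue (and other default storages) on disk between runs,
// so requests that were already handled are not crawled again.
Configuration.getGlobalConfig().set('purgeOnStart', false);

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 50, // roughly one "chunk" of 50 requests per run
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks(); // newly discovered links are deduplicated by uniqueKey
    },
});

await crawler.run(['https://example.com']); // hypothetical entry point
```
With the queue persisted, a second run picks up the next pending requests instead of starting over.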
wise-white (OP) · 2y ago
How can the links that failed and reached failedRequestHandler be revisited in that way? Is this done automatically?
HonzaS · 2y ago
If you do not purge the request queue, the crawler will just continue where it stopped before. Requests that reached failedRequestHandler are considered done, so they will not be revisited.
wise-white (OP) · 2y ago
How can I revisit them?
HonzaS · 2y ago
You can add them to the queue again in the failedRequestHandler, under a different uniqueKey of course.
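For illustration, a minimal sketch of what this re-enqueueing could look like; the `retry:` uniqueKey prefix and the start URL are assumptions for the example, not Crawlee conventions:
```ts
// Sketch only: re-adding exhausted requests under a fresh uniqueKey.
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
    // Called after all automatic retries have been used up.
    async failedRequestHandler({ request, log }) {
        // Guard against re-enqueueing forever (hypothetical 'retry:' prefix).
        if (request.uniqueKey.startsWith('retry:')) return;
        log.warning(`Re-enqueueing failed request: ${request.url}`);
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `retry:${request.uniqueKey}`, // new key, so deduplication lets it through
            userData: request.userData,
        }]);
    },
});

await crawler.run(['https://example.com']); // hypothetical start URL
```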
wise-white (OP) · 2y ago
Thank you, I'll try it out.
wise-white (OP) · 2y ago
@HonzaS can I have two different datasets, one for successful and one for failed links, and then go back to the failed dataset and enqueue it again? How are the failed links meant to be handled? Just forget about them? Or are they revisited by some option I'm missing?
HonzaS · 2y ago
You can build almost any logic, so it is up to you what to do with failed links. They failed after all the retries for some reason. For example, you can put those failed requests into another named request queue, fix the crawler code so they will not fail again, and then use that failed-request queue as the input for another run. As far as I know, there is no out-of-the-box option to revisit failed requests. They are considered handled because they already failed many times, so there is nothing the crawler can do but skip them and continue crawling; otherwise there would be an infinite loop and the crawler would never finish.
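A sketch of that named-queue approach (the queue name 'failed-requests' and the start URL are assumptions); named storages are not purged on start, so the collected failures survive between runs and can feed a follow-up crawl once the code is fixed:
```ts
// Sketch only: collect exhausted requests in a named queue for a later run.
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

const failedQueue = await RequestQueue.open('failed-requests'); // named => persisted on disk

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, log }) {
        log.info(`Processing ${request.url}`);
    },
    async failedRequestHandler({ request, log }) {
        log.warning(`Giving up on ${request.url}, saving it for a later run`);
        await failedQueue.addRequest({ url: request.url, userData: request.userData });
    },
});

await crawler.run(['https://example.com']); // hypothetical start URL

// Later, after fixing whatever made these requests fail, a second crawler
// can consume the named queue directly:
// const retry = new PlaywrightCrawler({ requestQueue: failedQueue, requestHandler });
// await retry.run();
```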
wise-white (OP) · 2y ago
How do I reuse the request queue? Do I just add a link to it and it will be persisted across different runs? Can I specifically pick those out, for example from a different dataset or something else?
