xenophobic-harlequin · 2y ago

Persist the RequestQueue (avoiding starting over)

Is it possible to persist the RequestQueue so that when restarting a script, instead of starting everything from scratch, it just keeps scraping the URLs already in the queue? I know CRAWLEE_PURGE_ON_START exists, but I'm not sure whether it affects the RequestQueue. Also, since an initialUrl is passed to await crawler.run([initialUrl]), the idea would be to skip that in case the RequestQueue already has URLs. Is that possible?
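Something like this rough sketch is what I'm imagining (the start URL is just a placeholder, and it assumes the default on-disk storage with purging disabled via CRAWLEE_PURGE_ON_START=0):

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Run with CRAWLEE_PURGE_ON_START=0 so ./storage/request_queues
// is kept between runs instead of being purged on start.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
        // ... scrape the page, then enqueue more URLs
        await enqueueLinks();
    },
});

// Seed the start URL only when the persisted queue has nothing pending,
// otherwise just continue with whatever is left in it.
if (await requestQueue.isEmpty()) {
    await crawler.run(['https://example.com']);
} else {
    await crawler.run();
}
```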
3 Replies
HonzaS · 2y ago
1. Yes, it affects the RequestQueue.
2. The RequestQueue deduplicates on insert, so it will not insert a request with the same URL again (unless you change its uniqueKey, which is its URL by default).
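To illustrate the deduplication (a minimal sketch with a placeholder URL):

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// The first insert goes through; the second one with the same URL
// (and therefore the same default uniqueKey) is deduplicated.
await queue.addRequest({ url: 'https://example.com/page' });
const info = await queue.addRequest({ url: 'https://example.com/page' });
console.log(info.wasAlreadyPresent); // true

// Overriding uniqueKey makes the queue treat it as a brand new request.
await queue.addRequest({
    url: 'https://example.com/page',
    uniqueKey: 'https://example.com/page#second-pass',
});
```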
xenophobic-harlequin (OP) · 2y ago
Thank you @HonzaS! What if requestHandler runs but something fails and I want to run it again on the same URL — what should I do?
HonzaS · 2y ago
Add it again with a different uniqueKey.
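For example, a rough sketch (not the only way to do it) that re-enqueues the URL from failedRequestHandler once the automatic retries are exhausted; the URL is a placeholder:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // ... scraping logic that might throw
    },
    // Once the built-in retries are exhausted, re-enqueue the same URL
    // under a fresh uniqueKey so deduplication does not skip it.
    // (In real code you would cap how many times this can happen.)
    async failedRequestHandler({ request, crawler }) {
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.uniqueKey}#retry`,
        }]);
    },
});

await crawler.run(['https://example.com']);
```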
