xenophobic-harlequin · 2y ago

Persist the RequestQueue (avoiding starting over)

Is it possible to persist the RequestQueue so that when restarting a script, instead of starting everything from scratch, it just keeps scraping the URLs already in the queue? I know CRAWLEE_PURGE_ON_START exists, but I'm not sure whether it affects the RequestQueue. Also, since an initialUrl is passed to await crawler.run([initialUrl]), the idea would be to skip that in case the RequestQueue already has URLs. Is that possible?
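Something like this rough sketch is what I'm imagining (the start URL is just a placeholder, and it assumes the default on-disk storage with purging disabled via CRAWLEE_PURGE_ON_START=0):

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

// Run with CRAWLEE_PURGE_ON_START=0 so ./storage/request_queues
// is kept between runs instead of being purged on start.
const requestQueue = await RequestQueue.open();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
        // ... scrape the page, then enqueue more URLs
        await enqueueLinks();
    },
});

// Seed the start URL only when the persisted queue has nothing pending,
// otherwise just continue with whatever is left in it.
if (await requestQueue.isEmpty()) {
    await crawler.run(['https://example.com']);
} else {
    await crawler.run();
}
```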
3 Replies
HonzaS · 2y ago
1. Yes, it affects the RequestQueue.
2. The RequestQueue deduplicates on insert, so it will not insert a request with the same URL again (unless you change its uniqueKey, which is its URL by default).
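To illustrate the deduplication (a minimal sketch with a placeholder URL):

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// The first insert goes through; the second one with the same URL
// (and therefore the same default uniqueKey) is deduplicated.
await queue.addRequest({ url: 'https://example.com/page' });
const info = await queue.addRequest({ url: 'https://example.com/page' });
console.log(info.wasAlreadyPresent); // true

// Overriding uniqueKey makes the queue treat it as a brand new request.
await queue.addRequest({
    url: 'https://example.com/page',
    uniqueKey: 'https://example.com/page#second-pass',
});
```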
xenophobic-harlequin (OP) · 2y ago
Thank you @HonzaS! What if requestHandler runs but something fails and I want to run it again on the same URL — what should I do?
HonzaS · 2y ago
Add it again with a different uniqueKey.
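For example, a rough sketch (not the only way to do it) that re-enqueues the URL from failedRequestHandler once the automatic retries are exhausted; the URL is a placeholder:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // ... scraping logic that might throw
    },
    // Once the built-in retries are exhausted, re-enqueue the same URL
    // under a fresh uniqueKey so deduplication does not skip it.
    // (In real code you would cap how many times this can happen.)
    async failedRequestHandler({ request, crawler }) {
        await crawler.addRequests([{
            url: request.url,
            uniqueKey: `${request.uniqueKey}#retry`,
        }]);
    },
});

await crawler.run(['https://example.com']);
```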
