xenophobic-harlequin•2y ago
Persist the RequestQueue (avoid starting over)
Is it possible to persist the RequestQueue so that when restarting a script, instead of starting everything from scratch, it just keeps scraping the URLs already in the queue? I know
CRAWLEE_PURGE_ON_START
exists, but I'm not sure whether it affects the RequestQueue. Also, since an initialUrl is passed to await crawler.run([initialUrl]),
the idea would be to skip that in case the RequestQueue already has URLs. Is that possible?
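Roughly something like this is what I have in mind (a minimal sketch, assuming a CheerioCrawler in an ES module with top-level await; the initialUrl and the getInfo() check are just one way I imagine doing it):

```ts
import { CheerioCrawler, RequestQueue } from 'crawlee';

const initialUrl = 'https://example.com'; // placeholder

// Open the default, persisted request queue.
const requestQueue = await RequestQueue.open();
const info = await requestQueue.getInfo();

const crawler = new CheerioCrawler({
    requestQueue,
    async requestHandler({ enqueueLinks }) {
        // ... scrape the page and enqueue more links ...
        await enqueueLinks();
    },
});

// Seed the initial URL only if the queue has never been populated;
// otherwise continue from whatever is still pending in the queue.
if (!info || info.totalRequestCount === 0) {
    await crawler.run([initialUrl]);
} else {
    await crawler.run();
}
```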
1. It affects the RequestQueue.
2. The RequestQueue deduplicates requests on insert, so it will not add a request with the same URL again (unless you change its uniqueKey, which is the URL by default).
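For illustration, a rough sketch (assuming the default on-disk storage under ./storage, and that the global Configuration option 'purgeOnStart' is what CRAWLEE_PURGE_ON_START maps to):

```ts
import { Configuration, RequestQueue } from 'crawlee';

// Keep storages between runs; same effect as setting CRAWLEE_PURGE_ON_START=0.
Configuration.getGlobalConfig().set('purgeOnStart', false);

const queue = await RequestQueue.open();

// The second add is skipped, because a request with the same uniqueKey
// (the URL by default) is already in the queue.
const first = await queue.addRequest({ url: 'https://example.com/page' });
const second = await queue.addRequest({ url: 'https://example.com/page' });
console.log(first.wasAlreadyPresent);  // false
console.log(second.wasAlreadyPresent); // true
```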
xenophobic-harlequinOP•2y ago
Thank you @HonzaS! What if requestHandler runs but something fails and I want to run it again on the same URL? What should I do?
Add it with a different uniqueKey.
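For example, something along these lines (the URL and the '#retry-1' suffix are placeholders; any uniqueKey not used before will do):

```ts
import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Re-enqueue an already-handled URL under a new uniqueKey so the
// deduplication does not treat it as a duplicate.
await queue.addRequest({
    url: 'https://example.com/page-that-failed',
    uniqueKey: 'https://example.com/page-that-failed#retry-1',
});
```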