vicious-gold · 2y ago

Currently I'm running the Google Crawler

I'm currently running the Google Crawler. I'm at 60,000 requests and noticed that there is a search term in the list I want to skip. As far as I can tell, there is currently no way to stop the run, edit its settings, and resurrect it. I also can't stop the run, edit the settings, and start a new run with an option like "Don't crawl pages already crawled in run #x". That leaves me with only two options: stop the run and start over (costly), or let it keep running with the unwanted term (costly as well).

An option to save all crawled URLs of an Actor in a central place, combined with a setting "don't crawl those URLs again", would be a huge improvement in cases like this. It would also help when an Actor can't crawl a whole country at once and has to go city by city: the cities overlap, so each run unavoidably re-crawls duplicate URLs, which costs both money and time.

Cheers.
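As a stopgap until something like this exists natively, the "central list of crawled URLs" can be approximated today with an Apify named key-value store, which persists across runs. This is a hypothetical sketch, not a platform feature: the store name `crawled-urls`, the record key `seen`, and the helper names are all made up. The idea is to filter the Actor's input URL list against the central set before starting each new run.

```python
def filter_new_urls(candidates: list[str], seen: set[str]) -> list[str]:
    """Keep only URLs that no previous run has crawled."""
    return [u for u in candidates if u not in seen]

def load_seen_and_filter(token: str, candidates: list[str]) -> list[str]:
    # Import here so the pure filtering logic above works without the package.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    # Named stores persist across runs, so one can act as the central URL list.
    # "crawled-urls" and the "seen" key are illustrative names, not conventions.
    store = client.key_value_stores().get_or_create(name="crawled-urls")
    record = client.key_value_store(store["id"]).get_record("seen")
    seen = set(record["value"]) if record else set()
    return filter_new_urls(candidates, seen)
```

After a run finishes, you would merge its crawled URLs back into the `seen` record (e.g. via `set_record`) so the next run's input can be pruned the same way.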
Lukas Krivka · 2y ago
This can technically be done by aborting the run, then removing the unwanted requests from the queue via the API, and then resurrecting the run. It does require some knowledge of the API and a bit about the specific Actor. If you message me privately, I might be able to help.
