sensitive-blue•3y ago
To eliminate duplicates caused by request retries, do I maybe need to set a timeout between them?
The issue is that when a job fails, it gets restarted up to as many times as specified in maxRequestRetries. However, if the restarted jobs succeed, I end up with multiple identical results in the output, whereas I only need one.
For example: the first job fails and gets restarted (which is intended), but because it restarts successfully, say, two times, I receive two identical results, when I actually need only one.
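(For reference, a minimal sketch of the setup being described, assuming Crawlee's CheerioCrawler; the URL and selector are placeholders, not taken from the original post.)

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // A request whose handler throws is retried up to this many times.
    maxRequestRetries: 3,
    async requestHandler({ request, $ }) {
        const title = $('title').text();
        // Push the result only once the handler has everything it needs;
        // a request that fails before this line and succeeds on a retry
        // still contributes a single dataset item.
        await Dataset.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);
```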
sensitive-blueOP•3y ago
For example: input length = 200 links, output = 215 objects (it should be 200).
rising-crimson•3y ago
Hey there! The request queue deduplicates URLs, but I see you're explicitly setting the uniqueKey for the requests - why? From what I can see, the problem is that there are probably duplicate URLs being fed to the crawler as different requests, so when they succeed, they expectedly produce duplicates in the dataset.
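(A small sketch of the difference being described, assuming Crawlee; the `inputUrls` list and the `#${i}` key suffix are illustrative, not from the original thread.)

```ts
import { CheerioCrawler, Dataset } from 'crawlee';

// An input list that happens to contain the same URL twice.
const inputUrls = ['https://example.com/a', 'https://example.com/a'];

const crawler = new CheerioCrawler({
    async requestHandler({ request, $ }) {
        await Dataset.pushData({ url: request.url, title: $('title').text() });
    },
});

// Passing plain URLs (no explicit uniqueKey): the request queue deduplicates
// them by their normalized URL, so the duplicate above is enqueued only once
// and produces a single dataset item.
await crawler.run(inputUrls);

// By contrast, giving each request its own explicit uniqueKey makes the queue
// treat them as distinct even though the URL is identical, so both are
// processed and both push a result - which matches the extra items observed:
// await crawler.run(inputUrls.map((url, i) => ({ url, uniqueKey: `${url}#${i}` })));
```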