Managing duplicate requests with a custom RequestQueue, but it seems off
Description
It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though my RequestQueue list has many more job IDs.
Am I using the RequestQueue correctly? I am not using the default one from the crawler because my scraping logic does not allow it.
9 Replies
wow, just found the bug: running the crawler with
apify run --purge
is not purging all the request_queues,
so requests from previous runs were still stored and everything was being treated as a duplicate
how do I purge that automatically?
apify run --purge does that, but sometimes it doesn't work (rarely; I only ran into it in some runs). You can use rm on the storage folder before the run.
On my end it only deletes the default folder, not the custom one.
Hi, apify run --purge only clears the default storages, so any named request queues you create will not be removed.
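Since the CLI only purges the default storage, one option is to drop the named queue yourself at startup. A minimal sketch, assuming a hypothetical queue name 'job-queue', using Crawlee's RequestQueue.open() and drop():

```ts
import { RequestQueue } from 'crawlee';

// Drop the named queue at the start of the run so requests stored
// by previous runs are not treated as duplicates.
const stale = await RequestQueue.open('job-queue');
await stale.drop();

// Reopen it empty for the current run.
const queue = await RequestQueue.open('job-queue');
```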
Will it cause problems if the actor is shipped on Apify for users?
Will the non-default storage be deleted on every new run on Apify, or will it remain in the same folder for every run?
Also, is this the correct way to manage duplicate queries for my crawler without relying on the default request queue from the crawler? i.e. is it safe from race conditions?
It really depends on the use case and the code.
If you publish an actor on the Apify platform with a named RequestQueue (e.g. "job-deduplication-queue"), it will persist exactly as is between runs. Every new invocation of your actor reopens the same request queue, and nothing in it is deleted automatically. If you only need it for a single run, you should use the unnamed (default) request queue.
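For illustration, a sketch of the difference (the queue name here is just an example):

```ts
import { RequestQueue } from 'crawlee';

// Unnamed (default) queue: scoped to a single run. Locally it is
// cleared by `apify run --purge`; on the platform each run gets
// its own fresh default storages.
const defaultQueue = await RequestQueue.open();

// Named queue: persists across runs until you delete it explicitly.
const namedQueue = await RequestQueue.open('job-deduplication-queue');
```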
Additionally, if you are just trying to avoid duplicate requests, you can use useExtendedUniqueKey or uniqueKey when enqueuing a new request.
You can get more info about these here: https://crawlee.dev/js/api/core/interface/RequestOptions#useExtendedUniqueKey
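A short sketch of both options when enqueuing (the URLs and keys are made up):

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request }) {
        // process the page...
    },
});

await crawler.addRequests([
    // Requests sharing a uniqueKey are deduplicated by the queue,
    // regardless of differences in the URL.
    { url: 'https://example.com/job?id=123&ref=a', uniqueKey: 'job-123' },
    // useExtendedUniqueKey derives the key from method + payload + URL,
    // so two POSTs to the same URL with different bodies stay separate.
    { url: 'https://example.com/search', method: 'POST', payload: 'q=crawler', useExtendedUniqueKey: true },
]);

await crawler.run();
```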
If you need to track this across different runs, then you could also use a named key-value store with the stored IDs rather than using a request queue.
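A minimal sketch of that idea, with a hypothetical store name 'seen-job-ids' and made-up IDs:

```ts
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('seen-job-ids');

// IDs discovered in the current run (hypothetical values).
const incoming = ['job-123', 'job-456'];

// Load the IDs recorded by previous runs and keep only unseen ones.
const seen = (await store.getValue<string[]>('ids')) ?? [];
const fresh = incoming.filter((id) => !seen.includes(id));

// Persist the merged list for the next run.
await store.setValue('ids', [...seen, ...fresh]);
```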
is the key-value store safe from race conditions?
No. Two runs can overwrite the same key if they write at the same time.
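To illustrate the race, using the same hypothetical store as above: if two runs both read before either writes, the last write wins and one ID is silently lost.

```ts
import { KeyValueStore } from 'crawlee';

const store = await KeyValueStore.open('seen-job-ids');

// Classic read-modify-write race: two concurrent runs both read
// the same old list here...
const seen = (await store.getValue<string[]>('ids')) ?? [];
seen.push('job-789'); // ...each appends its own ID...
await store.setValue('ids', seen); // ...and the last write overwrites the other.
```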