optimistic-gold•13mo ago
remove uniqueKey from queue blacklist
Hi all,
I'm scraping a weird website whose file attachment links rotate every 5 minutes or so, e.g.
https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf
Everything between serve-file and file changes regularly.
My strategy to deal with this is to calculate the unique key from the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that unique key and replace them with the new URL.
My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed?
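A minimal sketch of that key calculation, assuming the rotating tokens always sit between the serve-file and file path segments as in the example URL. The helper name stableUniqueKey is made up for illustration and is not part of Crawlee:

```typescript
// Hypothetical helper: build a uniqueKey from the parts of the URL that do not rotate.
// Assumption: everything between the "serve-file" and "file" path segments is volatile.
function stableUniqueKey(rawUrl: string): string {
    const url = new URL(rawUrl);
    const segments = url.pathname.split('/');
    const start = segments.indexOf('serve-file');
    // Fall back to the full URL when the expected markers are missing.
    if (start === -1) return rawUrl;
    const end = segments.indexOf('file', start + 1);
    if (end === -1) return rawUrl;
    // Keep everything up to "serve-file" and everything from "file" onwards.
    const stable = [...segments.slice(0, start + 1), ...segments.slice(end)];
    return `${url.origin}${stable.join('/')}`;
}
```

The result could then be passed as the uniqueKey when enqueuing, for example { url, uniqueKey: stableUniqueKey(url) }, so that two rotated versions of the same link map to the same key.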
Thanks!
Hi @Crafty,
You can remove a request from the RequestQueue based on its id through the API.
If you know only the uniqueKey, I suggest adding a new Request to the RequestQueue with the same uniqueKey. The response will have the attribute wasAlreadyPresent set to true, but you should also get back the stored Request data (including its id).
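A rough sketch of this lookup, combined with the delete step, might look like the following. The QueueClientLike interface and the replaceByUniqueKey helper are hypothetical names for illustration; the interface only mirrors the addRequest and deleteRequest calls that apify-client's request queue client is assumed to expose, so check it against the version you use. Whether the queue accepts the same uniqueKey again after a delete is also an assumption here.

```typescript
// Assumed minimal shape of a request queue client (modelled on apify-client; verify before use).
interface QueueClientLike {
    addRequest(request: { url: string; uniqueKey: string }): Promise<{ requestId: string; wasAlreadyPresent: boolean }>;
    deleteRequest(id: string): Promise<void>;
}

// Hypothetical helper: swap the stored request for a uniqueKey with a fresh URL.
async function replaceByUniqueKey(queue: QueueClientLike, uniqueKey: string, newUrl: string): Promise<string> {
    // Adding with an existing uniqueKey should not create a duplicate;
    // the response is expected to carry the id of the stored request.
    const existing = await queue.addRequest({ url: newUrl, uniqueKey });
    if (!existing.wasAlreadyPresent) {
        // Nothing was stored under this key, so the new request is already queued.
        return existing.requestId;
    }
    // Delete the stale request by id, then enqueue the new URL under the same key.
    await queue.deleteRequest(existing.requestId);
    const fresh = await queue.addRequest({ url: newUrl, uniqueKey });
    return fresh.requestId;
}
```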
When you have the Request id, you can send a DELETE HTTP request (see https://docs.apify.com/api/v2#tag/Request-queuesQueue/operation/requestQueue_request_delete) to delete it.
optimistic-goldOP•12mo ago
Hi @Pepa J, thanks for the help. I am actually working with only Crawlee and not Apify, but I found a method along the same lines. May I suggest a feature for either the request queue or the request queue client to more easily query a request by its uniqueKey?
@Crafty Ah, I am sorry for the API mention.
I believe there are some architectural decisions around this. What you can do is create a map in memory and save the uniqueKey->requestId relation there as a key-value pair. 🤔
optimistic-goldOP•12mo ago
Ah OK, sounds good. It would be interesting to know if there is a reason for it; surely addRequest must be doing it under the hood.
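For reference, the in-memory map suggested earlier in the thread could be sketched roughly as below. RequestIdIndex is a made-up name, and the sketch assumes the result of addRequest exposes a requestId field to record. The map lives only in process memory, so it would need to be rebuilt or persisted separately if the crawler restarts.

```typescript
// Hypothetical in-memory index from uniqueKey to the id assigned by the queue.
class RequestIdIndex {
    private readonly ids = new Map<string, string>();

    // Record the id returned when the request was enqueued, e.g. (assumed usage):
    //   const info = await requestQueue.addRequest({ url, uniqueKey });
    //   index.remember(uniqueKey, info.requestId);
    remember(uniqueKey: string, requestId: string): void {
        this.ids.set(uniqueKey, requestId);
    }

    // Look up the id later, for instance once the URL has rotated.
    lookup(uniqueKey: string): string | undefined {
        return this.ids.get(uniqueKey);
    }

    // Drop the entry after the stale request has been removed; returns false if it was unknown.
    forget(uniqueKey: string): boolean {
        return this.ids.delete(uniqueKey);
    }
}
```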