optimistic-gold
optimistic-gold•13mo ago

remove uniqueKey from queue blacklist

Hi all, I'm scraping a weird website whose file attachment links rotate every 5 minutes or so, e.g. https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf (everything between serve-file and file changes regularly). My strategy is to calculate the uniqueKey from the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that uniqueKey and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed? Thanks!
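For context, a minimal sketch of that uniqueKey idea, assuming the rotating segment always sits between /serve-file/ and /file/ (stableUniqueKey and fileUrl are illustrative names, not Crawlee API):

// A minimal sketch: collapse the rotating URL segment into a placeholder
// so the uniqueKey stays stable across rotations. Assumes the rotating
// part is always between "/serve-file/" and "/file/".
function stableUniqueKey(url) {
  return url.replace(/\/serve-file\/.+?\/file\//, '/serve-file/<rotating>/file/');
}

// Enqueue with the stable key so a later URL rotation maps to the same request.
await crawler.addRequests([{
  url: fileUrl,
  uniqueKey: stableUniqueKey(fileUrl),
}]);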
5 Replies
Hall
Hall•13mo ago
Post created!
This post has been synced with the Apify community site and will be indexed by search engines
Pepa J
Pepa J•12mo ago
Hi @Crafty , You can remove a request from the RequestQueue by its id through the API. If you only know the uniqueKey, I suggest adding a new Request to the RequestQueue with the same uniqueKey: the response will have the attribute wasAlreadyPresent set to true, and it also contains the stored Request data (including its id). Once you have the Request id, you can send a DELETE HTTP request to delete it, see https://docs.apify.com/api/v2#tag/Request-queuesQueue/operation/requestQueue_request_delete
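For illustration, a hedged sketch of that flow with the apify-client package (the queue id, URL, and uniqueKey values are placeholders):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const queue = client.requestQueue('MY_QUEUE_ID'); // placeholder queue id

// Re-adding a request with the same uniqueKey does not create a duplicate;
// the response reveals the id of the request that is already stored.
const { requestId, wasAlreadyPresent } = await queue.addRequest({
  url: 'https://example.com',
  uniqueKey: 'my-stable-key',
});

if (wasAlreadyPresent) {
  // Same effect as the DELETE endpoint linked above.
  await queue.deleteRequest(requestId);
}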
optimistic-gold
optimistic-goldOP•12mo ago
Hi @Pepa J , thanks for the help. I am actually working with only Crawlee and not Apify, but I found a method along the same lines. May I suggest a feature for either the request queue or the request queue client to more easily look up a request by its uniqueKey?
const requestQueue = await crawler.getRequestQueue();

// Re-adding a request with an existing uniqueKey is a no-op for the queue,
// but the returned info exposes the id of the request already stored.
const result = await requestQueue.addRequest({ url: 'https://google.com', uniqueKey: 'aaa', label: 'secondary' });
log.info('result', result);

if (result.wasAlreadyPresent) {
  log.info('already present');
  // Fetch the full stored request by the id we just learned.
  const request = await requestQueue.getRequest(result.requestId);
  log.info('request', request);
}
Pepa J
Pepa J•12mo ago
@Crafty Ah, I am sorry for the API mention. I believe there are some architectural decisions behind this. What you can do is create an in-memory map and store the uniqueKey -> requestId relation there as key-value pairs. 🤔
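A minimal sketch of that idea in plain JavaScript (fileUrl and stableKey are illustrative names):

// Remember each request's id as it is enqueued, keyed by uniqueKey.
const idsByUniqueKey = new Map();

const info = await requestQueue.addRequest({ url: fileUrl, uniqueKey: stableKey });
idsByUniqueKey.set(stableKey, info.requestId);

// Later, look the id up without another addRequest round trip.
const requestId = idsByUniqueKey.get(stableKey);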
optimistic-gold
optimistic-goldOP•12mo ago
Ah OK, sounds good. It would be interesting to know if there is a reason for it; surely addRequest must be doing this lookup under the hood.
