optimistic-gold
optimistic-gold•13mo ago

remove uniqueKey from queue blacklist

Hi all, I'm scraping a weird website whose file attachment links rotate every 5 minutes or so, e.g. https://dca-global.org/serve-file/e1725459845/l1714045338/da/c1/Rvfa9Lo-AzHHX0NYJ3f-Tx3FrxSI8-N-Y5ytfS8Prak/1/37/file/1704801502dca_media_publications_2024_v10_lr%20read%20only.pdf (everything between serve-file and file changes regularly). My strategy is to calculate the uniqueKey from the 'stable' parts of the URL. Then, when I detect that the URL has changed, I can remove any queued requests with that uniqueKey and replace them with the new URL.

My question is: if a request has hit its retry limit and has been 'blacklisted' from the request queue, how can I remove it so the new URL can be processed? Thanks!
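For context, a minimal sketch of that uniqueKey idea, assuming the rotating segment always sits between /serve-file/ and /file/ (stableUniqueKey and fileUrl are illustrative names, not Crawlee API):

// A minimal sketch: collapse the rotating URL segment into a placeholder
// so the uniqueKey stays stable across rotations. Assumes the rotating
// part is always between "/serve-file/" and "/file/".
function stableUniqueKey(url) {
  return url.replace(/\/serve-file\/.+?\/file\//, '/serve-file/<rotating>/file/');
}

// Enqueue with the stable key so a later URL rotation maps to the same request.
await crawler.addRequests([{
  url: fileUrl,
  uniqueKey: stableUniqueKey(fileUrl),
}]);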
5 Replies
Hall
Hall•13mo ago
Post created!
This post has been synced with the Apify community site and will be indexed by search engines
Pepa J
Pepa J•12mo ago
Hi @Crafty , You can remove a request from the RequestQueue by its id through the API. If you only know the uniqueKey, I suggest adding a new Request to the RequestQueue with the same uniqueKey: the response will have the attribute wasAlreadyPresent set to true, and it also contains the stored Request data (including its id). Once you have the Request id, you can send a DELETE HTTP request to delete it, see https://docs.apify.com/api/v2#tag/Request-queuesQueue/operation/requestQueue_request_delete
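For illustration, a hedged sketch of that flow with the apify-client package (the queue id, URL, and uniqueKey values are placeholders):

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const queue = client.requestQueue('MY_QUEUE_ID'); // placeholder queue id

// Re-adding a request with the same uniqueKey does not create a duplicate;
// the response reveals the id of the request that is already stored.
const { requestId, wasAlreadyPresent } = await queue.addRequest({
  url: 'https://example.com',
  uniqueKey: 'my-stable-key',
});

if (wasAlreadyPresent) {
  // Same effect as the DELETE endpoint linked above.
  await queue.deleteRequest(requestId);
}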
optimistic-gold
optimistic-goldOP•12mo ago
Hi @Pepa J , thanks for the help. I am actually working with only Crawlee and not Apify, but I found a method along the same lines. May I suggest a feature for either the request queue or the request queue client to more easily look up a request by its uniqueKey?
const requestQueue = await crawler.getRequestQueue();

// Re-adding a request with an existing uniqueKey is a no-op for the queue,
// but the returned info exposes the id of the request already stored.
const result = await requestQueue.addRequest({ url: 'https://google.com', uniqueKey: 'aaa', label: 'secondary' });
log.info('result', result);

if (result.wasAlreadyPresent) {
  log.info('already present');
  // Fetch the full stored request by the id we just learned.
  const request = await requestQueue.getRequest(result.requestId);
  log.info('request', request);
}
Pepa J
Pepa J•12mo ago
@Crafty Ah, I am sorry for the API mention. I believe there are some architectural decisions behind this. What you can do is create an in-memory map and store the uniqueKey -> requestId relation there as key-value pairs. 🤔
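A minimal sketch of that idea in plain JavaScript (fileUrl and stableKey are illustrative names):

// Remember each request's id as it is enqueued, keyed by uniqueKey.
const idsByUniqueKey = new Map();

const info = await requestQueue.addRequest({ url: fileUrl, uniqueKey: stableKey });
idsByUniqueKey.set(stableKey, info.requestId);

// Later, look the id up without another addRequest round trip.
const requestId = idsByUniqueKey.get(stableKey);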
optimistic-gold
optimistic-goldOP•12mo ago
Ah OK, sounds good. It would be interesting to know if there is a reason for it; surely addRequest must be doing this lookup under the hood.
