fierDeToiMonGrand
fierDeToiMonGrand•2mo ago

Managing duplicate requests using a custom RequestQueue, but it seems off.

Description: It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though the list of job IDs I feed it is much longer.
import { RequestQueue } from "crawlee";
import type { Page } from "playwright"; // or "puppeteer", depending on which driver the crawler uses

let jobQueue: RequestQueue;

async function initializeJobQueue() {
    if (!jobQueue) {
        jobQueue = await RequestQueue.open("job-deduplication-queue");
    }
}

async function fetchJobPages(page: Page, jobIds: string[], origin: string) {
    await initializeJobQueue();

    // saveOnlyUniqueItems and myLog are defined elsewhere in the actor.
    const filteredJobIds: string[] = [];
    if (saveOnlyUniqueItems) {
        for (const jobId of jobIds) {
            const jobUrl = `${origin}/viewjob?jk=${jobId}`;
            // addRequest() reports whether this URL was already enqueued.
            const request = await jobQueue.addRequest({ url: jobUrl });
            if (!request.wasAlreadyPresent) filteredJobIds.push(jobId);
        }
    } else {
        filteredJobIds.push(...jobIds);
    }

    myLog(
        `Filtered ${jobIds.length - filteredJobIds.length} duplicates, ` +
            `processing ${filteredJobIds.length} unique jobs.`
    );

    // fetchJobWithRetry and batching logic follows...
}
Am I using the RequestQueue correctly? I am not using the default one from the crawler because my scraping logic does not allow it.
fierDeToiMonGrand
fierDeToiMonGrandOP•2mo ago
Wow, I just saw the bug. Running the crawler with apify run --purge is not purging all the request_queues, so I was keeping requests from previous runs and everything was being treated as a duplicate. How do I purge that automatically?
thenetaji
thenetaji•2mo ago
apify run --purge does that, but sometimes it doesn't work (rarely, only in some runs I ran into that). You can use rm to delete it before the run.
fierDeToiMonGrand
fierDeToiMonGrandOP•2mo ago
On my end it only deletes the default folder, not the custom one.
Lukas Celnar
Lukas Celnar•2mo ago
Hi, apify run --purge only clears the default storages, so any named request queues you create will not be removed.
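If you want the named queue to start empty on every run, a minimal workaround (sketch only, assuming it is fine to discard whatever the previous run left behind) is to drop and reopen the queue yourself before enqueuing anything:

import { RequestQueue } from "crawlee";

// Sketch: clear the named queue at the start of each run.
// drop() deletes the queue and everything in it; reopening creates it fresh.
async function resetJobQueue(): Promise<RequestQueue> {
    const stale = await RequestQueue.open("job-deduplication-queue");
    await stale.drop();
    return RequestQueue.open("job-deduplication-queue");
}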
fierDeToiMonGrand
fierDeToiMonGrandOP•2mo ago
Will it cause problems if the actor is shipped on Apify for users? Will the non-default storage be deleted on every new run on Apify, or will it remain in the same folder across runs? Also, is this the correct way to manage duplicate requests for my crawler without relying on the default request queue from the crawler, i.e. is it safe from race conditions?
Lukas Celnar
Lukas Celnar•2mo ago
It really depends on the use case and the code. If you publish an actor on the Apify platform with a named RequestQueue (e.g. "job-deduplication-queue"), it will persist exactly as is between runs: every new invocation of your actor reopens the same request queue, and nothing in it is deleted automatically. If you only need it for a single run, you should use the unnamed (default) request queue. Additionally, if you are just trying to avoid duplicate requests, you can use useExtendedUniqueKey or uniqueKey when enqueuing new requests. You can get more info about these here: https://crawlee.dev/js/api/core/interface/RequestOptions#useExtendedUniqueKey If you need to track this across different runs, you could also use a named key-value store with the stored IDs rather than a request queue.
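For example, a rough sketch of the uniqueKey variant, meant to drop into the loop from the original snippet (jobQueue, origin, jobId and filteredJobIds come from there):

// Deduplicate by job ID rather than by the full URL, so query-string noise
// cannot defeat the check; wasAlreadyPresent still signals a duplicate.
const info = await jobQueue.addRequest({
    url: `${origin}/viewjob?jk=${jobId}`,
    uniqueKey: jobId,
});
if (!info.wasAlreadyPresent) filteredJobIds.push(jobId);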
fierDeToiMonGrand
fierDeToiMonGrandOP•2mo ago
Is the key-value store safe from race conditions?
Lukas Celnar
Lukas Celnar•2mo ago
No. Two runs can overwrite the same key if they write at the same time.
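To illustrate the problem, a rough lost-update sketch (the "seen-job-ids" store and "ids" key are hypothetical names): if two runs execute this concurrently, both read the same snapshot and the last setValue silently overwrites the other run's additions.

import { KeyValueStore } from "crawlee";

// Both runs read the same list, each appends its own ID, and whichever
// setValue() lands last wins; the other run's ID is lost.
const store = await KeyValueStore.open("seen-job-ids");
const seen = (await store.getValue<string[]>("ids")) ?? [];
seen.push("new-job-id");
await store.setValue("ids", seen);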
