Managing duplicate queries using RequestQueue but it seems off. | Apify & Crawlee | Page 1

final canyon Jul 17, 2025, 8:11 PM

#

Description

It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though my RequestQueue list has many more job IDs.

import { RequestQueue } from "crawlee";

let jobQueue: RequestQueue;
async function initializeJobQueue() {
  if (!jobQueue) {
    jobQueue = await RequestQueue.open("job-deduplication-queue");
  }
}

async function fetchJobPages(page: Page, jobIds: string[], origin: string) {
  await initializeJobQueue();

  const filteredJobIds = [];
  if (saveOnlyUniqueItems) {
    for (const jobId of jobIds) {
      const jobUrl = `${origin}/viewjob?jk=${jobId}`;
      const request = await jobQueue.addRequest({ url: jobUrl });
      if (!request.wasAlreadyPresent) filteredJobIds.push(jobId);
    }
  } else {
    filteredJobIds.push(...jobIds);
  }

  myLog(
    `Filtered ${jobIds.length - filteredJobIds.length} duplicates, ` +
    `processing ${filteredJobIds.length} unique jobs.`
  );

  // fetchJobWithRetry and batching logic follows...
}

Am I using the request queue correctly? I am not using the crawler's default one because my scraping logic does not allow it.

final canyon Jul 17, 2025, 9:37 PM

#

Wow, I just found the bug. Running the crawler with
apify run --purge

does not purge all the request queues.

#

So requests from previous runs were still stored in the queue, and everything was treated as a duplicate.

#

How can I purge that automatically?

frozen ridge Jul 18, 2025, 3:25 PM

#

final canyon how to purge that automatically?

apify run --purge does that, but sometimes it doesn't work (rarely; I have only run into it in some runs). You can use rm to delete the storage folder before the run.
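
The rm approach can also be scripted so that it runs before every local start. This is a minimal sketch, assuming the default local storage layout where each named queue lives under ./storage/request_queues/<queue name>; the helper name purgeLocalQueue and its storageDir parameter are hypothetical and not part of Crawlee or the Apify CLI.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: deletes the on-disk folder of one named request queue.
// Assumes the local layout <storageDir>/request_queues/<queueName>.
function purgeLocalQueue(queueName: string, storageDir = "./storage"): boolean {
  const queueDir = path.join(storageDir, "request_queues", queueName);
  const existed = fs.existsSync(queueDir);
  // recursive + force: remove the folder with its contents and ignore a missing folder.
  fs.rmSync(queueDir, { recursive: true, force: true });
  // Report whether there was anything to delete.
  return existed;
}
```

Calling purgeLocalQueue("job-deduplication-queue") before the queue is opened gives a clean start for local runs only; it does nothing for queues stored on the Apify platform.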

final canyon Jul 19, 2025, 3:01 PM

#

frozen ridge apify run --purge does that but sometimes it doesn't work(rarerly only in some r...

On my end, it only deletes the default folder, not the custom one.

torpid halo Jul 22, 2025, 2:21 PM

#

Hi, apify run --purge only clears the default storages, so any named request queues you create will not be removed.
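
Since --purge leaves named queues alone, one option is to reset the queue from code at startup. The sketch below assumes that dropping the queue and reopening it under the same name yields an empty queue; openFreshQueue is a hypothetical helper, and the small interface only describes the one method it relies on (Crawlee's RequestQueue has a drop() method).

```typescript
// Minimal shape needed here: something that can delete itself and its contents.
interface DroppableQueue {
  drop(): Promise<void>;
}

// Hypothetical helper: drops whatever a previous run left under `name`,
// then opens the queue again so the caller starts from an empty one.
async function openFreshQueue<Q extends DroppableQueue>(
  open: (name: string) => Promise<Q>,
  name: string,
): Promise<Q> {
  const stale = await open(name);
  await stale.drop();
  return open(name);
}

// Assumed usage with Crawlee:
// const jobQueue = await openFreshQueue((n) => RequestQueue.open(n), "job-deduplication-queue");
```

Resetting at startup gives up deduplication across runs, and two runs that share the queue name at the same time would wipe each other's state, so this only fits the single-run case.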

final canyon Jul 22, 2025, 2:50 PM

#

torpid halo Hi, apify run --purge only clears the default storages, so any named request que...

Will it cause problems if the actor is published on Apify for users?
Will the non-default storage be deleted on every new run on Apify, or will it persist in the same folder across runs?

Also, is this the correct way to manage duplicate queries for my crawler without relying on the crawler's default request queue? I.e., is it safe from race conditions?

torpid halo Jul 23, 2025, 8:39 AM

#

It really depends on the use case and the code.

If you publish an actor on the Apify platform with a named RequestQueue (e.g. "job-deduplication-queue"), that queue will persist exactly as is between runs. Every new invocation of your actor will reopen the same request queue, and nothing in it is deleted automatically. If you only need it for a single run, you should use the unnamed request queue.

Additionally, if you are just trying to avoid duplicate requests, you can use useExtendedUniqueKey or uniqueKey when enqueuing new requests.
You can get more info about these here https://crawlee.dev/js/api/core/interface/RequestOptions#useExtendedUniqueKey
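
As an illustration of the uniqueKey option: by default the key is derived from the URL, so setting it explicitly lets the job ID decide what counts as a duplicate. This is a minimal sketch; toJobRequest and the "job:" key prefix are hypothetical, and the returned object is only assumed to be passed to addRequest as in the original snippet.

```typescript
// A URL plus an explicit uniqueKey that the queue uses for deduplication.
interface JobRequest {
  url: string;
  uniqueKey: string;
}

// Hypothetical helper: keys each request by its job ID, so the same job counts
// as one request even when it is reached through a different origin.
function toJobRequest(origin: string, jobId: string): JobRequest {
  return {
    url: `${origin}/viewjob?jk=${encodeURIComponent(jobId)}`,
    uniqueKey: `job:${jobId}`,
  };
}

// Assumed usage: await jobQueue.addRequest(toJobRequest(origin, jobId));
```

With this key, the same job ID found under two different origins produces two URLs but a single uniqueKey, so only the first one is enqueued.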

If you need to track this across different runs, you could also use a named key-value store holding the stored IDs rather than a request queue.
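
A rough sketch of the key-value store variant follows. The IdStore interface, the SEEN_JOB_IDS record key, and all three function names are hypothetical; the interface is only assumed to be structurally compatible with the getValue and setValue methods of Crawlee's KeyValueStore.

```typescript
// Minimal shape needed here: read and write one record by key.
interface IdStore {
  getValue(key: string): Promise<unknown>;
  setValue(key: string, value: string[]): Promise<void>;
}

const SEEN_KEY = "SEEN_JOB_IDS"; // hypothetical record key

// Loads the IDs saved by earlier runs; a missing or malformed record means "nothing seen yet".
async function loadSeenIds(store: IdStore): Promise<Set<string>> {
  const stored = await store.getValue(SEEN_KEY);
  return new Set<string>(Array.isArray(stored) ? (stored as string[]) : []);
}

// Synchronous on purpose: there is no await between the check and the insert,
// so two callers in the same process cannot both claim the same ID.
function claimUnseen(seen: Set<string>, jobIds: string[]): string[] {
  const fresh: string[] = [];
  for (const id of jobIds) {
    if (!seen.has(id)) {
      seen.add(id);
      fresh.push(id);
    }
  }
  return fresh;
}

// Writes the whole set back as a single record.
async function saveSeenIds(store: IdStore, seen: Set<string>): Promise<void> {
  await store.setValue(SEEN_KEY, Array.from(seen));
}
```

The in-memory set only protects a single process. If several runs read and write the same record concurrently, the last write wins and IDs claimed by the other run can be lost, so this sketch assumes one run at a time.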

final canyon Jul 24, 2025, 8:42 PM

#

torpid halo It really depends on the use case and the code. If you would publish an actor ...

Is the key-value store safe from race conditions?
