#Managing duplicate queries using RequestQueue but it seems off.

final canyon
#

Description

It appears that my custom RequestQueue isn't working as expected. Very few jobs are being processed, even though my RequestQueue list has many more job IDs.

import { RequestQueue } from "crawlee";

let jobQueue: RequestQueue;
async function initializeJobQueue() {
  if (!jobQueue) {
    jobQueue = await RequestQueue.open("job-deduplication-queue");
  }
}

async function fetchJobPages(page: Page, jobIds: string[], origin: string) {
  await initializeJobQueue();

  const filteredJobIds = [];
  if (saveOnlyUniqueItems) {
    for (const jobId of jobIds) {
      const jobUrl = `${origin}/viewjob?jk=${jobId}`;
      const request = await jobQueue.addRequest({ url: jobUrl });
      if (!request.wasAlreadyPresent) filteredJobIds.push(jobId);
    }
  } else {
    filteredJobIds.push(...jobIds);
  }

  myLog(
    `Filtered ${jobIds.length - filteredJobIds.length} duplicates, ` +
    `processing ${filteredJobIds.length} unique jobs.`
  );

  // fetchJobWithRetry and batching logic follows...
}

Am I using the request queue correctly? I am not using the default one from the crawler because my scraping logic does not allow it.

final canyon
#

Wow, I just found the bug: running the crawler with
apify run --purge

does not purge all the request_queues.

#

So I was storing the requests from previous runs, and thus everything was considered a duplicate.

#

How can I purge that automatically?

torpid halo
#

Hi, apify run --purge only clears the default storages, so any named request queues you create will not be removed.
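
One way to get the effect of a purge for a named queue is to drop it at the start of a run and open it again. The sketch below assumes the queue object exposes drop() (Crawlee's RequestQueue does) and that opening the same name after a drop yields an empty queue. The names openFreshQueue and open are hypothetical, introduced only for illustration.

```typescript
// Minimal shape of what this sketch needs from a storage. Crawlee's
// RequestQueue is assumed to satisfy it through its drop() method.
interface Droppable {
  drop(): Promise<void>;
}

// Hypothetical helper: opens the named storage, deletes it together with
// everything kept from previous runs, then opens it again so it starts empty.
async function openFreshQueue<Q extends Droppable>(
  name: string,
  open: (name: string) => Promise<Q>,
): Promise<Q> {
  const stale = await open(name);
  await stale.drop();
  return open(name);
}

// Possible use with Crawlee (not executed here):
//   import { RequestQueue } from "crawlee";
//   jobQueue = await openFreshQueue("job-deduplication-queue", (n) => RequestQueue.open(n));
```

Dropping on every start only makes sense if the queue is needed for a single run. If deduplication has to survive between runs, the queue must be kept.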

torpid halo
#

It really depends on the use case and the code.

If you publish an actor on the Apify platform with a named RequestQueue (e.g. "job-deduplication-queue"), the queue persists exactly as is between runs. Every new run of your actor reopens the same request queue, and nothing in it is deleted automatically. If you only need it for a single run, you should use an unnamed request queue.

Additionally, if you are just trying to avoid duplicate requests, you can set useExtendedUniqueKey or uniqueKey when enqueuing a new request.
You can find more info about these here: https://crawlee.dev/js/api/core/interface/RequestOptions#useExtendedUniqueKey
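
For the job pages in this thread, uniqueKey could be set to the job ID itself, so that two URLs pointing at the same job count as one request. This is a minimal sketch: toJobRequest and JobRequest are hypothetical names, while url and uniqueKey are fields of Crawlee's RequestOptions.

```typescript
// Subset of Crawlee's RequestOptions used in this sketch.
interface JobRequest {
  url: string;
  uniqueKey: string;
}

// Hypothetical helper: uses the job ID as the uniqueKey, so the queue
// deduplicates by job instead of by a key derived from the URL.
function toJobRequest(origin: string, jobId: string): JobRequest {
  return {
    url: `${origin}/viewjob?jk=${jobId}`,
    uniqueKey: jobId,
  };
}

// Possible use with Crawlee (not executed here):
//   const info = await jobQueue.addRequest(toJobRequest(origin, jobId));
//   if (!info.wasAlreadyPresent) filteredJobIds.push(jobId);
```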

If you need to track this across different runs, you could also keep the stored IDs in a named key-value store rather than in a request queue.
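
A rough sketch of that key-value store approach, assuming the seen IDs fit in a single record stored as an array. The names filterUnseenIds and KeyValueStoreLike are hypothetical; Crawlee's KeyValueStore is assumed to be compatible through its getValue() and setValue() methods.

```typescript
// Minimal shape of a key-value store needed by this sketch.
interface KeyValueStoreLike {
  getValue(key: string): Promise<unknown>;
  setValue(key: string, value: unknown): Promise<void>;
}

// Hypothetical helper: returns the IDs that were not seen before and
// records them in the store under `key` for later runs.
async function filterUnseenIds(
  store: KeyValueStoreLike,
  key: string,
  jobIds: string[],
): Promise<string[]> {
  const stored = await store.getValue(key);
  const seen = new Set<string>(Array.isArray(stored) ? (stored as string[]) : []);
  const fresh: string[] = [];
  for (const id of jobIds) {
    if (!seen.has(id)) {
      seen.add(id); // also removes duplicates within the same batch
      fresh.push(id);
    }
  }
  await store.setValue(key, Array.from(seen));
  return fresh;
}

// Possible use with Crawlee (not executed here; store and key names are made up):
//   import { KeyValueStore } from "crawlee";
//   const store = await KeyValueStore.open("seen-job-ids");
//   const filteredJobIds = await filterUnseenIds(store, "SEEN", jobIds);
```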
