#Very large job


chrome jay

I have a constantly updated list of companies that is very long. I need to use OpenClaw to search the internet for job posts at these companies, but I don't know how to do that properly: once I try to run it on the whole list, it runs out of limits, especially for web search. I don't know what to do. How do people usually handle such a web-intensive task?

glossy elm BOT

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

hasty nacelle BOT

Brute-forcing web_search once per company doesn’t scale — you hit search-provider quota long before you hit “compute”.

What people usually do instead is:

1) Use search only for discovery, then switch to direct sources

Do web_search once per company to find the canonical careers source (and store it), e.g.

  • their own /careers page
  • an ATS-hosted board (Greenhouse / Lever / Workday / etc.)

After that, your daily runs should mostly be web_fetch against the known URL(s), not new searches every time.

Why this works: “search” is expensive + quota-limited; fetching a known page is cheap and predictable.
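A minimal sketch of that "search once, then reuse" idea. It assumes a hypothetical `search_fn` callable standing in for whatever wraps web_search, and a plain JSON file as the store; neither is an OpenClaw API, they are only there to illustrate the shape:

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


def load_map(path: Path) -> Dict[str, str]:
    """Load the saved company -> careers URL mapping (empty if the file is missing)."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def save_map(path: Path, mapping: Dict[str, str]) -> None:
    """Persist the mapping so later runs can skip discovery."""
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def careers_url(
    company: str,
    mapping: Dict[str, str],
    search_fn: Callable[[str], Optional[str]],  # hypothetical search wrapper
) -> Optional[str]:
    """Return the stored careers URL, spending a search only on a cache miss."""
    if company in mapping:
        return mapping[company]
    url = search_fn(f"{company} careers")
    if url:
        mapping[company] = url
    return url
```

With this shape, search quota is spent roughly once per company rather than once per company per run.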

2) Exploit ATS patterns (huge win)

A big chunk of companies use a handful of ATS platforms that expose structured pages/JSON. If you detect “this is a Greenhouse board” or “this is Lever”, you can pull postings directly (often as JSON) and skip search entirely.
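As a rough illustration of the detection step, the sketch below maps a saved careers URL to a JSON endpoint. The URL shapes and endpoint templates for Greenhouse and Lever are assumptions based on their public job-board APIs as I understand them, so check each vendor's docs before relying on them; `detect_ats` and `postings_endpoint` are made-up helper names:

```python
import re
from typing import Optional, Tuple

# Assumed public board URL shapes; verify against the vendors' documentation.
_ATS_PATTERNS = {
    "greenhouse": re.compile(r"^https?://boards\.greenhouse\.io/([^/?#]+)"),
    "lever": re.compile(r"^https?://jobs\.lever\.co/([^/?#]+)"),
}

# Assumed JSON endpoints for each platform.
_API_TEMPLATES = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{slug}/jobs",
    "lever": "https://api.lever.co/v0/postings/{slug}?mode=json",
}


def detect_ats(careers_url: str) -> Optional[Tuple[str, str]]:
    """Return (ats_name, company_slug) if the URL matches a known board host."""
    for name, pattern in _ATS_PATTERNS.items():
        match = pattern.match(careers_url)
        if match:
            return name, match.group(1)
    return None


def postings_endpoint(careers_url: str) -> Optional[str]:
    """Return a JSON endpoint for known boards, or None to fall back to HTML fetching."""
    hit = detect_ats(careers_url)
    if hit is None:
        return None
    name, slug = hit
    return _API_TEMPLATES[name].format(slug=slug)
```

Anything that returns `None` here is a company-hosted page and still needs a regular fetch plus parsing.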

So your pipeline becomes:

  1. Company → careers URL (saved)
  2. careers URL → postings (fetched + parsed)
  3. postings → diff against “already seen” → only emit new ones
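Step 3 of the pipeline can be as small as the sketch below. It assumes each posting is a dict with a stable `"url"` key used as its identity; that key is an illustrative choice, and a platform-specific job ID would work just as well:

```python
from typing import Dict, Iterable, List, Set


def new_postings(
    postings: Iterable[Dict[str, str]],
    seen_ids: Set[str],
) -> List[Dict[str, str]]:
    """Return only postings not seen before, and record them as seen."""
    fresh = []
    for posting in postings:
        key = posting["url"]  # assumed stable identifier
        if key not in seen_ids:
            seen_ids.add(key)
            fresh.append(posting)
    return fresh
```

Persisting `seen_ids` between runs (a file or a small database) is what makes a later run emit nothing for unchanged boards.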

3) Incremental + stateful runs (don’t scan the whole list every time)

Instead of “run on the whole list”, do:

  • keep a queue with last_checked_at per company
  • each cron run processes the next N companies (say 25–100), then stops
  • repeat every hour/day

This keeps you within quotas and still keeps the list “fresh enough”.
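A sketch of the rotation logic, assuming the queue is just a dict of company name to `last_checked_at` timestamp (`None` for never checked). The function names and the in-memory dict are illustrative; a real setup would persist this state between runs:

```python
from typing import Dict, Iterable, List, Optional


def next_batch(last_checked_at: Dict[str, Optional[float]], n: int) -> List[str]:
    """Pick the n companies checked longest ago; never-checked ones come first."""
    ordered = sorted(
        last_checked_at,
        key=lambda c: (last_checked_at[c] is not None, last_checked_at[c] or 0.0),
    )
    return ordered[:n]


def mark_checked(
    last_checked_at: Dict[str, Optional[float]],
    companies: Iterable[str],
    now: float,
) -> None:
    """Record that these companies were processed at time `now`."""
    for company in companies:
        last_checked_at[company] = now
```

Each scheduled run would call `next_batch`, process those companies, then call `mark_checked`, so the whole list is covered over several runs instead of in one burst.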

OpenClaw side: cron is the right mechanism for this kind of batch/rotation.


4) Reduce queries by batching + narrowing

If you must use search:

  • only search “new companies” (ones you haven’t mapped to a careers URL yet)
  • use time filters (past week/day) so results are small and you don’t re-process old posts
  • restrict with site: to likely boards (this improves signal so you need fewer retries)
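For the `site:` restriction, a query builder could look like the sketch below. The default board hosts are assumptions, and support for `site:` and `OR` varies by search provider. Time filters are usually a separate provider-specific request parameter, so they are left out here:

```python
from typing import Sequence

# Assumed hosts of common job boards; extend to match your own company list.
DEFAULT_BOARDS = ("boards.greenhouse.io", "jobs.lever.co")


def build_query(company: str, boards: Sequence[str] = DEFAULT_BOARDS) -> str:
    """Build one narrowed query that covers several likely boards at once."""
    sites = " OR ".join(f"site:{host}" for host in boards)
    return f'"{company}" ({sites})'
```

One combined query per new company replaces several separate searches, which is where the quota saving comes from.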

5) Accept that “internet-wide search at scale” usually means paying for data

If you need thousands of searches/day, people typically:

  • buy a higher-quota search API plan, or
  • use a jobs data provider/aggregator API, or
  • run their own crawler against known sources (not general web search)

Quick questions (so I can suggest the best shape)

  1. Roughly how many companies? (100 / 1k / 10k?)
  2. How often do you need updates (hourly / daily)?
  3. Are you OK with “only new postings since last run” (vs full rescan)?

If you answer those, I can outline a concrete OpenClaw workflow (discovery job + monitoring job + storage format) that stays inside quotas.

chrome jay

I got this in OpenClaw with OpenAI Codex 5.2. How can I check exactly which limit was reached, and what the limits are in general? "run error: ⚠️ API rate limit reached. Please try again later."


?