I have a constantly updated list of companies, and it's very long. I want to use OpenClaw to search the internet for job posts at these companies, but I don't know how to do that properly: as soon as I run it on the whole list, it runs out of limits, especially for web search. I don't know what to do. How do people usually handle a web-intensive task like this?
#Very large job
Brute-forcing web_search once per company doesn’t scale — you hit search-provider quota long before you hit “compute”.
What people usually do instead is:
1) Use search only for discovery, then switch to direct sources
Do web_search once per company to find the canonical careers source (and store it), e.g.
- their own /careers page
- an ATS-hosted board (Greenhouse / Lever / Workday / etc.)
After that, your daily runs should mostly be web_fetch against the known URL(s), not new searches every time.
Why this works: “search” is expensive + quota-limited; fetching a known page is cheap and predictable.
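A minimal sketch of the "search once, then remember" idea, assuming the mapping lives in a local JSON file. The function name `resolve_careers_url`, the file layout, and the `discover` callback (whatever wraps your one-off web_search) are all hypothetical, not part of OpenClaw:

```python
import json
from pathlib import Path
from typing import Callable


def resolve_careers_url(
    company: str,
    cache_path: Path,
    discover: Callable[[str], str],
) -> str:
    """Return the saved careers URL, calling `discover` only on a cache miss."""
    # Load the company -> careers URL mapping if it already exists.
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    if company not in cache:
        # This is the only step that spends search quota.
        cache[company] = discover(company)
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[company]
```

With something like this, a company costs one search the first time it appears and zero searches on every later run.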
2) Exploit ATS patterns (huge win)
A big chunk of companies use a handful of ATS platforms that expose structured pages/JSON. If you detect “this is a Greenhouse board” or “this is Lever”, you can pull postings directly (often as JSON) and skip search entirely.
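As a rough illustration of the detection step, here is a sketch that maps a saved careers URL to a JSON feed. The function name `ats_feed_url` is hypothetical, and the two endpoint patterns are the public Greenhouse job-board and Lever postings URLs as I understand them; check the vendors' current docs before relying on them:

```python
from typing import Optional
from urllib.parse import urlparse


def ats_feed_url(careers_url: str) -> Optional[str]:
    """Map a known ATS board URL to its JSON feed, or None if unrecognized."""
    parsed = urlparse(careers_url)
    host = parsed.netloc.lower()
    parts = [p for p in parsed.path.split("/") if p]
    if not parts:
        return None
    slug = parts[0]  # first path segment is the company's board identifier
    if host in ("boards.greenhouse.io", "job-boards.greenhouse.io"):
        # Assumed endpoint pattern for Greenhouse's public job board API.
        return f"https://boards-api.greenhouse.io/v1/boards/{slug}/jobs"
    if host == "jobs.lever.co":
        # Assumed endpoint pattern for Lever's public postings API.
        return f"https://api.lever.co/v0/postings/{slug}?mode=json"
    return None  # fall back to fetching and parsing the HTML page
```

Anything that returns `None` just stays on the plain "fetch the careers page" path.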
So your pipeline becomes:
- Company → careers URL (saved)
- careers URL → postings (fetched + parsed)
- postings → diff against “already seen” → only emit new ones
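The last step (diff against "already seen") could look roughly like this. It assumes each posting is a dict with `company` and `id` keys, which is a made-up shape for illustration; real feeds will need their own key extraction:

```python
from typing import Dict, Iterable, List, Set


def new_postings(fetched: Iterable[Dict], seen_ids: Set[str]) -> List[Dict]:
    """Return only postings not seen before, and record them in `seen_ids`."""
    fresh = []
    for post in fetched:
        # Namespace the id by company so ids from different boards cannot collide.
        key = f'{post["company"]}:{post["id"]}'
        if key not in seen_ids:
            seen_ids.add(key)
            fresh.append(post)
    return fresh
```

Persist `seen_ids` between runs (a file or a small database table is enough) so that each run only emits what is actually new.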
3) Incremental + stateful runs (don’t scan the whole list every time)
Instead of “run on the whole list”, do:
- keep a queue with last_checked_at per company
- each cron run processes the next N companies (say 25–100), then stops
- repeat every hour/day
This keeps you within quotas and still keeps the list “fresh enough”.
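A sketch of that rotation, assuming the state is just a dict of company name to `last_checked_at` (with `None` for companies never checked). The helper names `next_batch` and `mark_checked` are hypothetical:

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Optional


def next_batch(last_checked_at: Dict[str, Optional[datetime]], n: int) -> List[str]:
    """Return the n companies checked longest ago; never-checked ones come first."""

    def staleness_key(company: str) -> float:
        ts = last_checked_at[company]
        return ts.timestamp() if ts is not None else float("-inf")

    return sorted(last_checked_at, key=staleness_key)[:n]


def mark_checked(
    last_checked_at: Dict[str, Optional[datetime]], companies: Iterable[str]
) -> None:
    """Stamp the processed companies so they move to the back of the queue."""
    now = datetime.now(timezone.utc)
    for company in companies:
        last_checked_at[company] = now
```

Each scheduled run would call `next_batch(state, 50)`, process those companies, then call `mark_checked` and save the state. New companies added to the list with `None` get picked up first.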
OpenClaw side: cron is the right mechanism for this kind of batch/rotation.
4) Reduce queries by batching + narrowing
If you must use search:
- only search “new companies” (ones you haven’t mapped to a careers URL yet)
- use time filters (past week/day) so results are small and you don’t re-process old posts
- restrict with site: to likely boards (this improves signal so you need fewer retries)
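For the site: restriction, a query builder can be as small as the sketch below. The function name `build_query` is hypothetical, and whether `site:` and `OR` are honored depends on your search provider, so treat the exact syntax as an assumption. Time filters are usually a separate provider-specific parameter, so they are left out here:

```python
from typing import Iterable


def build_query(company: str, sites: Iterable[str]) -> str:
    """Build one search query restricted to a few likely job-board domains."""
    site_filter = " OR ".join(f"site:{s}" for s in sites)
    # Quote the company name so multi-word names are matched as a phrase.
    return f'"{company}" jobs ({site_filter})'


# Example: one query covering two common boards instead of two separate searches.
query = build_query("Acme Corp", ["boards.greenhouse.io", "jobs.lever.co"])
```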
5) Accept that “internet-wide search at scale” usually means paying for data
If you need thousands of searches/day, people typically:
- buy a higher-quota search API plan, or
- use a jobs data provider/aggregator API, or
- run their own crawler against known sources (not general web search)
Quick questions (so I can suggest the best shape)
- Roughly how many companies? (100 / 1k / 10k?)
- How often do you need updates (hourly / daily)?
- Are you OK with “only new postings since last run” (vs full rescan)?
If you answer those, I can outline a concrete OpenClaw workflow (discovery job + monitoring job + storage format) that stays inside quotas.