#Backoff during cooldown

unreal shore
#

I am on openclaw version 2026-04-26.

I restarted the gateway and triggered 2 openclaw cron jobs. The cron jobs do not have any model overrides configured.

It seems that the openclaw cron jobs are hammering the gateway: it slows to a crawl, and eventually "openclaw logs --follow" shuts off with a message that the gateway might not be accessible, even though the agents are still alive in Discord.

...
10:23:08+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_agent_end","tags":["error_handling","lifecycle","agent_end","assistant_error"],"runId":"0cf1f342-5918-4e25-840a-bd6c1a3398f8","isError":true,"error":"⚠️ You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: https://www.kimi.com/code/console?from=quota-upgrade","failoverReason":"auth","model":"kimi-for-coding","provider":"kimi","rawErrorPreview":"403 {"error":{"type":"permission_error","message":"You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: https://www.kimi.com/code/console?from=quota-upgrade\"},\"type\":\"error\"}","rawErrorHash":"sha256:0c5ebffb114c","rawErrorFingerprint":"sha256:7e534ef0f2b8","httpCode":"403","providerRuntimeFailureKind":"unclassified","providerErrorType":"permission_error","providerErrorMessagePreview":"You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: https://www.kimi.com/code/console?from=quota-upgrade"} embedded run agent end

See the next post for the continuation of the log.

unreal shore
#

10:23:08+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"auth_profile_failure_state_updated","tags":["error_handling","auth_profiles","cooldown"],"runId":"0cf1f342-5918-4e25-840a-bd6c1a3398f8","profileId":"sha256:828180692adc","provider":"kimi","reason":"auth","windowType":"cooldown","windowReused":false,"previousErrorCount":1,"errorCount":1,"previousCooldownUntil":1777456678967,"cooldownUntil":1777458218899,"failureCounts":{"auth":1}} auth profile failure state updated
10:23:08+00:00 warn agent/embedded {"subsystem":"agent/embedded"} {"event":"embedded_run_failover_decision","tags":["error_handling","failover","assistant","fallback_model"],"runId":"0cf1f342-5918-4e25-840a-bd6c1a3398f8","stage":"assistant","decision":"fallback_model","failoverReason":"auth","profileFailureReason":"auth","provider":"kimi","model":"kimi-for-coding","sourceProvider":"kimi","sourceModel":"kimi-for-coding","profileId":"sha256:828180692adc","fallbackConfigured":true,"timedOut":false,"aborted":false,"status":401,"rawErrorPreview":"403 {"error":{"type":"permission_error","message":"You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: https://www.kimi.com/code/console?from=quota-upgrade\"},\"type\":\"error\"}","rawErrorHash":"sha256:0c5ebffb114c","rawErrorFingerprint":"sha256:7e534ef0f2b8","httpCode":"403","providerRuntimeFailureKind":"unclassified","providerErrorType":"permission_error","providerErrorMessagePreview":"You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: https://www.kimi.com/code/console?from=quota-upgrade"} embedded run failover decision
10:23:08+00:00 error diagnostic {"subsystem":"diagnostic"} lane task error: lane=cron-nested durationMs=219424 error="FailoverError: ⚠️ You've reached your usage limit for this billing cycle. Your quota will be refreshed in the next cycle. Upgrade to get more: ....

inland pelicanBOT
#

This smells less like “cron itself is broken” and more like cron + model fallback/retry repeatedly hitting an exhausted Kimi quota.

That log line is the key:

provider: kimi
model: kimi-for-coding
403 permission_error
You've reached your usage limit for this billing cycle

Since the cron jobs have no model override, they use the agent/default model selection + configured fallbacks. So if Kimi is your default or in agents.defaults.model.fallbacks, both cron runs can keep trying it.

Immediate containment:

openclaw cron list
openclaw cron edit <job-id> --disable

Then inspect the actual model path / retries:

openclaw cron show <job-id>
openclaw cron runs --id <job-id> --limit 20
openclaw config get agents.defaults.model
openclaw models status
openclaw cron status

If Kimi quota is exhausted, either:

openclaw cron edit <job-id> --model "<provider/model-that-has-quota>"

or remove Kimi from the default/fallback chain until quota resets.

Also worth checking/setting concurrency low:

openclaw config set cron.maxConcurrentRuns 1
openclaw gateway restart

Docs: https://docs.openclaw.ai/automation/cron-jobs and https://docs.openclaw.ai/concepts/model-failover

One important bit: the log says providerRuntimeFailureKind:"unclassified" while the text clearly looks quota/usage-limit-ish. If this still happens after updating past 2026-04-26, I’d file a GitHub issue with the redacted log + openclaw status --all: https://github.com/openclaw/openclaw/issues 🦐

#

Yep, this continuation confirms the important bit:

  • It is classifying Kimi’s “usage limit for this billing cycle” as failoverReason:"auth".
  • That puts the profile into a short auth cooldown — here only until 2026-04-29T10:23:38Z, about 30s later.
  • Then cron/fallback can try again, so two cron jobs can churn badly.

One nuance: this specific snippet does not show retrying during an active cooldown — previousCooldownUntil was 09:57:58Z, already expired. But it does show the core problem: this should probably be treated as a quota/billing-ish failure, not ordinary auth.
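
The timestamps quoted above can be checked directly from the epoch-millisecond values in the log. A minimal sketch, assuming the log line was written on 2026-04-29 (the log prints only the time of day, so the date is inferred from the epoch values):

```typescript
// Epoch-millisecond values copied from the auth_profile_failure_state_updated log line.
const previousCooldownUntil = 1777456678967;
const cooldownUntil = 1777458218899;

// Assumption: the log line's "10:23:08+00:00" falls on 2026-04-29.
const loggedAt = Date.parse("2026-04-29T10:23:08Z");

const toIso = (ms: number): string => new Date(ms).toISOString();

console.log(toIso(previousCooldownUntil)); // 2026-04-29T09:57:58.967Z
console.log(toIso(cooldownUntil));         // 2026-04-29T10:23:38.899Z

// The previous window had already expired when this failure was recorded.
const previousExpired = previousCooldownUntil < loggedAt;

// Length of the new cooldown window in whole seconds.
const windowSeconds = Math.floor((cooldownUntil - loggedAt) / 1000);
console.log(previousExpired, windowSeconds); // true 30
```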

Immediate mitigation:

openclaw cron edit <job-id> --disable
# or move the jobs away from Kimi:
openclaw cron edit <job-id> --model "<provider/model-with-quota>"

Also keep:

openclaw config set cron.maxConcurrentRuns 1
openclaw gateway restart

For logs while RPC is unhappy, use the file tail fallback:

tail -f "$(ls -t /tmp/openclaw/openclaw-*.log | head -1)"

I’d file this as a GitHub issue with title like: “Kimi billing-cycle usage limit is classified as auth, causing short cron cooldown retry loop” and include these redacted lines + openclaw status --all: https://github.com/openclaw/openclaw/issues

unreal shore
#

What I would like is for openclaw to respect a model's response when it says the quota has been reached and gives a time when it will be available again.

Instead of hammering the same model, it should just pause that model until the time given by the API has passed.

inland pelicanBOT
#

Agreed — that’s the behavior I’d expect too.

In this log, Kimi returns:

HTTP 403 permission_error
You've reached your usage limit for this billing cycle

OpenClaw is treating that as ordinary auth, so it only gets a short auth cooldown. That’s why it can come back quickly and try Kimi again.

Better behavior would be:

  1. Detect quota/window exhaustion even when the provider wraps it as 403 permission_error.
  2. If the API gives Retry-After, reset_at, x-ratelimit-reset, etc., set cooldownUntil to that exact time.
  3. If it only says “next billing cycle” with no timestamp, use a longer billing/usage-window backoff instead of a 30s auth cooldown.
  4. Don’t let concurrent cron jobs all probe the same known-exhausted provider.
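
Point 4 can be illustrated with a small in-memory gate that every job consults before calling a provider. This is only a sketch: `CooldownGate` and its methods are hypothetical names, not OpenClaw APIs, and a real fix would need to persist the state so that separate processes share it.

```typescript
// Hypothetical in-memory gate; not part of OpenClaw.
class CooldownGate {
  private until = new Map<string, number>();

  // Record that a provider is exhausted until the given epoch-ms time.
  // An existing, later deadline is never shortened.
  block(provider: string, untilMs: number): void {
    const current = this.until.get(provider) ?? 0;
    this.until.set(provider, Math.max(current, untilMs));
  }

  // Remaining wait in ms, or 0 if the provider may be tried now.
  remainingMs(provider: string, now: number = Date.now()): number {
    const untilMs = this.until.get(provider) ?? 0;
    return Math.max(0, untilMs - now);
  }
}

const gate = new CooldownGate();
const now = 1000000; // fixed clock value so the example is deterministic

// The first cron job hits the quota error and blocks the provider for one hour.
gate.block("kimi", now + 60 * 60 * 1000);

// The second cron job checks the gate five seconds later and skips the provider.
const waitMs = gate.remainingMs("kimi", now + 5000);
console.log(waitMs); // 3595000
```

The point of the design is that only the first failing job pays for the discovery; every later job sees a non-zero wait and moves on to a fallback without sending a request.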

unreal shore
#

Okay. That I can do as well. But I would like to fix it on my system as well.

I found this "allowTransientCooldownProbe". Can I reach it from somewhere in the config file?

inland pelicanBOT
#

No — allowTransientCooldownProbe looks like an internal runtime flag, not a config key. I don’t see it exposed in the config reference.

It’s set by fallback logic when OpenClaw decides “all profiles for this provider are cooldowned, but this looks transient, so allow one probe.” It’s not something you can turn off in openclaw.json.

For your case, the annoying bit is that Kimi’s quota message is being classified as auth, so the exposed knobs like:

auth: {
  cooldowns: {
    rateLimitedProfileRotations: 0,
    overloadedProfileRotations: 0,
    billingBackoffHoursByProvider: { kimi: 24 },
  }
}

probably won’t fully fix this, because they only help if the error lands in rate_limit / overloaded / billing, not plain auth.
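
Why the billing knob has no effect here can be shown with a toy model. This is not OpenClaw's code: `pickBackoffMs` is a hypothetical function, the 30-second value is simply the auth window observed in the log above, and the 24 hours comes from the example config.

```typescript
type Reason = "auth" | "billing";

interface CooldownConfig {
  billingBackoffHoursByProvider: Record<string, number>;
}

// Assumption: taken from the ~30 s auth cooldown seen in the log, not from OpenClaw source.
const OBSERVED_AUTH_COOLDOWN_MS = 30000;

// Toy model: the per-provider billing override is only consulted for "billing" failures.
function pickBackoffMs(reason: Reason, provider: string, cfg: CooldownConfig): number {
  if (reason === "billing") {
    const hours = cfg.billingBackoffHoursByProvider[provider];
    if (hours !== undefined) return hours * 60 * 60 * 1000;
  }
  return OBSERVED_AUTH_COOLDOWN_MS;
}

const cfg: CooldownConfig = { billingBackoffHoursByProvider: { kimi: 24 } };

console.log(pickBackoffMs("billing", "kimi", cfg)); // 86400000
console.log(pickBackoffMs("auth", "kimi", cfg));    // 30000
```

As long as the quota error is labelled `auth`, the second call is the one that applies, whatever the billing override says.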

#

Best local workaround:

openclaw cron edit <job-id> --model "<non-kimi-model>"

or remove Kimi from agents.defaults.model.fallbacks until quota resets.

Also keep cron concurrency low:

openclaw config set cron.maxConcurrentRuns 1
openclaw gateway restart

So short answer: not configurable today; this likely needs a classifier fix in OpenClaw. Docs: https://docs.openclaw.ai/gateway/configuration-reference#authcooldowns

unreal shore
#

Thanks. Do you have some kind of pseudo-code for your proposed solution so that I can try to implement it myself?

Better behavior would be:

  1. Detect quota/window exhaustion even when the provider wraps it as 403 permission_error.
  2. If the API gives Retry-After, reset_at, x-ratelimit-reset, etc., set cooldownUntil to that exact time.
  3. If it only says “next billing cycle” with no timestamp, use a longer billing/usage-window backoff instead of a 30s auth cooldown.
  4. Don’t let concurrent cron jobs all probe the same known-exhausted provider.

inland pelicanBOT
#

Yep. I’d implement it as classifier + retry-until extraction + persisted cooldown override. Rough sketch:

// 1. Extract the provider's reset time from headers/body.
// tryParseJsonBody and parseUnixOrIsoTime are placeholder helpers you would need to write.
function extractRetryUntil(error: any): number | null {
  const now = Date.now();

  const retryAfter = error.headers?.["retry-after"];
  if (retryAfter) {
    // Retry-After is either delta-seconds ("120") or an HTTP date.
    const ms = /^\d+$/.test(retryAfter)
      ? now + Number(retryAfter) * 1000
      : Date.parse(retryAfter); // NaN for unparseable input, which fails the check below
    if (ms > now) return ms;
  }

  const json = tryParseJsonBody(error);

  // Field and header names vary by provider; these are common guesses.
  const candidates = [
    json?.reset_at,
    json?.resetAt,
    json?.rate_limit_reset,
    error.headers?.["x-ratelimit-reset"],
    error.headers?.["x-rate-limit-reset"],
  ];

  for (const value of candidates) {
    const ms = parseUnixOrIsoTime(value);
    if (ms && ms > now) return ms;
  }

  return null;
}
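
The sketch above leaves `parseUnixOrIsoTime` undefined. One possible version is shown here as an illustration only: the function is hypothetical, and the seconds-versus-milliseconds cutoff is a heuristic assumption, not something any provider documents.

```typescript
// Hypothetical helper: returns epoch milliseconds, or null if the value is unusable.
function parseUnixOrIsoTime(value: unknown): number | null {
  if (typeof value === "number" && Number.isFinite(value)) {
    return normalizeUnix(value);
  }
  if (typeof value === "string") {
    const trimmed = value.trim();
    if (trimmed === "") return null;
    // Purely numeric strings are treated as Unix timestamps.
    if (/^\d+(\.\d+)?$/.test(trimmed)) {
      return normalizeUnix(Number(trimmed));
    }
    // Everything else goes through the built-in date parser (ISO 8601, HTTP dates).
    const parsed = Date.parse(trimmed);
    return Number.isNaN(parsed) ? null : parsed;
  }
  return null;
}

// Heuristic (an assumption): values below 1e11 are seconds, larger ones are already ms.
function normalizeUnix(n: number): number | null {
  if (n <= 0) return null;
  return n < 1e11 ? Math.round(n * 1000) : Math.round(n);
}

console.log(parseUnixOrIsoTime(1777458218));             // 1777458218000
console.log(parseUnixOrIsoTime("1777458218899"));        // 1777458218899
console.log(parseUnixOrIsoTime("2026-04-29T10:23:38Z")); // 1777458218000
console.log(parseUnixOrIsoTime("soon"));                 // null
```

Some providers send `x-ratelimit-reset` as a number of seconds from now rather than an absolute time. This helper would read such a value as a moment shortly after 1970, which the caller's `ms > now` check then discards, so that case would need separate handling.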

Then classification:

// stringify and existingClassifier are placeholders for whatever the codebase already has.
function classifyProviderFailure(error: any, provider: string): {
  reason: string; // "auth" | "rate_limit" | "billing" | "overloaded" | ...
  retryUntil?: number;
} {
  const raw = stringify(error).toLowerCase();
  // extractRetryUntil returns null when nothing was found; normalize to undefined.
  const retryUntil = extractRetryUntil(error) ?? undefined;

  // Must happen before the generic 401/403 => auth rule.
  if (
    raw.includes("usage limit") ||
    raw.includes("quota will be refreshed") ||
    raw.includes("billing cycle") ||
    raw.includes("quota-upgrade")
  ) {
    return {
      // If provider gives exact reset, this is basically a rate-limit window.
      // If not, use billing-ish long backoff.
      reason: retryUntil ? "rate_limit" : "billing",
      retryUntil,
    };
  }

  if ((error.status === 401 || error.status === 403) && !retryUntil) {
    return { reason: "auth" };
  }

  if (retryUntil) {
    return { reason: "rate_limit", retryUntil };
  }

  return existingClassifier(error, provider);
}
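
To see that the substring check alone catches the Kimi response, the matching step can be pulled out and run against the message text from the log. `looksLikeQuotaExhaustion` is a hypothetical name used only for this illustration, and the second message is made up for contrast.

```typescript
// Substrings taken from the classifier sketch above.
const QUOTA_MARKERS = [
  "usage limit",
  "quota will be refreshed",
  "billing cycle",
  "quota-upgrade",
];

// True if the raw provider error text looks like quota exhaustion.
function looksLikeQuotaExhaustion(rawError: string): boolean {
  const raw = rawError.toLowerCase();
  return QUOTA_MARKERS.some((marker) => raw.includes(marker));
}

// Message text copied from the Kimi 403 response in the log.
const kimiMessage =
  "You've reached your usage limit for this billing cycle. " +
  "Your quota will be refreshed in the next cycle.";

// An ordinary credential failure, invented for contrast.
const badKeyMessage = "Invalid API key provided.";

console.log(looksLikeQuotaExhaustion(kimiMessage));   // true
console.log(looksLikeQuotaExhaustion(badKeyMessage)); // false
```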

Then change markAuthProfileFailure(...) to accept an override:

await markAuthProfileFailure({
  store,
  profileId,
  reason,
  retryUntil, // new optional field
  cfg,
  agentDir,
  runId,
  modelId,
});

Inside the usage-state update:

function computeNextProfileUsageStats(params) {
  const stats = existingLogic(params);

  if (params.retryUntil && params.retryUntil > Date.now()) {
    if (params.reason === "billing" || params.reason === "auth_permanent") {
      stats.disabledUntil = maxExistingWindow(stats.disabledUntil, params.retryUntil);
      stats.disabledReason = params.reason;
      stats.cooldownUntil = undefined;
    } else {
      stats.cooldownUntil = maxExistingWindow(stats.cooldownUntil, params.retryUntil);
      stats.cooldownReason = params.reason;
      stats.cooldownModel = params.modelId;
    }
  }
  return stats;
}
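
`maxExistingWindow` is not defined in the sketch above. A plausible reading is "keep whichever deadline is later", so a provider-supplied reset time never shortens a window that is already in place. A minimal version under that assumption:

```typescript
// Hypothetical helper: returns the later of an optional existing deadline and a new one.
function maxExistingWindow(existing: number | undefined, candidate: number): number {
  return existing === undefined ? candidate : Math.max(existing, candidate);
}

console.log(maxExistingWindow(undefined, 5000)); // 5000
console.log(maxExistingWindow(9000, 5000));      // 9000
console.log(maxExistingWindow(3000, 5000));      // 5000
```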

And for the concurrent-cron issue:

// Before starting a provider attempt:
const latestStore = await loadAuthProfileStore(agentDir);

if (isProviderOrProfileCoolingDown(latestStore, provider, modelId)) {
  throw new FailoverError("Provider is cooling down", {
    reason: resolvedCooldownReason,
    retryUntil: resolvedCooldownUntil,
  });