#Model failover doesn't trigger when provider hangs without returning an error


frosty relic

Hey any help with this?

rocky crystal BOT

Based on the docs and my research, here's the situation:

How Failover Works Currently

Clawdbot's failover mechanism handles:

  1. Auth profile rotation within a provider (on auth/rate-limit errors)
  2. Model fallback to the next model in agents.defaults.model.fallbacks
  3. Cooldowns for "a timeout that looks like rate limiting" (exponential backoff: 1m → 5m → 25m → 1h cap)
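The schedule quoted in item 3 (1m → 5m → 25m → 1h cap) is consistent with multiplying the cooldown by 5 on each consecutive failure and capping it at 60 minutes. The sketch below works under that assumption only; `cooldown_minutes` is a hypothetical helper for illustration, not Clawdbot's actual code.

```python
def cooldown_minutes(consecutive_failures: int) -> int:
    """Cooldown in minutes after N consecutive failures.

    Assumed rule (inferred from the quoted schedule, not from source code):
    start at 1 minute, multiply by 5 per additional failure, cap at 60.
    """
    if consecutive_failures < 1:
        return 0  # no failures yet, so no cooldown
    return min(60, 5 ** (consecutive_failures - 1))


# Print the first few steps of the assumed schedule.
for n in range(1, 6):
    print(f"failure {n}: wait {cooldown_minutes(n)} min")
```

Under this rule the fourth failure would be 125 minutes uncapped, which is why the cap takes over from that point.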

The Problem With "Hangs"

The issue you're describing (a provider that hangs without returning any error) touches on a subtle distinction:

  • Timeouts that surface as errors (the provider returns a timeout status, or a request deadline expires and raises an error) → trigger cooldown/failover ✅
  • Silent hangs (the connection opens but nothing comes back, or name resolution stalls, and no deadline ever fires) → may not trigger proper failover ❌

The docs say "a timeout that looks like rate limiting" triggers cooldown, but a true "hang" (where the provider never responds at all) produces no error to classify, so it might not be detected.
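One generic way to make a silent hang eligible for failover is to put a hard deadline around every provider call, so that "no response" turns into an ordinary error. The following is a minimal sketch of that technique, not a description of how Clawdbot is implemented; `call_with_failover`, `hanging_provider`, and `healthy_provider` are hypothetical names invented for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple

# Hypothetical shape of a provider call: takes a prompt, returns text.
ProviderCall = Callable[[str], Awaitable[str]]


async def call_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, ProviderCall]],
    timeout_s: float,
) -> Tuple[str, str]:
    """Try providers in order; treat a missed deadline like any other failure."""
    errors: List[str] = []
    for name, call in providers:
        try:
            # wait_for cancels the call and raises if the deadline passes.
            reply = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, reply
        except asyncio.TimeoutError:
            errors.append(f"{name}: no response within {timeout_s}s")
        except Exception as exc:  # explicit provider errors also fall through
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def hanging_provider(prompt: str) -> str:
    """Simulates a hang: waits on an event that is never set."""
    await asyncio.Event().wait()
    return "unreachable"


async def healthy_provider(prompt: str) -> str:
    """Simulates a provider that answers immediately."""
    return f"echo: {prompt}"
```

Calling `asyncio.run(call_with_failover("hi", [("a", hanging_provider), ("b", healthy_provider)], 0.1))` skips the hung provider after the deadline and returns the reply from `"b"`. Without the deadline, the first call would simply never finish.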


What To Check

  1. Check your logs for timeout-related entries:

    clawdbot logs --follow | grep -iE "timeout|hang|fetch"
    
  2. Verify model fallbacks are configured:

    {
      "agents": {
        "defaults": {
          "model": {
            "primary": "provider/model-a",
            "fallbacks": ["provider/model-b", "provider/model-c"]
          }
        }
      }
    }

  3. Check for provider-specific timeouts: some providers may need an explicit timeout configured.
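The fallback check in step 2 can also be done with a short script instead of by eye. This sketch assumes the config is strict JSON and that you pass its contents in yourself (the config file location is not given in this thread); `read_model_chain` is a hypothetical helper, not part of Clawdbot.

```python
import json
from typing import List, Optional, Tuple


def read_model_chain(config_text: str) -> Tuple[Optional[str], List[str]]:
    """Return (primary, fallbacks) from agents.defaults.model, tolerating missing keys."""
    config = json.loads(config_text)
    model = config.get("agents", {}).get("defaults", {}).get("model", {})
    return model.get("primary"), list(model.get("fallbacks") or [])


# Example input mirroring the config shown in step 2.
example = """
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "provider/model-a",
        "fallbacks": ["provider/model-b", "provider/model-c"]
      }
    }
  }
}
"""

primary, fallbacks = read_model_chain(example)
if not fallbacks:
    print("no fallbacks configured: failover has nowhere to go")
else:
    print(f"primary={primary}, fallbacks={fallbacks}")
```

An empty or missing `fallbacks` list would explain a request that stalls on the primary model regardless of how hangs are detected.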

Workarounds

  • Add multiple models to your fallbacks list
  • Use providers with more reliable uptime
  • Monitor logs for hanging behavior

This might be a bug: if the provider genuinely hangs without any error response, failover should still trigger. You might want to open a GitHub issue with:

  • Gateway version (clawdbot --version)
  • Relevant log snippets
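A reproducible case makes such a report much easier to act on. The sketch below starts a local TCP endpoint that accepts connections and then never sends a byte, which is the "silent hang" described above. `start_hanging_server` is a hypothetical helper; whether and how a provider's endpoint can be pointed at it depends on your setup and is not covered here.

```python
import socket
import threading
from typing import List, Tuple


def start_hanging_server(host: str = "127.0.0.1", port: int = 0) -> Tuple[socket.socket, int]:
    """Listen on host:port (0 = pick a free port), accept clients, never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()

    held: List[socket.socket] = []  # keep references so connections stay open

    def accept_forever() -> None:
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # listening socket was closed
            held.append(conn)  # accept, then stay silent

    threading.Thread(target=accept_forever, daemon=True).start()
    return server, server.getsockname()[1]
```

A client that connects to the returned port with a read timeout set will see the connection succeed and the read time out, while a client without a timeout will block indefinitely. That difference is the behavior worth capturing in the log snippets.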