I'm using a RunPod serverless endpoint with a worker limit of 6. The endpoint performs well, except for one error: sometimes a worker becomes "unhealthy" and HTTP requests fail with:
request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": remote error: tls: bad record MAC
OR
request failed: Post "https://api.runpod.ai/v2/s3bxj20mra4dvp/runsync": write tcp [2001:1c02:2c09:9100:7bab:2fba:21cc:6df1]:53732->[2606:4700::6812:9dd]:443: use of closed network connection
Observations:
- Requests fail at the network level (TLS or TCP errors) rather than returning an HTTP 500 status.
- A retry usually succeeds (load test: 10 concurrent requests, 5 workers online).
- Failures continue for some time, then the success rate returns to 100%.
- This has been happening for months in production; we handle it with retries and fallback endpoints.
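For context, the retry-plus-fallback handling mentioned above can be sketched roughly as follows in Go (the error strings come from a Go client). This is an illustrative sketch, not the production code: `postWithRetry`, `pickEndpoint`, `backoffDelay`, the placeholder endpoint IDs, the attempt counts, and the delays are all hypothetical, and the `Authorization: Bearer` header format is an assumption. Note that blindly retrying a POST to /runsync may run a job twice if the first request reached a worker before the connection broke.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net/http"
	"time"
)

// backoffDelay returns base doubled once per previous attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

// pickEndpoint stays on each endpoint for perEndpoint attempts before moving
// to the next one in the list; it sticks to the last endpoint once the list
// is exhausted. It returns "" for an empty list.
func pickEndpoint(endpoints []string, attempt, perEndpoint int) string {
	if len(endpoints) == 0 {
		return ""
	}
	if perEndpoint < 1 {
		perEndpoint = 1
	}
	i := attempt / perEndpoint
	if i >= len(endpoints) {
		i = len(endpoints) - 1
	}
	return endpoints[i]
}

// postWithRetry retries only when client.Do returns an error, i.e. when no
// HTTP response was received at all (TLS/TCP failures). Any HTTP response,
// including a 500, is returned to the caller unchanged.
func postWithRetry(ctx context.Context, client *http.Client, endpoints []string,
	apiKey string, body []byte, maxAttempts int) (*http.Response, error) {
	if len(endpoints) == 0 {
		return nil, fmt.Errorf("no endpoints configured")
	}
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		target := pickEndpoint(endpoints, attempt, 2)
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, target, bytes.NewReader(body))
		if err != nil {
			return nil, err
		}
		req.Header.Set("Content-Type", "application/json")
		req.Header.Set("Authorization", "Bearer "+apiKey) // assumed header format
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		// Wait before the next attempt, but give up early if the context ends.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoffDelay(attempt, 200*time.Millisecond, 5*time.Second)):
		}
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
	// Placeholder IDs; no network call is made here.
	endpoints := []string{
		"https://api.runpod.ai/v2/PRIMARY_ID/runsync",
		"https://api.runpod.ai/v2/FALLBACK_ID/runsync",
	}
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println(attempt, pickEndpoint(endpoints, attempt, 2),
			backoffDelay(attempt, 200*time.Millisecond, 5*time.Second))
	}
}
```

With one consolidated fallback, the list passed to `postWithRetry` would simply shrink to two entries; the per-endpoint attempt count is the knob that decides how quickly traffic shifts away from the primary.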
We're scaling up and want to consolidate our three fallback endpoints into one or two for better worker efficiency.
Questions:
- Does anyone recognize this pattern? Are there known solutions or workarounds?
- Can I identify which worker handled each request, so that I can kill or restart it programmatically? RunPod eventually fixes this on its own (an internal /ping check?), but it takes too long, especially during off-hours when few workers are running.
- How does RunPod queueing work? Since the request fails at the network level, is the connection actually redirected to the worker infrastructure, as opposed to an API proxy that would return a 500?
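To make the "network level vs. 500" distinction concrete on the client side, a logging helper along these lines could label each failure. It is only a sketch: `describeFailure` is a hypothetical name, and it relies on Go's `net/http` client wrapping transport failures in `*url.Error` and `*net.OpError`. It can tell whether an HTTP response arrived at all, but it cannot show which hop (API proxy or worker) dropped the connection.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// describeFailure labels a request outcome. Pass the error returned by
// http.Client.Do, or nil plus the response status code when a response
// was received.
func describeFailure(statusCode int, err error) string {
	if err == nil {
		switch {
		case statusCode >= 500:
			return fmt.Sprintf("server: HTTP %d", statusCode)
		case statusCode >= 400:
			return fmt.Sprintf("client: HTTP %d", statusCode)
		default:
			return "ok"
		}
	}
	// Socket and TLS-alert failures surface as *net.OpError.
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return fmt.Sprintf("transport (%s): %v", opErr.Op, opErr.Err)
	}
	// Other failures raised before any response was read.
	var urlErr *url.Error
	if errors.As(err, &urlErr) {
		return fmt.Sprintf("transport (%s): %v", urlErr.Op, urlErr.Err)
	}
	return "unknown: " + err.Error()
}

func main() {
	// Synthetic error shaped like the second failure quoted above.
	synthetic := &url.Error{
		Op:  "Post",
		URL: "https://example.invalid/runsync",
		Err: &net.OpError{
			Op:  "write",
			Net: "tcp",
			Err: errors.New("use of closed network connection"),
		},
	}
	fmt.Println(describeFailure(0, synthetic))
	fmt.Println(describeFailure(500, nil))
}
```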
To the RunPod team: is this
- A load balancing/health management bug fixable on your end?
- Infra limitation requiring retries?
- Misconfiguration in my endpoint/worker image?