# Serverless workers boot successfully but never receive jobs (confirmed in 2 regions, multiple GPUs)


red meteor

Hey Runpod & Community! Pleasure to be here. I'm hoping someone else may have seen this and discovered a potential solution.

We've been completely blocked since April 10. Our serverless endpoint worked fine for a few weeks up until Friday. Here's everything we've confirmed:

**Setup:**

  • Docker image: runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 (also tested with ghcr.io/runpod-workers/worker-comfyui:latest — same result)
  • Handler calls runpod.serverless.start({"handler": handler}) — standard pattern
  • RunPod SDK 1.9.0, all 7 fitness checks pass
  • Network volume with ComfyUI + models
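
For readers unfamiliar with the "standard pattern" mentioned above, here is a minimal sketch of what such a handler file might look like. The `handler` body and the `echo` field are hypothetical placeholders, not the poster's actual code; the only call taken from the post is `runpod.serverless.start({"handler": handler})`.

```python
def handler(job):
    """Hypothetical handler: echoes the job input back to the caller."""
    # The SDK is assumed to pass a dict carrying the request payload under "input".
    job_input = job.get("input", {})
    # Real work (for example, running a ComfyUI workflow) would go here.
    return {"echo": job_input}


def main():
    # Imported lazily so the handler can be exercised without the SDK installed.
    import runpod

    # The registration call quoted in the post.
    runpod.serverless.start({"handler": handler})
```

In a real worker, the container's entrypoint script would call `main()`. Keeping the handler importable on its own makes it easy to check locally before deploying.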

**What happens:**

  1. Container starts, ComfyUI boots (~45-70s depending on GPU)
  2. Handler starts, calls runpod.serverless.start()
  3. SDK registers, fitness checks all pass
  4. Health API shows idle: 1, ready: 1 and inQueue: 1
  5. Worker never receives the job
  6. Container gets killed, new one spins up, same cycle
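
One way to put a number on step 5 is to time how long a submitted job stays queued. The sketch below is an illustration under assumptions: `get_status` is a hypothetical callable that returns the job's current status string (for example, by querying the endpoint's job-status route), and `IN_QUEUE` is assumed to be the status reported while a job waits for a worker.

```python
import time

# Assumed status string for a job that has not been picked up yet.
QUEUED = "IN_QUEUE"


def wait_for_pickup(get_status, timeout_s=120.0, interval_s=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the job leaves the queue or the timeout elapses.

    get_status is a hypothetical zero-argument callable returning a status string.
    Returns (final_status, seconds_waited).
    """
    start = clock()
    while True:
        status = get_status()
        waited = clock() - start
        # Stop as soon as the job is picked up, or once we have waited long enough.
        if status != QUEUED or waited >= timeout_s:
            return status, waited
        sleep(interval_s)
```

If the returned status is still `IN_QUEUE` after a timeout well beyond the 45-70s boot time, that is consistent with the cycle described above. `sleep` and `clock` are injectable only so the function can be tested without real waiting.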
red meteor

**What we've ruled out:**

  • Docker image: tested with both our image and RunPod's official worker-comfyui image — identical behavior
  • GPU compatibility: tested on RTX 5090 (Blackwell), RTX 6000 Ada, A100 — all same result
  • Region: reproduced on both US-IL-1 and EU-RO-1 with separate endpoints and volumes
  • Handler code: confirmed working via local test_input.json — job runs end-to-end, fetches manifest from S3, runs ComfyUI workflow, uploads outputs. Full success!
  • SDK version: upgraded from 1.8.2 to 1.9.0, no difference
  • FlashBoot: tested both enabled and disabled
  • Idle timeout: tested at 5s, 60s, and 300s
  • Cold start time: reduced from 115s to 45s by disabling ComfyUI-Manager

**The core issue:**
The health API consistently shows a ready worker AND a queued job, but inProgress stays at 0. The platform's job dispatcher never routes the job to the worker. This is not a capacity issue — the worker is allocated and running.
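
The condition described here can be checked mechanically over repeated health snapshots. The sketch below rests on assumptions: the health URL, the bearer-token header, and the nesting of counters under `workers` and `jobs` are my best understanding of the endpoint rather than something confirmed in this thread; only the field names `idle`, `ready`, `inQueue`, and `inProgress` come from the post.

```python
import json
import urllib.request


def fetch_health(endpoint_id, api_key):
    """Fetch one health snapshot. The URL and auth header are assumptions."""
    request = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def is_stalled(health):
    """True when a worker is available, a job is queued, and nothing is running."""
    workers = health.get("workers", {})  # assumed nesting
    jobs = health.get("jobs", {})        # assumed nesting
    has_worker = workers.get("idle", 0) > 0 or workers.get("ready", 0) > 0
    return has_worker and jobs.get("inQueue", 0) > 0 and jobs.get("inProgress", 0) == 0


def longest_stall(snapshots):
    """Length of the longest run of consecutive stalled snapshots."""
    longest = current = 0
    for snapshot in snapshots:
        current = current + 1 if is_stalled(snapshot) else 0
        longest = max(longest, current)
    return longest
```

Collecting snapshots with `fetch_health` every few seconds and passing them to `longest_stall` separates a brief scheduling delay (a short run) from the persistent state reported here (a run spanning the worker's whole lifetime).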

**Evidence:**

  • Container logs show full boot through fitness checks
  • Local test via test_input.json completes successfully (3 images processed, outputs uploaded to S3)
  • This closely matches similar issues described in this thread from Feb 2025

**Endpoint IDs:**

  • cjkw9e25smia4y (US-IL-1)
  • uggnjw95k3qbvm (EU-RO-1)

Can someone from engineering look at the dispatcher/routing layer for these endpoints? Support has been responsive but hasn't been able to identify the root cause after multiple exchanges.
