# Container Auto-Shutdown After 10 Minutes Despite Maximum Timeout Settings


random swallow
#

Hello RunPod Support Team,

I'm experiencing an issue with our serverless deployment where containers automatically shut down after approximately 10 minutes of uptime, then restart with a new container, despite having configured the timeout settings to their maximum values:

Current Configuration:

Idle Timeout: 3600 seconds (maximum)
Execution Timeout: 3600 seconds (maximum, enabled)

Issue:
The container still shuts down after ~10 minutes and spawns a new instance, which disrupts our workflow.

Our Use Case:
We're running a ComfyUI image for an online image generation service with the following architecture:

Users upload images through our backend
User data is stored in Kafka
GPU workers pull tasks from Kafka for inference
Results are sent back to the backend via callback

Questions:

Is there a way to keep containers running continuously without automatic shutdown?
Given our Kafka-based task queue architecture, would Serverless be the appropriate solution, or should we consider switching to Pod services instead?
We've noticed that Pod availability for L40 and 4090 is often very limited (single-digit availability). Are there any recommendations for ensuring stable GPU availability?

abstract scaffoldBOT
latent jackal
#

Can you share your endpoint ID?

#

What's your handler code like?

#

Please send the full handler code if you can.

#

Is it possible that ComfyUI hasn't finished loading within ~10 minutes, and the worker then fails?

#

To improve availability, try enabling multiple regions.

random swallow
# latent jackal Full handler code if you can send
  1. ComfyUI Loading Status:
    We have confirmed that ComfyUI loads successfully and is fully operational shortly after the container starts. The restart happens consistently around the 10-minute mark while the service is successfully running, not during startup.

  2. Multi-Region:
    Yes, we have enabled multi-region deployment to maximize availability.

  3. About the Handler Code (Crucial Context):
    This is where I suspect the issue lies. We are NOT using the standard Request/Response handling model.

Our architecture is a Kafka Consumer (Pull Model):

Instead of waiting for RunPod to deliver a request to a handler(event) function, our container starts a long-running Python script that runs a Kafka consumer loop.
The worker constantly listens to a Kafka topic, picks up jobs, processes them with ComfyUI, and sends results back to our backend.
Because of this, we may not be returning from the handler in the way RunPod Serverless expects: we never "finish" a request, because we are always listening.
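
To make the difference concrete, here is a minimal sketch of the two shapes. It is not our actual worker code. The names `consume_loop`, `poll`, `process`, `send_callback`, and `max_idle_polls` are hypothetical, the `{"input": {...}}` event layout is an assumption, and the ComfyUI call is replaced by a placeholder string. The sketch uses only the standard library; in a real serverless worker the handler would normally be registered with the RunPod Python SDK through `runpod.serverless.start({"handler": handler})`, and `poll` would wrap a Kafka client.

```python
from typing import Callable, Optional

Job = dict


def handler(event: Job) -> dict:
    """Request/response shape: take one job, return its result, and end the call."""
    prompt = event["input"]["prompt"]  # assumed event layout
    # Placeholder for the real ComfyUI inference call.
    return {"output": f"generated image for: {prompt}"}


def consume_loop(
    poll: Callable[[], Optional[Job]],
    process: Callable[[Job], dict],
    send_callback: Callable[[dict], None],
    max_idle_polls: Optional[int] = None,
) -> int:
    """Pull-model shape: keep polling a queue and push each result to a callback.

    With max_idle_polls=None the loop never returns, which is how our worker
    behaves. The limit exists only so the sketch can be exercised locally.
    """
    handled = 0
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        job = poll()
        if job is None:  # nothing available on the queue right now
            idle += 1
            continue
        idle = 0
        send_callback(process(job))
        handled += 1
    return handled


if __name__ == "__main__":
    # An in-memory list stands in for the Kafka topic.
    queue = [{"input": {"prompt": "a cat"}}, {"input": {"prompt": "a dog"}}]
    results = []
    count = consume_loop(
        poll=lambda: queue.pop(0) if queue else None,
        process=handler,
        send_callback=results.append,
        max_idle_polls=1,
    )
    print(count, results)
```

In the first shape the platform can observe each job starting and finishing. In the second, nothing is ever returned to the platform, which may be why the worker is treated as idle or unhealthy and recycled, though we have not confirmed that this is the cause of the ~10-minute restart.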

abstract scaffoldBOT
latent jackal
#

Press the green button to open a ticket.
