Hello RunPod Support Team,
I'm experiencing an issue with our serverless deployment: containers automatically shut down after approximately 10 minutes of uptime and are replaced by a new container, even though we have set the timeout settings to their maximum values.
Current Configuration:
Idle Timeout: 3600 seconds (maximum)
Execution Timeout: 3600 seconds (maximum, enabled)
Issue:
The container still shuts down after ~10 minutes and a new instance is spawned in its place, which disrupts our workflow.
Our Use Case:
We're running a ComfyUI image for an online image generation service with the following architecture:
Users upload images through our backend
User data is stored in Kafka
GPU workers pull tasks from Kafka for inference
Results are sent back to the backend via callback
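For reference, the worker loop described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than our production code: the message fields (task_id, image_url, callback_url) and all function names are hypothetical, and the Kafka consumer, the ComfyUI inference call, and the HTTP callback are passed in as plain Python callables instead of real clients.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Task:
    """One inference request as stored in Kafka (field names are hypothetical)."""
    task_id: str
    image_url: str
    callback_url: str


def parse_task(raw: bytes) -> Task:
    """Decode a JSON-encoded Kafka message value into a Task."""
    data = json.loads(raw.decode("utf-8"))
    return Task(data["task_id"], data["image_url"], data["callback_url"])


def run_worker(
    messages: Iterable[bytes],
    infer: Callable[[Task], dict],
    send_callback: Callable[[str, dict], None],
) -> int:
    """Pull tasks, run inference, and report each result to the backend.

    `messages` stands in for the values yielded by a Kafka consumer,
    `infer` for the ComfyUI inference call, and `send_callback` for the
    HTTP POST to the backend. Returns the number of tasks processed.
    """
    processed = 0
    for raw in messages:
        task = parse_task(raw)
        result = infer(task)
        # Include the task id so the backend can match the result to the upload.
        send_callback(task.callback_url, {"task_id": task.task_id, **result})
        processed += 1
    return processed
```

The point of the sketch is that each GPU worker runs a long-lived loop and pulls its own work from Kafka, so the container needs to stay up between tasks.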
Questions:
Is there a way to keep containers running continuously without automatic shutdown?
Given our Kafka-based task queue architecture, would Serverless be the appropriate solution, or should we consider switching to Pod services instead?
We've noticed that Pod availability for L40 and 4090 is often very limited (single-digit availability). Are there any recommendations for ensuring stable GPU availability?