#Serverless Image generation times

terse walrus
#

Hey Runpod Team,
We’ve been experiencing noticeable instability on Runpod lately, primarily related to how requests are assigned to workers. It seems that queued requests are sometimes mapped to workers that are still pulling containers instead of to ready ones. This leads to long and inconsistent response times across our workloads. For a standard img2img workflow, here’s a recent snapshot from our monitoring:

{
"total_requests": 27,
"success_count": 27,
"fail_count": 0,
"avg_runtime_s": 45.41,
"min_runtime_s": 19.04,
"max_runtime_s": 123.57,
"median_runtime_s": 33.38,
"p90_runtime_s": 76.83,
"p95_runtime_s": 89.63
}

Historically, our median runtime has been around 25 seconds for this workflow, so the current instability represents a substantial deviation.
We don’t believe this is related to CUDA (that earlier issue was fixed by Runpod). The main issue seems to be the assignment of queued requests to non-ready (re-pulling) workers, which introduces unnecessary delays and wastes resources. This behavior is directly impacting our customers through increased response times and occasional service interruptions.
We are burning through money and customer trust, so I hope we can get some help. We already wrote to the Runpod team in our dedicated Slack channel, but have had no answer yet.
Could you please look into this and advise if there’s a way to prevent queued requests from being routed to workers that are still pulling containers? Any guidance on stabilizing this behavior would be greatly appreciated.
Thanks a lot for your support!
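For anyone who wants to compare their own numbers against the snapshot in this message, here is a minimal sketch of how such a summary could be computed from a list of per-request runtimes. The function names and the linear-interpolation percentile are assumptions for illustration; the original poster's monitoring tool may calculate percentiles differently.

```python
import json
import statistics


def percentile(sorted_values, pct):
    """Linear-interpolation percentile on an already sorted list (pct in 0..100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def summarize(runtimes_s, failures=0):
    """Build a snapshot with the same keys as the one posted in the thread."""
    ordered = sorted(runtimes_s)
    return {
        "total_requests": len(ordered) + failures,
        "success_count": len(ordered),
        "fail_count": failures,
        "avg_runtime_s": round(statistics.fmean(ordered), 2),
        "min_runtime_s": ordered[0],
        "max_runtime_s": ordered[-1],
        "median_runtime_s": round(statistics.median(ordered), 2),
        "p90_runtime_s": round(percentile(ordered, 90), 2),
        "p95_runtime_s": round(percentile(ordered, 95), 2),
    }


if __name__ == "__main__":
    sample = [19.0, 25.0, 33.0, 47.0, 124.0]  # made-up runtimes in seconds
    print(json.dumps(summarize(sample), indent=2))
```

Tracking the median next to p90/p95 is what makes the problem visible here: a few slow assignments barely move the median but pull the tail far out.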

gleaming brook
#

Hi @terse walrus, can you DM me your Slack channel name and I'll reach out?

terse walrus
#

DM sent 🙂

terse walrus
#

Hello,

On Serverless, requests are being assigned to pulling/non-ready workers, and we see inexplicable delay times (5s to 3min) for single jobs, even with a dozen ready workers.

Is there anything we can do about this?

This burns through vast amounts of money for requests that should finish in under 20s.
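One way to back this up with numbers is to separate the time a job waited before starting from the time it actually ran. The sketch below assumes each finished job record carries `delayTime` and `executionTime` in milliseconds; those field names, the record shape, and the 2000 ms budget are assumptions for illustration and should be checked against what your endpoint's status responses actually contain.

```python
def split_latency(jobs, delay_budget_ms=2000):
    """Separate queue delay from execution time for a batch of finished jobs.

    Each job is a dict with "id", "delayTime" and "executionTime" (both in ms).
    These keys are assumed for illustration, not taken from the thread.
    """
    slow = [job for job in jobs if job["delayTime"] > delay_budget_ms]
    total_delay = sum(job["delayTime"] for job in jobs)
    total_exec = sum(job["executionTime"] for job in jobs)
    total = total_delay + total_exec
    return {
        "jobs": len(jobs),
        # Jobs that waited longer than the budget before a worker picked them up.
        "slow_start_jobs": [job["id"] for job in slow],
        "slow_start_share": len(slow) / len(jobs) if jobs else 0.0,
        # Fraction of all wall-clock time that was spent waiting, not running.
        "delay_share_of_total": total_delay / total if total else 0.0,
    }


if __name__ == "__main__":
    example = [
        {"id": "job-a", "delayTime": 500, "executionTime": 19500},
        {"id": "job-b", "delayTime": 60000, "executionTime": 20000},
    ]
    print(split_latency(example))
```

If the slow jobs show normal execution time but a large delay, that points at assignment or cold start rather than at the workload itself.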

hexed agate
#

I've reported the same issue. This is using the load balancing endpoints with one active worker with a ready server. It happens randomly; I don't know the exact percentage, but probably around 5-10% of instances/workers.

terse walrus
#

Yes. We have traditional queuing workflows and also see these problems at around this percentage.

#

Last Run of our time measurement tooling:

{
  "total_requests": 90,
  "success_count": 90,
  "fail_count": 0,
  "avg_runtime_s": 33.47510890960693,
  "min_runtime_s": 13.903926134109497,
  "max_runtime_s": 97.0808629989624,
  "median_runtime_s": 25.49431037902832,
  "p60_runtime_s": 30.133028554916383,
  "p70_runtime_s": 35.32752711772919,
  "p80_runtime_s": 44.95820450782776,
  "p90_runtime_s": 63.619728827476536,
  "p95_runtime_s": 74.34830371141432
}
#

That's really a bummer.

hexed agate
#

I've recommended they run this test to locate the issue. Create a Locust file (Python load testing) and send 1-2 requests every second with image and text payloads, both as JSON (base64) and as multipart form with bytes data (1-30MB payloads), against a handler that adds a timeout and just returns the input. Every hour, increase the load to 50-100 requests per second for 10 minutes (to trigger an increase of 5-10 active workers), then go back to 1-2 requests every second. Run it for 7 days, with a script that automatically updates the Docker image every 2 hours or so to test new releases, and report back the error rate as a percentage plus the minimum, average, and maximum request times.
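The load profile described in this message can be sketched in plain Python before wiring it into a load tool. This is only a sketch under stated assumptions: the constants pick one value from each range mentioned (2 and 50 requests per second), bursts are assumed to start on each full hour after the first, and `target_rps`, `json_payload`, and the `prompt`/`image_b64` keys are hypothetical names. In Locust, a schedule like this could inform the `tick()` method of a `LoadTestShape` subclass, though that method returns a user count and spawn rate rather than a request rate.

```python
import base64
import json
import os

BASELINE_RPS = 2                # "1-2 requests every second"
BURST_RPS = 50                  # lower end of "50-100 requests per second"
BURST_EVERY_S = 3600            # one burst per hour
BURST_LENGTH_S = 600            # each burst lasts 10 minutes
TEST_LENGTH_S = 7 * 24 * 3600   # run for 7 days


def target_rps(elapsed_s):
    """Return the request rate for a point in the test, or None once it is over."""
    if elapsed_s >= TEST_LENGTH_S:
        return None
    # Assumption: the first burst starts after one hour of baseline traffic.
    past_first_hour = elapsed_s >= BURST_EVERY_S
    in_burst = past_first_hour and (elapsed_s % BURST_EVERY_S) < BURST_LENGTH_S
    return BURST_RPS if in_burst else BASELINE_RPS


def json_payload(size_bytes, prompt="echo test"):
    """Build a JSON body carrying random bytes as base64, like an image upload."""
    blob = os.urandom(size_bytes)
    return json.dumps(
        {"prompt": prompt, "image_b64": base64.b64encode(blob).decode("ascii")}
    )


if __name__ == "__main__":
    for t in (0, 3599, 3600, 4199, 4200):
        print(t, target_rps(t))
    print(len(json_payload(1024 * 1024)), "bytes of JSON for a 1 MB blob")
```

Base64 inflates the body by roughly a third, so a 30 MB binary payload becomes about 40 MB of JSON; that is worth keeping in mind when comparing the JSON and multipart variants.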

terse walrus
#

We do something similar and schedule random bursts of requests over a given time period, with a normal distribution of requests where the mean is around 43s.
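A burst schedule of this kind could look roughly like the sketch below. It reads the "mean around 43s" as the mean gap between bursts, which is an interpretation and not something the message confirms; the standard deviation, the minimum gap, and the name `burst_offsets` are likewise assumptions.

```python
import random


def burst_offsets(window_s, mean_gap_s=43.0, stddev_s=10.0, min_gap_s=1.0, seed=None):
    """Return send times (seconds from start) with normally distributed gaps.

    Gaps are drawn from a normal distribution and clipped at min_gap_s so the
    schedule never moves backwards in time.
    """
    rng = random.Random(seed)
    offsets = []
    t = 0.0
    while True:
        t += max(min_gap_s, rng.gauss(mean_gap_s, stddev_s))
        if t >= window_s:
            return offsets
        offsets.append(t)


if __name__ == "__main__":
    schedule = burst_offsets(600, seed=7)
    print(len(schedule), "bursts in 10 minutes")
    print([round(t, 1) for t in schedule[:5]])
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing latency numbers before and after a change.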

hexed agate
#

Nice! Yeah, most people who have active users will do something similar. I'm surprised Runpod does not do production testing for reliability or basic functionality testing (especially for the load balancing endpoints).

terse walrus
#

And my users are starting to complain about this…

#

And there is nothing I can do to mitigate this. It's an extraordinary situation.

hexed agate
#

Was the change recent, or has it always been this way?

terse walrus
#

We have been using Runpod in production since 2023, and it's a new situation for us…

hexed agate
#

Ah, cool, it's been stable until now?

terse walrus
#

I think we have only experienced outages where Runpod was not directly involved (the recent AWS outage, for example).

hexed agate
#

alright, that's good to know. Is the current issue across all regions?

#

Also, did the issue appear when you created a new release by updating the Docker image version? In the load balancing endpoint, there is a bug when you update the Docker image version: the server pings 200 but loses its outside connection.

terse walrus
#

It seems to be region-independent. We just updated CUDA; some endpoints have been in use for a few months without issues and now experience stuff like this, but the old problem with CUDA versions doesn't seem to be the main problem currently.

hexed agate
#

When I was doing prod testing with the same payload (0.5MB image) and internet connection, and one active and ready worker, most workers had a consistent 3 second response time, but 5-10% of workers had a random response time between 3 and 120 seconds.
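If the test client can record which worker served each request, a per-worker breakdown makes this pattern easy to spot. The sketch below is a hedged illustration: it assumes samples arrive as `(worker_id, seconds)` pairs, that a worker identifier is available to the client at all, and that "slow" means a worker median above twice the median across workers. All names and the threshold are hypothetical.

```python
from collections import defaultdict
from statistics import median


def slow_workers(samples, factor=2.0):
    """Flag workers whose median response time is far above the typical worker.

    samples: iterable of (worker_id, response_time_seconds) pairs.
    Returns (baseline_seconds, sorted list of flagged worker ids).
    """
    by_worker = defaultdict(list)
    for worker_id, seconds in samples:
        by_worker[worker_id].append(seconds)
    medians = {worker: median(times) for worker, times in by_worker.items()}
    # Baseline is the median of per-worker medians, so a few bad workers
    # cannot drag it upward the way a mean would.
    baseline = median(medians.values())
    flagged = sorted(w for w, m in medians.items() if m > factor * baseline)
    return baseline, flagged


if __name__ == "__main__":
    data = [
        ("w1", 3.0), ("w1", 3.2), ("w2", 3.1),
        ("w3", 45.0), ("w3", 110.0), ("w4", 2.9), ("w5", 3.0),
    ]
    print(slow_workers(data))
```

Sharing the flagged worker IDs with support gives them something concrete to look up, instead of only an aggregate percentage.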