All 27 workers throttled | Runpod | Page 1

stuck nymph Feb 21, 2024, 7:18 PM

#

Our company needs stable availability of at least 10 workers. Quite recently most, or even all, of our workers have been throttled. We have already spent more than $800-1000 on your service and would be very grateful if a stable number of the requested workers could be provided. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...

hidden reef Feb 21, 2024, 7:18 PM

#

Does your endpoint use network storage in RO region?

stuck nymph Feb 21, 2024, 7:19 PM

#

Network is in EU-CZ-1

#

Our company would be very grateful for the solution. The availability tends to stay the same for last few days. Due to huge waiting time we are losing money 😦 We were thinking of slowly increasing the amount up to 30+, but now we can't even have 5 stable working workers 😦

hidden reef Feb 21, 2024, 7:25 PM

#

Yeah looks like its basically a no-go in that region, you may want to consider setting up a new endpoint in either EU-SE-1 or EU-NO-1 regions. I had this same issue with EU-RO-1 and had to create a new endpoint.

stuck nymph Feb 21, 2024, 7:25 PM

#

The thing is that the network itself doesnt allow other regions, even if i deploy it to any location

hidden reef Feb 21, 2024, 7:26 PM

#

Yeah I created a new network volume as well.

#

Its very inconvenient but better than having down time and losing money.

quaint oyster Feb 21, 2024, 7:27 PM

#

https://discord.com/channels/912829806415085598/1194711850223415348

Can refer to this for how to copy data over in case downloading it from some other source is not an option

#

https://discord.com/channels/912829806415085598/1209602115262095420

Also was something we gave as a feedback to @pallid imp . Sadly the fact that serverless workers can get fully throttled across the board on a region i find frustrating / insane too

hidden reef Feb 21, 2024, 7:28 PM

#

Yeah it shouldn't happen that every single worker becomes throttled and brings down our production applications.

stuck nymph Feb 21, 2024, 7:29 PM

#

How often does this problem happen? We recently moved to serverless instead of gpu cloud, but the experience has been quite sad so far

quaint oyster Feb 21, 2024, 7:29 PM

#

stuck nymph How often does this problem happen? We recently moved to serverless instead of g...

Just wondering, how big are your models?

stuck nymph Feb 21, 2024, 7:29 PM

#

quaint oyster Just wondering, how big are your models?

about 3gb, one model

hidden reef Feb 21, 2024, 7:29 PM

#

Happens A LOT. Happened to me at least 3 or 4 times in the last 6 months.

stuck nymph Feb 21, 2024, 7:29 PM

#

probably even smaller

quaint oyster Feb 21, 2024, 7:29 PM

#

stuck nymph How often does this problem happen? We recently moved to serverless instead of g...

I think for the 4090s the 24gb Pro, it happens a decent amount. I try to avoid it and go 24gb + 48gb gpu.

Also if ur only 3gb

#

build it into the image instead

#

Ull get way way more flexibility

#

and less of this issue to where i dont have problems with those endpoints with 10+ workers

#

anything that is < 35gb

#

I build into my image

#

if it doesnt need dynamic switching

stuck nymph Feb 21, 2024, 7:30 PM

#

quaint oyster I think for the 4090s the 24gb Pro, it happens a decent amount. I try to avoid i...

Already using 24 + 24 pro. Where can i find more info about this method?

hidden reef Feb 21, 2024, 7:30 PM

#

All 24GB PRO in RO are gone, thats why all my workers in RO are throttled. In a matter of WEEKS, it went from high availability for 4090 to nothing and all my workers throttled

stuck nymph Feb 21, 2024, 7:30 PM

#

hidden reef Happens A LOT. Happened to me at least 3 or 4 times in the last 6 months.

And how long does it take to be resolved in average?

quaint oyster Feb 21, 2024, 7:31 PM

#

stuck nymph Already using 24 + 24 pro. Where can i find more info about this method?

When you select, select 1 on the 48pros, and 2 as the 24gb.

Also, if you build the model into the image, and get off network storage, ull be able to use all data centers not just ones tied to network volume

hidden reef Feb 21, 2024, 7:31 PM

#

stuck nymph And how long does it take to be resolved in average?

Weeks, months, I move to a new endpoint

quaint oyster Feb 21, 2024, 7:31 PM

#

stuck nymph And how long does it take to be resolved in average?

I saw someone recently @subtle spire who was throttled for an hour. so i suggest in ur situation, move to building the model into the image, and shouldnt be an issue

hidden reef Feb 21, 2024, 7:32 PM

#

48GB PRO is low availability, I don't recommend it

stuck nymph Feb 21, 2024, 7:32 PM

#

quaint oyster When you select, select 1 on the 48pros, and 2 as the 24gb. Also, if you build...

The thing is i am using automatic1111 + custom model + LORAs

hidden reef Feb 21, 2024, 7:32 PM

#

stuck nymph The thing is i am using automatic1111 + custom model + LORAs

Same here

quaint oyster Feb 21, 2024, 7:32 PM

#

Im just sharing what i have, i get high on 16gb, and 48pro at least for me with no network region

quaint oyster Feb 21, 2024, 7:33 PM

#

stuck nymph The thing is i am using automatic1111 + custom model + LORAs

dockerhub lets u have one private repo

#

that's what i do for my private stuff

#

unless u have more stuff

#

It always the 4090s that bottleneck me

hidden reef Feb 21, 2024, 7:33 PM

#

WTF shows LOW for me without a network volume

stuck nymph Feb 21, 2024, 7:33 PM

#

quaint oyster dockerhub lets u have one private repo

So you manually push volumes to dockerhub and build from image directly?

quaint oyster Feb 21, 2024, 7:34 PM

#

u could be right ashelyk, just found out im throttled across the board

quaint oyster Feb 21, 2024, 7:34 PM

#

stuck nymph So you manually push volumes to dockerhub and build from image directly?

No not push volumes to dockerhub

#

U can just do some function call in ur dockerfile to download the model

hidden reef Feb 21, 2024, 7:34 PM

#

Maybe became medium availability for a brief moment, workers are constantly moving around

quaint oyster Feb 21, 2024, 7:35 PM

#

https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/Dockerfile

#

here is an example
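
A minimal sketch of the build-time download idea described above, assuming the model is reachable at a plain HTTP(S) URL. The `MODEL_URL` value, the destination path, and the function name are hypothetical placeholders and are not taken from the linked Dockerfile. The intent is that a Dockerfile `RUN` step would call this once during the image build, so the weights live in an image layer instead of on a network volume.

```python
import shutil
import urllib.request
from pathlib import Path

# Hypothetical placeholders: replace with your own model location and target path.
MODEL_URL = "https://example.com/path/to/model.safetensors"
MODEL_DEST = Path("/models/model.safetensors")


def download_model(url: str, dest: Path) -> Path:
    """Stream the file at `url` to `dest`, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest

# During the image build you would call, for example:
#   download_model(MODEL_URL, MODEL_DEST)
```

Streaming with `shutil.copyfileobj` keeps memory use flat even for multi-gigabyte weights, which matters on small build machines.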

stuck nymph Feb 21, 2024, 7:35 PM

#

quaint oyster u could be right ashelyk, just found out im throttled across the board

this is so frustrating)))

stuck nymph Feb 21, 2024, 7:35 PM

#

quaint oyster U can just do some function call in ur dockerfile to download the model

ok i see wym

#

Thank you!

quaint oyster Feb 21, 2024, 7:35 PM

#

stuck nymph this is so frustrating)))

yea i asked flash about this before, and its b/c someone can just eat up all the gpus for their super big clients. Something im debating on is if i get fully throttled across the board, i use their graphql endpoint

#

to set a minimum of 2 active workers

#

to steal back workers

#

https://github.com/justinwlin/runpod-api

GitHub: justinwlin/runpod-api, a collection of Python scripts for calling the RunPod GraphQL API

#

@hidden reef got a repo on that

#

It isnt an instant switch

#

but better than getting fully throttled

#

it seems to respect minimum workers

#

and prioritize it
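
A rough sketch of the "force a minimum of 2 active workers when fully throttled" idea from the messages above. Everything here is an assumption for illustration: the shape of the health payload (`workers.idle`, `workers.running`, `workers.throttled`, `jobs.inQueue`) should be checked against what your endpoint actually returns, and `set_min_workers` is a hypothetical callable you would implement yourself against the GraphQL API.

```python
from typing import Callable, Mapping


def should_force_min_workers(health: Mapping) -> bool:
    """True when jobs are waiting and every worker we have is throttled.

    The field names below are assumptions about the health payload.
    """
    workers = health.get("workers", {})
    jobs = health.get("jobs", {})
    usable = workers.get("idle", 0) + workers.get("running", 0)
    return (
        jobs.get("inQueue", 0) > 0
        and usable == 0
        and workers.get("throttled", 0) > 0
    )


def reconcile(
    health: Mapping,
    set_min_workers: Callable[[int], None],  # hypothetical: wraps your API call
    forced_min: int = 2,
) -> int:
    """Set the minimum worker count to `forced_min` while stuck, otherwise 0."""
    target = forced_min if should_force_min_workers(health) else 0
    set_min_workers(target)
    return target
```

Dropping the minimum back to 0 once workers recover matters for cost, since active workers are billed while they sit idle.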

stuck nymph Feb 21, 2024, 7:36 PM

#

stuck nymph ok i see wym

And i will be able to use all data centers? The problem will be resolved or they still have this one sometimes even on the bigger amount of data centers?

quaint oyster Feb 21, 2024, 7:37 PM

#

stuck nymph And i will be able to use all data centers? The problem will be resolved or they...

Ull be able to use all data centers and not locked to a region

#

I think the problem will happen more rarely, @pallid imp supposedly has said if a worker is throttled for an hour, it terminates and switches it out, but that is crazy to me, why it would allow us to fall into an all worker throttle situation; also im not sure that really happens to be honest

#

so i recommend maybe to explore the minimum worker force scenario, b/c i ping the /health on my endpoint routinely

#

an ex of me pulling a minimum of 2 workers now

#

to forcefully get my workers back

#

maybe make ur numbers look like this

#

4090s are always eaten up, so should prob be the #3

#

or whatever the lowest number is

#

tbh idk what the numbers even do 🤷🤷🤷 which i complained about too

pallid imp Feb 21, 2024, 7:41 PM

#

are you mostly looking for A5000s and 24gb mostly?

stuck nymph Feb 21, 2024, 7:41 PM

#

yes

pallid imp Feb 21, 2024, 7:43 PM

#

EU-SE-1 is the best for that, EU-CZ-1 always has low quantity of those, and 3090s are always taken, were you looking for 3090s?

#

are you able to move storage?

stuck nymph Feb 21, 2024, 7:44 PM

#

we look for 24gb gpu, the model of gpu does not matter. I guess i can make a new storage in different data center

pallid imp Feb 21, 2024, 7:45 PM

#

you can either make a new endpoint, or switch your current one to use EU-SE-1, currently that one has the biggest capacity for 48gb and 24gb and 16gb but they do not have 4090s

quaint oyster Feb 21, 2024, 7:45 PM

#

pallid imp EU-SE-1 is the best for that, EU-CZ-1 always has low quantity of those, and 3090...

B/c he is a 3gb model, i think its better to just build into docker image in situations like that right? then he wouldn't be limited to a region?

#

and he can also just take out EU-CZ-1 from his region list

#

so he doesnt get assigned any there?

pallid imp Feb 21, 2024, 7:46 PM

#

yes i would never use network volume if you're running 1 static model

stuck nymph Feb 21, 2024, 7:46 PM

#

pallid imp yes i would never use network volume if your running 1 static model

ty! will try the method above

pallid imp Feb 21, 2024, 7:47 PM

#

yep pick global and it will automatically pick most available servers across all regions

#

EU-SE-1 has plenty of capacity but its also newer compared to most of other ones

quaint oyster Feb 21, 2024, 7:48 PM

#

pallid imp EU-SE-1 has plenty of capacity but its also newer compared to most of other ones

do u guys plan to make a chart or something detailing this information at some point 😦 😅

#

😔😔😔

#

or do we only have to get this anecdotally

pallid imp Feb 21, 2024, 7:48 PM

#

tbh im not using anything special, i just go click EU-SE-1 and see they're all high

#

but yes we do need to get better at showing availability, we also have a bug with network storage tab showing you wrong availability, we are working on fixing that this week

#

i def understand the frustration, it causes us stress as well, but solving scale for GPUs, its more complicated and requires big investment, we are trying to push towards all directions to be better at this

quaint oyster Feb 21, 2024, 7:57 PM

#

pallid imp i def understand the frustration, it causes us stress as well, but solving scale...

yeah still thx u runpod for making gpu / ml saas businesses a whole lot easier lol

pallid imp Feb 21, 2024, 7:59 PM

#

still many pain points as you can see, getting there by the day

hidden reef Feb 21, 2024, 8:51 PM

#

By the way not using network storage doesn't even help, this endpoint of mine doesn't use any network storage and almost all my workers are throttled, this is a serious problem with 24GB GPU, basically zero availability anywhere.

#

Massive problem, we have a stand at the PBX Expo in Las Vegas and this is impacting our product demonstrations 😡
CC: @dusk tree

#

I don't understand, because if I edit my endpoint, it says "High Availability" for 24GB yet basically all my workers are throttled.

quaint oyster Feb 21, 2024, 9:02 PM

#

hidden reef I don't understand, because if I edit my endpoint, it says "High Availability" f...

Not sure if this helps / u prob already did it, but I had to reset my max workers to 0, and then back to 12, and kick out EU-CZ-1 so I dont get assigned any of the GPUs from that region. I think the big problem with Runpod's workers right now is that they seem to only stay on the first assigned GPU. I had the same experience: after editing my endpoints I was also fully throttled until i forcefully refreshed all the workers.

Edit:
could setting minimum workers temporarily if the stand is active, temporarily relieve the issue? x.x..
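
The manual reset described above (max workers to 0, then back to 12) could be scripted roughly as follows. This is only a sketch: `set_max_workers` is a hypothetical callable wrapping whatever API call or script you use to edit the endpoint, and the pause length is a guess, not a documented value.

```python
import time
from typing import Callable


def reset_workers(
    set_max_workers: Callable[[int], None],  # hypothetical: edits the endpoint
    original_max: int,
    pause_seconds: float = 30.0,  # assumed wait for old workers to drain
) -> None:
    """Scale the endpoint to zero workers, wait, then restore the original max."""
    set_max_workers(0)
    try:
        time.sleep(pause_seconds)
    finally:
        # Always restore capacity, even if the wait is interrupted.
        set_max_workers(original_max)
```

The `finally` block is there so an interrupted script does not leave a production endpoint sitting at zero workers.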

quaint oyster Feb 21, 2024, 9:03 PM

#

hidden reef Massive problem, we have a stand at the PBX Expo in Las Vegas and this is impact...

/ @dusk tree / @pallid imp hopefully can chime in tho .-. i also am confused what the best steps are in these situations; if we edit the endpoint do we need to refresh all the workers? what is the expected procedure..

hidden reef Feb 21, 2024, 9:09 PM

#

Wow thats a major fail, if all my workers end up in CZ and get throttled, it should pick workers from somewhere else

hidden reef Feb 21, 2024, 9:09 PM

#

quaint oyster / @dusk tree / @pallid imp hopefully can chime in tho .-. i...

Good question, changing priority made zero difference, I had to scale workers down to zero and back up again which sucks

quaint oyster Feb 21, 2024, 9:10 PM

#

hidden reef Wow thats a major fail, if all my workers end up in CZ and get throttled, it sho...

Totally agree extremely frustrating

#

I moved all my endpoints to kick cz-1 out so im not assigned a bad region cause the priority algorithm rlly is bad and seems to do nothing

hidden reef Feb 21, 2024, 9:11 PM

#

I changed all my endpoints from 24GB to 48GB, 24GB tier is totally and utterly fucked up and completely unusable and nice how nobody from RunPod bothers to fucking respond when we have a fucking PRODUCTION ISSUE. THIS IS TOTALLY UNACCEPTABLE!!!!!!!!!!!!!!!!!!!!!!!!

#

I am looking for a new provider in the morning, RunPod is utter shit if you can't get support.

#

cc @past lance

quaint oyster Feb 21, 2024, 9:22 PM

#

quaint oyster / @dusk tree / @pallid imp hopefully can chime in tho .-. i...

https://discord.com/channels/912829806415085598/1209973235387474002

I agree, you guys need to change the priority algorithm, to something similar to my feedback. It at least needs to be visibly proactive trying to find workers, and start shifting at least two-three workers immediately out of throttle after like 5-10 seconds rather than letting it sit. Again, I have zero clue how the priority algorithm works, but we can't optimize anything to Runpod's specification cause there is nothing for us to specify. Honestly I'd even write my own priority algorithm if I could.
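
To make the proposal above concrete, here is a small sketch of the kind of policy being asked for: once a worker has been throttled past a short threshold, pick a few of the longest-stuck ones to move elsewhere. This illustrates the user's suggestion only. It is not RunPod's priority algorithm, and the `Worker` type, threshold, and move limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Worker:
    worker_id: str
    throttled_seconds: float  # 0 when the worker is not throttled


def pick_workers_to_relocate(
    workers: Iterable[Worker],
    threshold_seconds: float = 10.0,  # the "5-10 seconds" from the message
    max_moves: int = 3,               # the "two-three workers" from the message
) -> List[str]:
    """Return ids of the longest-throttled workers past the threshold."""
    stuck = [w for w in workers if w.throttled_seconds >= threshold_seconds]
    stuck.sort(key=lambda w: w.throttled_seconds, reverse=True)
    return [w.worker_id for w in stuck[:max_moves]]
```

Capping the number of moves per pass keeps a few workers shifting quickly without reshuffling the whole fleet at once.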

pallid imp Feb 21, 2024, 9:43 PM

#

hidden reef By the way not using network storage doesn't even help, this endpoint of mine do...

can you share endpoint id?

#

that seems like a bug

quaint oyster Feb 21, 2024, 9:48 PM

#

pallid imp can you share endpoint id?

Ill let @hidden reef ping his endpoint when he can, but b/c I experienced it too:
qie98s97wqvw4t

This one is mine. Ik ashelyk's is more production critical, but it seems like a bug with the priority algorithm then if me / him are both able to get fully throttled. I mean its fixed following the steps I said, to reset max workers to 0, shift my priorities around, kick CZ out, but I just wonder why I need to manually do this, and scale all my workers to 0 myself, rather than the priority algorithm handling this for me.

#

Also if the editing of workers is sensed and updated, it should really try to recalculate all the throttled workers and begin to try to shift them over if there is availability. i think that is why ashelyk / i were confused when editing our endpoints and nothing happened

pallid imp Feb 21, 2024, 9:51 PM

#

i see all 21 workers are idle, so whats likely happening is there is a huge spike of work which takes many gpus, and that slows down

quaint oyster Feb 21, 2024, 9:52 PM

#

pallid imp i see all 21 workers are idle, so whats likely happening is there is a huge spik...

U said the throttle is switched out every hour before, is it possible to move 2-3 of them actively before that hour is hit? Also I think its b/c he refreshed all his workers

#

#1209942179527663667 message

#

Where he had to scale them all to zero and back

pallid imp Feb 21, 2024, 9:52 PM

#

quaint oyster U said the throttle is switched out every hour before, is it possible to move 2-...

we will have to optimize that further but right now a huge spike will cause throttle and that will wind down after few mins

pallid imp Feb 21, 2024, 9:53 PM

#

hidden reef By the way not using network storage doesn't even help, this endpoint of mine do...

this is showing all idle now

quaint oyster Feb 21, 2024, 9:53 PM

#

pallid imp we will have to optimize that further but right now a huge spike will cause thro...

I think this is a bug then, its not a few mins

#

Yeah it is

quaint oyster Feb 21, 2024, 9:53 PM

#

pallid imp this is showing all idle now

b/c he changed it

#

but he obvs had the conversation longer than 3 mins

quaint oyster Feb 21, 2024, 9:54 PM

#

hidden reef By the way not using network storage doesn't even help, this endpoint of mine do...

maybe ashelyk can share his graph at a closer time scale but im sure he got fully throttle

pallid imp Feb 21, 2024, 9:54 PM

#

got it, so he must reset the workers

#

oh i do see throttle spike, then init spike so he must reset it

quaint oyster Feb 21, 2024, 9:54 PM

#

pallid imp got it, so he must reset the workers

Yeah, I guess, then my question is this a bug with the priority algorithm?

quaint oyster Feb 21, 2024, 9:54 PM

#

pallid imp oh i do see throttle spike, then init spike so he must reset it

What do u mean reset?

pallid imp Feb 21, 2024, 9:55 PM

#

set max to 0

quaint oyster Feb 21, 2024, 9:55 PM

#

pallid imp set max to 0

Okay, so there no way to do this automatically?

past lance Feb 21, 2024, 9:55 PM

#

it's not a bug as much as priority algo isn't good

pallid imp Feb 21, 2024, 9:55 PM

#

we do it automatically but it occurs hourly, will need to optimize that

past lance Feb 21, 2024, 9:55 PM

#

we're thinking to just allow users to set a quota per gpu type in addition to assigning launch priority

#

what happened in the past few days is that a few of our larger customers flexed up 600+ serverless workers

quaint oyster Feb 21, 2024, 9:56 PM

#

pallid imp we do it automatically but it occurs hourly, will need to optimize that

Is it possible to guarantee like a 2 worker minimum to do it immediately? I think that would even fix the current issues

#

ANd also if someone manually changes it to start searching for new gpus if any are throttled?

#

I guess the problem is that ashelyk had to manually scale to 0 in a production env

#

if we could even scale down to half and scale back up

#

that be nice

pallid imp Feb 21, 2024, 9:57 PM

#

quaint oyster Is it possible to guarantee like a 2 worker minimum to do it immediately? I thin...

yeah have to optimize that to take these conditions into account

quaint oyster Feb 21, 2024, 9:58 PM

#

pallid imp yeah have to optimize that to take these conditions into account

I see, i guess my next question is it possible for me to terminate workers through the graphql endpoint?

https://graphql-spec.runpod.io/#definition-PodStatus

Cause I want to write a script on my server to force minimum workers or terminate throttled workers if I have jobs in the queue, and I need it to be more proactive

#

Do I treat it like a pod?

pallid imp Feb 21, 2024, 10:00 PM

#

yes its similar, i plan to optimize this either way

quaint oyster Feb 21, 2024, 10:01 PM

#

pallid imp yes its similar, i plan to optimize this either way

Yeah i guess do u know when it will be estimated to be optimized?

i guess im looking into it cause I want to start feeding it more requests soon to my LLM / stuff, but Ill write the script depending on the time frame to just have minimum workers dynamically set if i have to

quaint oyster Feb 21, 2024, 10:06 PM

#

pallid imp yes its similar, i plan to optimize this either way

thank u tho, appreciate that the priority algorithm can be looked into / optimized / hopefully shared what its doing at some point too after reoptimized. I guess the fact that workers can sit in a throttled state for an hour is very poorly known.

pallid imp Feb 21, 2024, 10:07 PM

#

whats your endpoint id? let me check logs for it

quaint oyster Feb 21, 2024, 10:12 PM

#

pallid imp whats your endpoint id? let me check logs for it

I mean its not an issue for me, qie98s97wqvw4t b/c im not in a production env like ashelyk is, im just setting it up so that I can start testing and moving my whole pipeline through cause I was relying on ChatGPT and it was costing too much. But I commented bc when this conversation started I wanted to share how not using a network volume could give u better availability:
#1209942179527663667 message

I myself was throttled across the board in my to-be example of you shouldnt rely on network storage - but honestly, ive posted about this multiple times in the past too, and i guess as zeen said u guys have experienced insane uptick in the last 3 days

pallid imp Feb 21, 2024, 10:13 PM

#

planning on releasing optimizations tomorrow, have to tweak the knobs carefully otherwise it causes network issues

quaint oyster Feb 21, 2024, 10:14 PM

#

pallid imp planning on releasing optimizations tomorrow, have to tweak the knobs carefully ...

Great, Im glad. If those optimizations end up being released, do you think you can tell us what they end up being? So we are aware of what the changes are?

#

Thank you

#

https://discord.com/channels/912829806415085598/1209973235387474002

Again, I think the biggest issue for @hidden reef (and honestly anyone else who would be using runpod in production), and why it wouldn't be taken srsly, is b/c if u are fully throttled across the board and have no options to fix availability, that really is the worst nightmare.

pallid imp Feb 21, 2024, 10:15 PM

#

ill share what i can here

quaint oyster Feb 21, 2024, 10:15 PM

#

pallid imp ill share what i can here

thanks!

#

sorry for hammering u guys so much 👁️ know there is a lot behind the scenes

pallid imp Feb 21, 2024, 10:21 PM

#

we are here to support, something we need to optimize regardless

hidden reef Feb 22, 2024, 5:16 AM

#

past lance what happened in the past few days is that a few of our larger customers flexed ...

So basically what you are saying is that money is more important to RunPod than providing a stable service to all customers, and that RunPod can increase the number of workers for larger customers to such an extent that it takes down the endpoints of all other customers? 😡

#

@pallid imp my endpoint was idle because 24GB tier is unusable and I had to change it to 48GB tier and scale it down and back up again because editing the endpoint is shit and can't update automatically.

quaint oyster Feb 22, 2024, 5:22 AM

#

hidden reef @pallid imp my endpoint was idle because 24GB tier is unusable and I h...

Yeah, hopefully tho the coming changes that he proposes this week will fix it

#1209973235387474002 message

Definitely is an issue that I think they will work to address, and let's see where it goes. i am glad to see that the hour throttle will drop down to 4 mins to start swapping things around + allow movement with less restrictions so hopefully runpod's algorithm will be a heck lot more proactive

past lance Feb 22, 2024, 12:15 PM

#

hidden reef So basically what you are saying is that money is more important to RunPod than ...

No we had an internal discussion and all agreed that the quota shouldn't have been increased in this case.

stuck nymph Feb 22, 2024, 4:37 PM

#

@pallid imp i just want to thank you for your job and your product. Despite some throttling problems our company really appreciates the desire to fix problems instead of ignoring customers as most support team do

sage osprey Feb 22, 2024, 6:59 PM

#

I have a few questions here. What exactly is best practice when availability runs low in the region where we have a network volume. Should we keep endpoints active in multiple regions?

#

On a similar note, is there a best practice regarding when to use a network volume and when to bundle models into our image? If we have 20gb of models, should that all just be bundled or should we be using a network volume?

quaint oyster Feb 22, 2024, 7:01 PM

#

sage osprey On a similar note, is there a best practice regarding when to use a network volu...

I think this should be bundled, tbh. I find < 30gb for the compressed image shown on dockerhub quite safe, this is an example of my Mistral one.
https://hub.docker.com/layers/justinwlin/mistral7b_openllm/latest/images/sha256-47f901971ee95cd0d762fe244c4dd625a8bf7a0e0142e5bbd91ee76f61c8b6ef?context=repo

quaint oyster Feb 22, 2024, 7:02 PM

#

sage osprey On a similar note, is there a best practice regarding when to use a network volu...

Haha, I saw you respond in the different thread, but Ill continue to answer here

#

The number just comes from trial and error anecdotally

#

If you get too high, the download time to serverless initialization becomes impossible. So I find that < 30gb is reasonable first initialization time. Once you start pushing that boundary, I just find it personally a bit weird.
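
As a back-of-the-envelope illustration of why image size drives first-initialization time, the estimate below divides compressed image size by sustained pull throughput. The throughput figure is purely an assumption (real pull speeds vary by registry and data center), and the result ignores layer extraction time.

```python
def estimated_pull_seconds(image_gb: float, throughput_mb_per_s: float) -> float:
    """Rough lower bound on pull time: compressed size / sustained throughput."""
    return image_gb * 1000 / throughput_mb_per_s


# With an assumed 100 MB/s pull speed:
#   a 3 GB image  -> 30 seconds
#   a 30 GB image -> 300 seconds (5 minutes)
small = estimated_pull_seconds(3, 100)
large = estimated_pull_seconds(30, 100)
```

Under that assumption a 30 GB image already costs minutes per cold worker, which is consistent with the anecdotal ceiling mentioned above.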

sage osprey Feb 22, 2024, 7:04 PM

#

Ok, I'll give it a shot. That implies that I could ditch the network volume and use the global region which should help tremendously with availability.

quaint oyster Feb 22, 2024, 7:05 PM

#

The runpod base image, is what I tend to use, so there is some cost there, but if you want to optimize it to the core, I saved maybe 1-2 gbs, not using the runpod-pytorch as a starting point.

https://github.com/justinwlin/runpodWhisperx/blob/master/Dockerfile

But tbh, nowadays i just end up building on it cause it saves me a lot of headache:
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/Dockerfile

quaint oyster Feb 22, 2024, 7:06 PM

#

sage osprey Ok, I'll give it a shot. That implies that I could ditch the network volume and ...

Yeah, u wont get locked in per region. Also another thing is the priorities do matter. It tries to assign u a lot of whatever you put as (1) when you first initialize, I try to put a (1) priority on 24gb, or 48gb, but not on 24gb pro.

The 24gb pro and 48gb is like very similar in cost, but the 24gb pro just isn't worth the headaches it gives

#

I also got rid of the EU-CZ-1 region, cause I dont want to get assigned any GPUs from there cause that region got some availability issues around the 24gb pro it seems. Im sure the changes Flash is making will get throttled workers to move around way better, but I rather just not deal with it

#

example what i mean

sage osprey Feb 22, 2024, 7:15 PM

#

This is helpful, thank you Justin!

weak igloo Feb 23, 2024, 9:27 PM

#

Still encountering this issue trying to get 4090s as of this afternoon:

hidden reef Feb 23, 2024, 9:28 PM

#

Yep, all my workers are throttled again too, RunPod serverless is pretty unusable at the moment

#

I even have 2 different endpoints in different regions and they are both throttled

pallid imp Feb 23, 2024, 11:46 PM

#

4090s are too high in demand right now and more supply will be added in 1-2 weeks

hearty tulip Feb 24, 2024, 12:29 AM

#

48gbs were all throttled in CA today too.

hidden reef Feb 24, 2024, 6:18 AM

#

Yes my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB without network storage and thus no region affinity also fully throttled. Serverless is a joke. I'm an enterprise customer but all my endpoints are fully throttled and cannot get support from RunPod so I'm taking my business elsewhere because this is totally unacceptable @pallid imp @past lance @dusk tree

past lance Feb 24, 2024, 2:35 PM

#

hidden reef Yes my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB wi...

Hey, I know not much I can say after the fact can fix past pain, but we have made a few platform releases to improve the throttling in the past day as well as added more capacity (way more coming next week). We've got a lot of customers using serverless and we've experienced a spike in consumption that is just enormous, and we're trying our best to handle it. We apologize for affecting your business and we are trying our best to find a balance between action and messaging.

hearty tulip Feb 25, 2024, 12:36 AM

#

Still getting throttled constantly. Serverless doesn't seem viable in its current state. Bummer. The tech is cool.

marsh bay Feb 25, 2024, 4:26 AM

#

It’s insane to me that I’m just getting throttled out of the blue without a heads-up

#

All of my workers just won’t start and every previously working GPU is now unavailable

#

This happened yesterday in EU-SE-1 and now today in EU-NO-1

#

What’s happening? @past lance @pallid imp @dusk tree

hidden reef Feb 25, 2024, 6:37 AM

#

Looks like RunPod may have fixed something around 3.5 hours ago, all my endpoints' throttled workers seem to have recovered around the same time.

hidden reef Feb 25, 2024, 8:41 AM

#

Looks like I spoke too soon, they looked better for a short while, now getting throttled again.

marsh bay Feb 25, 2024, 9:14 AM

#

This sucks

hidden reef Feb 25, 2024, 9:20 AM

#

marsh bay This sucks

Basically no GPUs available in NO, SE has some 16GB and 24GB

#

SE

#

NO

#

I don't understand whats going on though because in NO I have no throttled workers.

amber scroll Feb 25, 2024, 10:54 AM

#

same issue with throttled workers... personally think RunPod has to scale up at this point ASAP

previously we could get by on just using A5000s and only 4090s were in throttling hell... but now, even that is throttled indefinitely

the issue has been happening for several days now, and the obvious solution of "just use 'active workers' " isn't really viable at our small scale, because doing that would be just like paying for the machines directly...

we are running a community supported project

marsh bay Feb 25, 2024, 10:57 AM

#

The lack of communication is really concerning

digital vigil Feb 25, 2024, 11:08 AM

#

Same here, running production site. This happened to me before (I moved from US to EU) for availability and now it happened in EU again.

pallid imp Feb 25, 2024, 2:18 PM

#

we have tweaked the algos but at certain points in the day the spikes eat up all the capacity, we are adding more gpus this week for A5000 and 4090s

hidden reef Feb 25, 2024, 2:19 PM

#

I think you need to add more network capacity too, too many machines on the same network seems to be causing issues where everyone is experiencing slow speeds, serverless getting connection timed out issues, people's pods disappearing etc etc.

#

I just had to terminate workers for an endpoint because they were getting stuck for 5mins on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck eating up all my credits and the job doesn't even get processed, it gets stuck on IN_PROGRESS.

#

My manager has demanded a refund for this because its unacceptable.

amber scroll Feb 25, 2024, 5:24 PM

#

> I just had to terminate workers for an endpoint because they were getting stuck for 5mins on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck eating up all my credits and the job doesn't even get processed, it gets stuck on IN_PROGRESS.
> My manager has demanded a refund for this because its unacceptable.

This also happens to us... we were getting charged for 10+ minutes for a worker that kept "queueing image for pull" and the job was still IN_QUEUE... I was gonna report it but I didn't know if we were actually being charged or if it was just a UI thing

#

we chewed through $3 of credits in ~24 hours when we usually only spend $0.74/day as per our size... and our jobs only took 2-3s

amber scroll Feb 25, 2024, 5:26 PM

#

amber scroll > I just had to terminate workers for an endpoint because they were getting stuc...

it actually happened twice, and that was when I was there to see it... so it's definitely been doing that multiple times per hour

hidden reef Feb 25, 2024, 5:29 PM

#

@amber scroll are you using latest tag for your Docker image?

amber scroll Feb 25, 2024, 5:29 PM

#

hidden reef @amber scroll are you using `latest` tag for your Docker image?

we have our own tagging system that tags images based on the commit message

#

I don't think it's very much relevant to the issue, but the tag was sm-q, hosted on our private docker registry

hidden reef Feb 25, 2024, 5:31 PM

#

amber scroll I don't think it's very much relevant to the issue, but the tag was `sm-q`, host...

Is it possible to push a new image to the same tag?

amber scroll Feb 25, 2024, 5:32 PM

#

I guess so-? but runpod caches the images per-datacenter, so that usually just happens in development... which is why we have semver for dev images

#

the image pulls just fine and we use it in prod, the issue is on the worker's side... infinitely "queuing image for pull" and us getting charged for a job that's not even in progress

amber scroll Feb 25, 2024, 6:30 PM

#

the issue occurred again:

2024-02-25T18:28:18Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:34Z create container ***/serverless-llm:sm-q
2024-02-25T18:28:34Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:51Z create container ***:sm-q
2024-02-25T18:28:51Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:29:07Z create container ***/serverless-llm:sm-q
2024-02-25T18:29:07Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.

it's been doing that for 3 minutes.

#

and we're getting charged for it...

#

so far in the past 30 minutes, runpod has chewed through 10 cents

#

if we calculate how many requests that would've been: 0.1 / (0.00026 * 3), it means we should've been receiving ~128.21 requests in the past 30 minutes
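
A minimal sketch of that estimate, using only the figures quoted in the message above. Reading 0.00026 as a price in dollars per second and 3 as the seconds per job is an assumption (the thread does not state the units), and the variable names are illustrative.

```python
# Figures quoted in the thread; the units are an assumption (see lead-in).
spent_usd = 0.10               # credits deducted over the past 30 minutes
rate_usd_per_second = 0.00026  # assumed per-second price
seconds_per_job = 3            # jobs were said to take 2-3 s

# Cost of one job if only execution time were billed.
cost_per_job = rate_usd_per_second * seconds_per_job

# Number of jobs that would explain the deduction.
implied_jobs = spent_usd / cost_per_job

print(f"cost per job: ${cost_per_job:.5f}")
print(f"implied jobs in 30 minutes: {implied_jobs:.2f}")
```

If the real request count is far below the implied figure, the deduction cannot be explained by execution time alone under these assumptions.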

#

this doesn't look like 128 requests to me:

#

not even close

#

@pallid imp sorry for the direct ping but uh, it's actually chewing through our balance, another 8 cents has just been deducted.

what do we do?

#

2 cents deducted out of nowhere, there are no jobs running across all endpoints

pallid imp Feb 25, 2024, 6:37 PM

#

its just 1 worker? terminate that for now, ill look into the bug

amber scroll Feb 25, 2024, 6:38 PM

#

we tried setting max worker count to 8 to try and see if that will improve the delay time... it didn't

pallid imp Feb 25, 2024, 6:38 PM

#

due to throttled workers?

amber scroll Feb 25, 2024, 6:39 PM

#

Yupp

pallid imp Feb 25, 2024, 6:40 PM

#

higher max workers can help but ideally much of the compute is saturated and expansion is already planned this week for some gpus

amber scroll Feb 25, 2024, 6:40 PM

#

What we're also thinking is that it might be deducting from cancelled jobs

the timer goes up each refresh, and these jobs were previously cancelled due to them taking too long... and our systems just cancelled them to prevent too much usage...

the timeout is set to 120s (queueing included)
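
A client-side timeout like the one described (120 s, queueing included) could be sketched roughly as follows. This is not the posters' actual system: `get_status` and `cancel` are hypothetical callables standing in for whatever API client is used, and the set of terminal status names is an assumption (only IN_QUEUE and IN_PROGRESS appear in this thread).

```python
import time
from typing import Callable

# Assumed terminal status names; only IN_QUEUE and IN_PROGRESS are
# mentioned in the thread itself.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def wait_or_cancel(
    job_id: str,
    get_status: Callable[[str], str],  # hypothetical: returns a status string
    cancel: Callable[[str], None],     # hypothetical: cancels the job
    timeout_s: float = 120.0,          # total budget, queueing included
    poll_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it finishes; cancel it once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        status = get_status(job_id)
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            # Still queued or running after the budget: give up and cancel.
            cancel(job_id)
            return "CANCELLED"
        sleep(poll_s)
```

The clock and sleep functions are injectable so the loop can be exercised in tests without waiting in real time.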

pallid imp Feb 25, 2024, 6:41 PM

#

cancelled won't charge once triggered, we stop those workers running the job

amber scroll Feb 25, 2024, 6:54 PM

#

holy crap

#

I think the best way to go for now is to shut down our AI chatbot feature until this infrastructure issue is fixed

#

we can't have our contributors' money wasted over runpod's scaling issue

#

if this goes unwatched, who knows how much money it'll siphon out

#

and we aren't certain if we're going to get refunded for this

#

tried contacting sales... welp.

amber scroll Feb 25, 2024, 7:26 PM

#

currently trying to run a smaller version of our model on 16GB temporarily
1 week of downtime is too big of an impact for us apparently

silver pulsar Feb 27, 2024, 5:22 AM

#

@amber scroll hey was your issue ever resolved? I looked through my logs and saw a sudden huge spike in credit consumption for just a couple jobs. It looks like the "delay" time it took to even run the job was counted into the actual gpu usage :T

#

I'd like to add it was also on the same dates as your issues. Feb 24/25

amber scroll Feb 27, 2024, 5:31 AM

#

silver pulsar @amber scroll hey was your issue ever resolved? I looked through my logs...

got in touch with sales, they gave back the burned credits based on our 30 day average

Right now we're running the model on 16GB which is a bit more expensive due to the longer inference time (despite being 30% cheaper, the model took 60% longer to produce output)

so ideally we should go back to 24GB, but we'll have to wait for RunPod's announcement regarding GPU availability... According to sales:

"It's probably going to be a gradient over time rather than a binary state of being resolved/not resolved since we add more capacity on a weekly/biweekly basis; we do announce big supply adds on Discord when they come through so that's probably the best way to keep updated"

which is their answer when I asked "if/when the issue would get resolved"
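
On the 16GB versus 24GB trade-off mentioned above, a rough worked check, assuming cost scales with runtime at a per-second price and that "30% cheaper" refers to that price while "60% longer" refers to inference time (neither is spelled out in the thread):

```python
# Ratios relative to the 24GB setup; both figures come from the message above.
price_ratio = 1 - 0.30   # 16GB is 30% cheaper per unit of time (assumed)
time_ratio = 1 + 0.60    # the model takes 60% longer to produce output

# Cost per request scales with price times duration.
cost_ratio = price_ratio * time_ratio

print(f"16GB cost per request relative to 24GB: {cost_ratio:.2f}x")
```

Under these assumptions each request costs about 12% more on 16GB, which is consistent with the "a bit more expensive" remark.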

silver pulsar Feb 27, 2024, 5:32 AM

#

amber scroll got in touch with sales, they gave back the burned credits based on our 30 day a...

Thanks a ton for the response! I contacted them directly as well for now. Good to hear your side got (mostly? kinda?) resolved :]

amber scroll Feb 27, 2024, 5:33 AM

#

Still not fully resolved but at least they refunded the credits xd

silver pulsar Feb 27, 2024, 5:35 AM

#

Job execution times are normal, but the delay time caused a huge spike in credit consumption :[

#

Good to hear they refunded your side. Hoping for the same

hidden reef Feb 27, 2024, 6:15 AM

#

How do you contact sales? I need to contact them for a refund too..

silver pulsar Feb 27, 2024, 6:56 AM

#

hidden reef How do you contact sales? I need to contact them for a refund too..

I used the chat on their site. It's in the bottom right

dusk tree Mar 1, 2024, 3:48 AM

#

Hey @amber scroll @silver pulsar @hidden reef
I onboarded a huge load of hardware. However, the minimum RunPod should be able to do is provide high-quality communication, which I see wasn't ideal.

Zhen, Pardeep, Justin and I have been pushing hard on at least 5 different features to make Serverless much better at managing huge loads. Secondly, we hired 3 support staff and 2 cloud engineers, and are looking for more support engineers as well. Communications must improve; and it will, trust me.

That being said, we value relationships above all else. All else. Hit me up in private and we will provide compensation for you.

amber scroll Mar 1, 2024, 4:03 AM

#

dusk tree Hey @amber scroll @silver pulsar @hidden reef I onboard...

That's a great resolution!

silver pulsar Mar 1, 2024, 4:16 AM

#

dusk tree Hey @amber scroll @silver pulsar @hidden reef I onboard...

For now I DMed you. Thank you for the ping!

amber scroll Mar 1, 2024, 4:24 AM

#

moved into DMs
