#All 27 workers throttled


stuck nymph
#

Our company needs stable availability of a minimum of 10 workers. Quite recently, most or even all of our workers have been throttled. We have already spent more than $800-1000 on your service and would be grateful for a stable supply of the requested workers. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...

hidden reef
#

Does your endpoint use network storage in RO region?

stuck nymph
#

Network is in EU-CZ-1

#

Our company would be very grateful for the solution. The availability tends to stay the same for last few days. Due to huge waiting time we are losing money 😦 We were thinking of slowly increasing the amount up to 30+, but now we can't even have 5 stable working workers 😦

hidden reef
#

Yeah, looks like it's basically a no-go in that region. You may want to consider setting up a new endpoint in either the EU-SE-1 or EU-NO-1 region. I had this same issue with EU-RO-1 and had to create a new endpoint.

stuck nymph
#

The thing is that the network volume itself doesn't allow other regions, even if I deploy the endpoint to any location

hidden reef
#

Yeah I created a new network volume as well.

#

Its very inconvenient but better than having down time and losing money.

quaint oyster
hidden reef
#

Yeah it shouldn't happen that every single worker becomes throttled and brings down our production applications.

stuck nymph
#

How often does this problem happen? We recently moved to serverless instead of GPU cloud, but the experience has been quite sad so far

quaint oyster
stuck nymph
hidden reef
#

Happens A LOT. Happened to me at least 3 or 4 times in the last 6 months.

stuck nymph
#

probably even smaller

quaint oyster
#

build it into the image instead

#

Ull get way way more flexibility

#

and less of this issue, to the point where i dont have problems on those endpoints with 10+ workers

#

anything that is < 35gb

#

I build into my model

#

if it doesnt need dynamic switching

stuck nymph
hidden reef
#

All 24GB PRO in RO are gone, that's why all my workers in RO are throttled. In a matter of WEEKS it went from high availability for 4090s to nothing, with all my workers throttled

stuck nymph
quaint oyster
hidden reef
quaint oyster
hidden reef
#

48GB PRO is low availability, I don't recommend it

stuck nymph
quaint oyster
#

Im just sharing what i have, i get high on 16gb, and 48pro at least for me with no network region

quaint oyster
#

that's what i do for my private stuff

#

unless u have more stuff

#

It always the 4090s that bottleneck me

hidden reef
#

WTF shows LOW for me without a network volume

stuck nymph
quaint oyster
#

u could be right ashelyk, just found out im throttled across the board

quaint oyster
#

U can just do some function call in ur dockerfile to download the model

hidden reef
#

Maybe became medium availability for a brief moment, workers are constantly moving around

quaint oyster
#

here is an example
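
The example attached to this message is not preserved in the log. As a stand-in, here is a minimal sketch of the idea of downloading the model while the image is being built, so the weights are baked into the image instead of living on a network volume. The script name `download_model.py`, the `MODEL_URL` and `MODEL_DIR` variables, and the `/models` default are all hypothetical and not taken from the thread.

```python
"""download_model.py: fetch model weights at image build time (sketch)."""
import os
import shutil
import urllib.request
from pathlib import Path


def download_model(url: str, dest_dir: str) -> Path:
    """Stream the file at `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the last URL path segment as the file name.
    target = dest / url.rstrip("/").rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all in memory
    return target


# MODEL_URL and MODEL_DIR are hypothetical variables set in the Dockerfile.
# Nothing is downloaded when MODEL_URL is unset.
if os.environ.get("MODEL_URL"):
    saved = download_model(
        os.environ["MODEL_URL"], os.environ.get("MODEL_DIR", "/models")
    )
    print(f"model saved to {saved}")
```

In a Dockerfile this would typically be copied in and invoked from a `RUN` step (for example `RUN python download_model.py`), so the download happens once at build time and every worker starts with the weights already on disk.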

stuck nymph
stuck nymph
#

Thank you!

quaint oyster
# stuck nymph this is so frustrating)))

yea i asked flash about this before, and its b/c someone can just eat up all the gpus for their super big clients. Something im debating on is if i get fully throttled across the board, i use their graphql endpoint

#

to set a minimum of 2 active workers

#

to steal back workers

#

@hidden reef got a repo on that

#

It isnt an instant switch

#

but better than getting fully throttled

#

it seems to respect minimum workers

#

and prioritize it
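
A rough sketch of what "use their graphql endpoint to set a minimum of 2 active workers" could look like. The URL, the `saveEndpoint` mutation, the `EndpointInput` type, and the `workersMin` field are assumptions that should be checked against RunPod's GraphQL documentation; the real mutation may also require other fields (name, GPU types, template) to be sent along. The endpoint id in the usage note is a placeholder.

```python
import json
import urllib.request

# Assumed URL; verify against RunPod's GraphQL documentation.
GRAPHQL_URL = "https://api.runpod.io/graphql"


def build_min_workers_payload(endpoint_id: str, workers_min: int) -> dict:
    """Build a GraphQL request body that sets an endpoint's minimum workers."""
    # Mutation, type, and field names are assumptions, not confirmed by this thread.
    query = (
        "mutation SetMinWorkers($input: EndpointInput!) {"
        " saveEndpoint(input: $input) { id workersMin } }"
    )
    return {
        "query": query,
        "variables": {"input": {"id": endpoint_id, "workersMin": workers_min}},
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{GRAPHQL_URL}?api_key={api_key}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would be something like `send(build_min_workers_payload("your-endpoint-id", 2), api_key)`, followed by a second call with `0` once the throttling clears, since minimum (active) workers are billed while they are up.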

stuck nymph
# stuck nymph ok i see wym

And i will be able to use all data centers? The problem will be resolved or they still have this one sometimes even on the bigger amount of data centers?

quaint oyster
#

I think the problem will happen more rarely, @pallid imp supposedly has said if a worker is throttled for an hour, it terminates and switches it out, but that is crazy to me, why it would allow us to fall into an all worker throttle situation; also im not sure that really happens to be honest

#

so i recommend maybe to explore the minimum worker force scenario, b/c i ping the /health on my endpoint routinely

#

an ex of me pulling a minimum of 2 workers now

#

to forcefully get my workers back
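
The screenshot for this message is not preserved. The "ping /health, then force minimum workers" idea could be sketched roughly as below. The shape of the health response (a `workers` object with per-state counts, including `throttled`) is an assumption about what the endpoint's `/health` route returns, and both function names are made up for illustration.

```python
def all_workers_throttled(health: dict) -> bool:
    """True when the endpoint has workers and every one of them is throttled."""
    workers = health.get("workers", {})
    total = sum(workers.values())
    return total > 0 and workers.get("throttled", 0) == total


def desired_min_workers(health: dict, fallback_min: int = 2) -> int:
    """Request `fallback_min` active workers only in the fully throttled case."""
    return fallback_min if all_workers_throttled(health) else 0


# Example with a hand-written response in the assumed shape.
sample = {"workers": {"idle": 0, "running": 0, "throttled": 12}}
print(desired_min_workers(sample))  # 2
```

A scheduled job could feed the real `/health` response into `desired_min_workers` and only touch the endpoint settings when the returned value changes, which keeps the paid active workers limited to the periods where everything is throttled.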

#

maybe make ur numbers look like this

#

4090s are always eaten up, so should prob be the #3

#

or whatever the lowest number is

#

tbh idk what the numbers even do 🤷🤷🤷 which i complained about too

pallid imp
#

are you mostly looking for A5000s and 24gb mostly?

stuck nymph
#

yes

pallid imp
#

EU-SE-1 is the best for that, EU-CZ-1 always has low quantity of those, and 3090s are always taken, were you looking for 3090s?

#

are you able to move storage?

stuck nymph
#

we look for 24gb gpu, the model of gpu does not matter. I guess i can make a new storage in different data center

pallid imp
#

you can either make a new endpoint, or switch your current one to use EU-SE-1, currently that one has the biggest capacity for 48gb and 24gb and 16gb but they do not have 4090s

quaint oyster
#

and he can also just take out EU-CZ-1 from his region list

#

so he doesnt get assigned any there?

pallid imp
#

yes i would never use network volume if your running 1 static model

stuck nymph
pallid imp
#

yep pick global and it will automatically pick most available servers across all regions

#

EU-SE-1 has plenty of capacity but its also newer compared to most of other ones

quaint oyster
#

😔😔😔

#

or do we only have to get this anecdotally

pallid imp
#

tbh im not using anything special, i just go click EU-SE-1 and see their all high

#

but yes we do need to get better at showing availability, we also have a bug with network storage tab showing you wrong availability, we are working on fixing that this week

#

i def understand the frustration, it causes us stress as well, but solving scale for GPUs, its more complicated and requires big investment, we are trying to push towards all directions to be better at this

quaint oyster
pallid imp
#

still many pain points as you can see, getting there by the day

hidden reef
#

By the way not using network storage doesn't even help, this endpoint of mine doesn't use any network storage and almost all my workers are throttled, this is a serious problem with 24GB GPU, basically zero availability anywhere.

#

Massive problem, we have a stand at the PBX Expo in Las Vegas and this is impacting our product demonstrations 😡
CC: @dusk tree

#

I don't understand, because if I edit my endpoint, it says "High Availability" for 24GB yet basically all my workers are throttled.

quaint oyster
# hidden reef I don't understand, because if I edit my endpoint, it says "High Availability" f...

Not sure if this helps / u prob already did it, but I had to reset my max workers to 0, and then back to 12, and kick out EU-CZ-1 so I dont get assigned any of the GPUs from that region. I think the big problem with Runpod's workers right now is that they seem to only stay on the first assigned GPU. I had the same experience: after editing my endpoints I was also fully throttled until i forcefully refreshed all the workers.

Edit:
could temporarily setting minimum workers while the stand is active relieve the issue? x.x..

quaint oyster
hidden reef
#

Wow thats a major fail, if all my workers end up in CZ and get throttled, it should pick workers from somewhere else

hidden reef
quaint oyster
#

I moved all my endpoints to kick cz-1 out so im not assigned a bad region cause the priority algorithm rlly is bad and seems to do nothing

hidden reef
#

I changed all my endpoints from 24GB to 48GB, 24GB tier is totally and utterly fucked up and completely unusable and nice how nobody from RunPod bothers to fucking respond when we have a fucking PRODUCTION ISSUE. THIS IS TOTALLY UNACCEPTABLE!!!!!!!!!!!!!!!!!!!!!!!!

#

I am looking for a new provider in the morning, RunPod is utter shit if you can't get support.

#

cc @past lance

quaint oyster
# quaint oyster / <@244335936031293440> / <@210036719238512640> hopefully can chime in tho .-. i...

https://discord.com/channels/912829806415085598/1209973235387474002

I agree, you guys need to change the priority algorithm, to something similar to my feedback. It at least needs to be visibly proactive trying to find workers, and start shifting at least two-three workers immediately out of throttle after like 5-10 seconds rather than letting it sit. Again, I have zero clue how the priority algorithm works, but we can't optimize anything to Runpod's specification cause there is nothing for us to specify. Honestly I'd even write my own priority algorithm if I could.

pallid imp
#

that seems like a bug

quaint oyster
# pallid imp can you share endpoint id?

Ill let @hidden reef ping his endpoint when he can, but b/c I experienced it too:
qie98s97wqvw4t

This one is mine. Ik ashelyk's is more production critical, but it seems like a bug with the priority algorithm then if me / him are both able to get fully throttled. I mean its fixed following the steps I said, to reset max workers to 0, shift my priorities around, kick CZ out, but I just wonder why I need to manually do this, and scale all my workers to 0 myself, rather than the priority algorithm handling this for me.

#

Also if the editing of workers is sensed and updated, it should really try to recalculate all the throttled workers and begin to try to shift them over if there is availability, i think that is why ashelyk / i was confused, when editing our endpoint and nothing happens

pallid imp
#

i see all 21 workers are idle, so whats likely happening is there is a huge spike of work which takes many gpus, and that slows down

quaint oyster
#

Where he had to scale them all to zero and back

pallid imp
quaint oyster
#

Yeah it is

quaint oyster
#

but he obvs had the conversation longer than 3 mins

quaint oyster
pallid imp
#

got it, so he must reset the workers

#

oh i do see throttle spike, then init spike so he must reset it

quaint oyster
quaint oyster
pallid imp
#

set max to 0

quaint oyster
past lance
#

it's not a bug as much as priority algo isn't good

pallid imp
#

we do it automatically but it occurs hourly, will need to optimize that

past lance
#

we're thinking to just allow users to set a quota per gpu type in addition to assigning launch priority

#

what happened in the past few days is that a few of our larger customers flexed up 600+ serverless workers

quaint oyster
#

ANd also if someone manually changes it to start searching for new gpus if any are throttled?

#

I guess the problem is that ashelyk had to manually scale to 0 in a production env

#

if we could even scale down to half and scale back up

#

that be nice

pallid imp
quaint oyster
#

Do I treat it like a pod?

pallid imp
#

yes its similar, i plan to optimize this either way

quaint oyster
# pallid imp yes its similar, i plan to optimize this either way

Yeah i guess do u know when it will be estimated to be optimized?

i guess im looking into it cause I want to start feeding it more requests soon to my LLM / stuff, but Ill write the script depending on the time frame to just have minimum workers dynamically set if i have to

quaint oyster
pallid imp
#

whats your endpoint id? let me check logs for it

quaint oyster
# pallid imp whats your endpoint id? let me check logs for it

I mean its not an issue for me, qie98s97wqvw4t b/c im not in a production env like ashelyk is, im just setting it up so that I can start testing and moving my whole pipeline through cause I was relying on ChatGPT and it was costing too much. But I commented in bc when this conversation started, and I wanted to share how not using a network volume could give u better availability:
#1209942179527663667 message

I myself was throttled across the board in my to-be example of you shouldnt rely on network storage - but honestly, ive posted about this multiple times in the past too, and i guess as zeen said u guys have experienced insane uptick in the last 3 days

pallid imp
#

planning on releasing optimizations tomorrow, have to tweak the knobs carefully otherwise it causes network issues

quaint oyster
#

Thank you

#

https://discord.com/channels/912829806415085598/1209973235387474002

Again, I think the biggest issue for @hidden reef (and honestly for anyone else using runpod in production), and why it wouldn't be taken srsly, is that if u are fully throttled across the board and have no options to fix availability, that really is the worst nightmare.

pallid imp
#

ill share what i can here

quaint oyster
#

sorry for hammering u guys so much 👁️ know there is a lot behind the scenes

pallid imp
#

we are here to support, something we need to optimize regardless

hidden reef
#

@pallid imp my endpoint was idle because 24GB tier is unusable and I had to change it 48GB tier and scale it down and back up again because editing the endpoint is shit and can't update automatically.

quaint oyster
# hidden reef <@210036719238512640> my endpoint was idle because 24GB tier is unusable and I h...

Yeah, hopefully tho the coming changes that he proposes this week will fix it

#1209973235387474002 message

Definitely is an issue that I think they will work to address, and let's see where it goes. i am glad to see that the hour throttle will drop down to 4 mins to start swapping things around + allow movement with less restrictions so hopefully runpod's algorithm will be a heck lot more proactive

past lance
stuck nymph
#

@pallid imp i just want to thank you for your job and your product. Despite some throttling problems our company really appreciates the desire to fix problems instead of ignoring customers as most support team do

sage osprey
#

I have a few questions here. What exactly is best practice when availability runs low in the region where we have a network volume. Should we keep endpoints active in multiple regions?

#

On a similar note, is there a best practice regarding when to use a network volume and when to bundle models into our image? If we have 20gb of models, should that all just be bundled or should we be using a network volume?

quaint oyster
quaint oyster
#

The number just comes from trial and error anecdotally

#

If you get too high, the download time to serverless initialization becomes impossible. So I find that < 30gb is reasonable first initialization time. Once you start pushing that boundary, I just find it personally a bit weird.

sage osprey
#

Ok, I'll give it a shot. That implies that I could ditch the network volume and use the global region which should help tremendously with availability.

quaint oyster
quaint oyster
#

I also got rid of the EU-CZ-1 region, cause I dont want to get assigned any GPUs from there cause that region got some availability issues around the 24gb pro it seems. Im sure the changes Flash is making will get throttled workers to move around way better, but I rather just not deal with it

#

example what i mean

sage osprey
#

This is helpful, thank you Justin!

weak igloo
#

Still encountering this issue trying to get 4090s as of this afternoon:

hidden reef
#

Yep, all my workers are throttled again too, RunPod serverless is pretty unusable at the moment

#

I even have 2 different endpoints in different regions and they are both throttled

pallid imp
#

4090s are too high in demand right now and more supply will be added in 1-2 weeks

hearty tulip
#

48gbs were all throttled in CA today too.

hidden reef
#

Yes my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB without network storage and thus no region affinity also fully throttled. Serverless is a joke. I'm an enterprise customer but all my endpoints are fully throttled and cannot get support from RunPod so I'm taking my business elsewhere because this is totally unacceptable @pallid imp @past lance @dusk tree

past lance
# hidden reef Yes my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB wi...

Hey, I know not much I can say after the fact can fix past pain, but we have made a few platform releases to improve the throttling in the past day as well as added more capacity (way more coming next week). We've got a lot of customers using serverless and we've experienced a spike in consumption that is just enormous, and we're trying our best to handle it. We apologize for affecting your business and we are trying our best to find a balance between action and messaging.

hearty tulip
#

Still getting throttled constantly. Serverless doesn't seem viable in its current state. Bummer. The tech is cool.

marsh bay
#

It’s insane to me that I’m just getting throttled out of the blue without a heads-up

#

All of my workers just won’t start and every previously working GPU is now unavailable

#

This happened yesterday in EU-SE-1 and now today in EU-NO-1

#

What’s happening? @past lance @pallid imp @dusk tree

hidden reef
#

Looks like RunPod may have fixed something around 3.5 hours ago, all my endpoints' throttled workers seem to have recovered around the same time.

hidden reef
#

Looks like I spoke too soon, they looked better for a short while, now getting throttled again.

marsh bay
#

This sucks

hidden reef
#

I don't understand whats going on though because in NO I have no throttled workers.

amber scroll
#

same issue with throttled workers... personally think RunPod has to scale up at this point ASAP

previously we could get by on just A5000s and only 4090s were in throttling hell... but now, even that is throttled indefinitely

the issue has been happening for several days now, and the obvious solution of "just use 'active workers' " isn't really viable at our small scale, because doing that would be just like paying for the machines directly...

we are running a community supported project

marsh bay
#

The lack of communication is really concerning

digital vigil
#

Same here, running production site. This happened to me before (I moved from US to EU) for availability and now it happened in EU again.

pallid imp
#

we have tweaked the algos but at certain points in the day the spikes eat up all the capacity, we are adding more gpus this week for A5000 and 4090s

hidden reef
#

I think you need to add more network capacity too, too many machines on the same network seems to be causing issues where everyone is experiencing slow speeds, serverless getting connection timed out issues, peoples pods disappearing etc etc.

#

I just had to terminate workers for an endpoint because they were getting stuck for 5 mins on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck eating up all my credits and the job doesn't even get processed, it gets stuck on IN_PROGRESS.

#

My manager has demanded a refund for this because its unacceptable.

amber scroll
#

I just had to terminate workers for an endpoint because they were getting stuck for 5 mins on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck eating up all my credits and the job doesn't even get processed, it gets stuck on IN_PROGRESS.
My manager has demanded a refund for this because its unacceptable.

This also happens to us... we were getting charged for 10+ minutes for a worker that kept "queueing image for pull" and the job was still IN_QUEUE... I was gonna report it but I didn't know if we were actually being charged or if it was just a UI thing

#

we chewed through $3 of credits in ~24 hours when we usually only spend $0.74/day as per our size... and our jobs only took 2-3s

amber scroll
hidden reef
#

@amber scroll are you using latest tag for your Docker image?

amber scroll
#

I don't think it's very much relevant to the issue, but the tag was sm-q, hosted on our private docker registry

hidden reef
amber scroll
#

I guess so-? but runpod caches the images per-datacenter, so that usually just happens in development... which is why we have semver for dev images

#

the image pulls just fine and we use it in prod, the issue is on the worker's side... infinitely "queuing image for pull" and us getting charged for a job that's not even in progress

amber scroll
#

the issue occurred again:

2024-02-25T18:28:18Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:34Z create container ***/serverless-llm:sm-q
2024-02-25T18:28:34Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:51Z create container ***:sm-q
2024-02-25T18:28:51Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:29:07Z create container ***/serverless-llm:sm-q
2024-02-25T18:29:07Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.

it's been doing that for 3 minutes.

#

and we're getting charged for it...

#

so far in the past 30 minutes, runpod has chewed through 10 cents

#

if we calculate how many requests that would've been: 0.1 / (0.00026 * 3), it means we should've been receiving ~128.21 requests in that past 30 minutes
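
The estimate above can be reproduced as a small calculation. Reading 0.00026 as the per-second price in dollars and 3 as the seconds per job is an assumption based on the "our jobs only took 2-3s" remark earlier in the thread; the function name is made up for illustration.

```python
def expected_requests(
    spend_usd: float, price_per_second: float, seconds_per_job: float
) -> float:
    """How many jobs a given spend would cover if only job runtime were billed."""
    return spend_usd / (price_per_second * seconds_per_job)


# $0.10 spent, assumed $0.00026 per second, 3 seconds per job.
estimate = expected_requests(0.10, 0.00026, 3)
print(round(estimate, 2))  # 128.21
```

If the endpoint handled far fewer than ~128 jobs in that window, the remaining spend must have come from time that was billed but not spent executing jobs.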

#

this doesn't look like 128 requests to me:

#

not even close

#

@pallid imp sorry for the direct ping but uh, it's actually chewing through our balance, another 8 cents has just been deducted.

what do we do?

#

2 cents deducted out of nowhere, there are no jobs running across all endpoints

pallid imp
#

its just 1 worker? terminate that for now, ill look into the bug

amber scroll
#

we tried setting max worker count to 8 to try and see if that will improve the delay time... it didn't

pallid imp
#

due to throttled workers?

amber scroll
#

Yupp

pallid imp
#

higher max workers can help but ideally much of the compute is saturated and expansion is already planned this week for some gpus

amber scroll
#

What we're also thinking is that it might be deducting from cancelled jobs

the timer goes up each refresh, and these jobs were previously cancelled due to them taking too long... and our systems just cancelled them to prevent too much usage...

the timeout is set to 120s (queueing included)
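
A minimal sketch of a client-side timeout like the one described here: poll the job, and cancel it once the deadline (queueing included) has passed. `get_status` and `cancel` are hypothetical callables that would wrap the endpoint's status and cancel requests; `IN_QUEUE` and `IN_PROGRESS` appear in this thread, while the terminal status names are assumptions.

```python
import time
from typing import Callable

# Assumed terminal states; IN_QUEUE and IN_PROGRESS count as still pending.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_or_cancel(
    get_status: Callable[[], str],
    cancel: Callable[[], None],
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job finishes, or cancel it when the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            cancel()  # stop the job so it cannot keep running after we give up
            return "CANCELLED"
        sleep(poll_s)
```

The clock and sleep functions are injectable so the loop can be tested without real waiting. Logging the time of each cancellation also makes it easier to compare against the billing history afterwards.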

pallid imp
#

cancelled wont charge once triggered, we stop those workers running the job

amber scroll
#

holy crap

#

I think the best way to go for now is to shutdown our AI chatbot feature until this infrastructure issue is fixed

#

we can't have our contributors' money wasted over runpod's scaling issue

#

if this goes unwatched, who knows how much money it'll siphon out

#

and we aren't certain if we're going to get refunded for this

#

tried contacting sales... welp.

amber scroll
#

currently trying to run a smaller version of our model on 16GB temporarily
1 week of downtime is too big of an impact for us apparently

silver pulsar
#

@amber scroll hey was your issue ever resolved? I looked through my logs and saw a sudden huge spike in credit consumption for just a couple jobs. It looks like the "delay" time it took to even run the job was counted into the actual gpu usage :T

#

I'd like to add it was also on the same dates as your issues. Feb 24/25

amber scroll
# silver pulsar <@181174394536722432> hey was your issue ever resolved? I looked through my logs...

got in touch with sales, they gave back the burned credits based on our 30 day average

Right now we're running the model on 16GB which is a bit more expensive due to the longer inference time (despite being 30% cheaper, the model took 60% longer to produce output)
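
As a quick check of that claim, assuming billing is proportional to runtime and "30% cheaper" refers to the per-second price, the cost per job on the 16GB tier relative to 24GB is the price ratio times the runtime ratio. The function name is made up for illustration.

```python
def relative_cost_per_job(price_ratio: float, time_ratio: float) -> float:
    """Cost per job relative to the baseline tier, with per-second billing."""
    return price_ratio * time_ratio


# 30% cheaper per second (0.70x) but 60% longer per job (1.60x).
ratio = relative_cost_per_job(price_ratio=0.70, time_ratio=1.60)
print(f"{ratio:.2f}")  # 1.12
```

Under those assumptions each job costs about 12% more on 16GB than on 24GB, which matches "a bit more expensive".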

so ideally we should go back to 24GB, but we'll have to wait for RunPod's announcement regarding GPU availability... According to sales:

"It's probably going to be a gradient over time rather than a binary state of being resolved/not resolved since we add more capacity on a weekly/biweekly basis; we do announce big supply adds on Discord when they come through so that's probably the best way to keep updated"

which is their answer when I asked "if/when the issue would get resolved"

silver pulsar
amber scroll
#

Still not fully resolved but at least they refunded the credits xd

silver pulsar
#

Job execution times are normal, but the delay time caused a huge spike in credit consumption :[

#

Good to hear they refunded your side. Hoping for the same

hidden reef
#

How do you contact sales? I need to contact them for a refund too..

silver pulsar
dusk tree
#

Hey @amber scroll @silver pulsar @hidden reef
I onboarded a huge load of hardware. However, the minimum RunPod should be able to do, is provide high quality communication, which I see wasn't ideal.

Zhen, Pardeep, Justin and me have been pushing hard on at least 5 different features to make Serverless much better at managing huge loads. Secondly, we hired 3 support staff, 2 cloud engineers, and looking for more support engineers as well. Communications must improve; and it will, trust me.

That being said, we value relationship above all else. All else. Hit me up in private and we will provide compensation for you.

silver pulsar
amber scroll
#

moved into DMs