All 27 workers throttled
Our company needs stable availability of at least 10 workers. Recently most of our workers, or even all of them, are throttled. We have already spent more than $800-1000 on your service and would be very grateful for a stable number of the requested workers. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...
Does your endpoint use network storage in RO region?
Network is in EU-CZ-1
Our company would be very grateful for a solution. Availability has stayed the same for the last few days. Due to the huge waiting times we are losing money. We were thinking of slowly increasing the count to 30+, but right now we can't even get 5 stable workers.
Yeah looks like its basically a no-go in that region, you may want to consider setting up a new endpoint in either EU-SE-1 or EU-NO-1 regions. I had this same issue with EU-RO-1 and had to create a new endpoint.
The thing is that the network volume itself doesn't allow other regions, even if I deploy the endpoint to any location
Yeah I created a new network volume as well.
Its very inconvenient but better than having down time and losing money.
https://discord.com/channels/912829806415085598/1194711850223415348
You can refer to this for how to copy data over, in case downloading it from some other source isn't an option
https://discord.com/channels/912829806415085598/1209602115262095420
This was also something we gave as feedback to @flash-singh. Sadly, the fact that serverless workers can get fully throttled across the board in a region is something I find frustrating / insane too
Yeah it shouldn't happen that every single worker becomes throttled and brings down our production applications.
How often does this problem happen? We recently moved to serverless instead of GPU cloud, but the experience has been quite disappointing so far
Just wondering, how big are your models?
about 3gb, one model
Happens A LOT. Happened to me at least 3 or 4 times in the last 6 months.
probably even smaller
I think for the 4090s the 24gb Pro, it happens a decent amount. I try to avoid it and go 24gb + 48gb gpu.
Also if your model is only 3gb
build it into the image instead
You'll get way more flexibility
and less of this issue - I don't have problems with those endpoints with 10+ workers
anything that is < 35gb
I build into my image
if it doesn't need dynamic switching
Already using 24 + 24 pro. Where can i find more info about this method?
All 24GB PRO in RO are gone, that's why all my workers in RO are throttled. In a matter of WEEKS it went from high availability for 4090s to nothing, and all my workers throttled
And how long does it take to be resolved on average?
When you select, select 1 on the 48pros, and 2 as the 24gb.
Also, if you build the model into the image and get off network storage, you'll be able to use all data centers, not just the ones tied to the network volume
Weeks, months, I move to a new endpoint
I saw someone recently, @kopyl, who was throttled for an hour. So I suggest in your situation, move to building the model into the image, and it shouldn't be an issue
48GB PRO is low availability, I don't recommend it
The thing is i am using automatic1111 + custom model + LORAs
Same here
I'm just sharing what I have - I get high availability on 16gb and 48 Pro, at least for me, with no network region
dockerhub lets u have one private repo
that's what i do for my private stuff
unless u have more stuff
It's always the 4090s that bottleneck me
WTF shows LOW for me without a network volume
So you manually push volumes to dockerhub and build from image directly?
u could be right ashleyk, just found out I'm throttled across the board
No not push volumes to dockerhub
You can just do a function call in your Dockerfile to download the model
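Something like this, for example - a rough sketch where the URL and paths are placeholders (not anything from this thread); the Dockerfile just COPYs the script in and RUNs it at build time so the weights end up baked into the image layer:

```python
# download_model.py - sketch of baking a ~3 GB model into the Docker image at
# build time instead of reading it from a network volume.
# Invoked from the Dockerfile, e.g.:
#   COPY download_model.py /download_model.py
#   RUN python /download_model.py
import os
import urllib.request

MODEL_URL = "https://example.com/your-model.safetensors"  # hypothetical source URL
DEST = "/workspace/models/your-model.safetensors"         # path your handler loads from

os.makedirs(os.path.dirname(DEST), exist_ok=True)
# Stream the file to disk so the build doesn't hold the whole model in memory.
with urllib.request.urlopen(MODEL_URL) as resp, open(DEST, "wb") as f:
    while True:
        chunk = resp.read(1 << 20)  # 1 MiB at a time
        if not chunk:
            break
        f.write(chunk)
```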
Maybe became medium availability for a brief moment, workers are constantly moving around
this is so frustrating)))
ok i see wym
Thank you!
yea i asked flash about this before, and it's b/c someone can just eat up all the gpus for their super big clients. Something I'm debating on is, if I get fully throttled across the board, I use their graphql endpoint
to set a minimum of 2 active workers
to steal back workers
GitHub - justinwlin/runpod-api: A collection of Python scripts for calling the RunPod GraphQL API
@ashleyk got a repo on that
It isnt an instant switch
but better than getting fully throttled
it seems to respect minimum workers
and prioritize it
And I will be able to use all data centers? Will the problem be resolved, or does it still happen sometimes even with the bigger number of data centers?
You'll be able to use all data centers and not be locked to a region
I think the problem will happen more rarely. @flash-singh supposedly has said that if a worker is throttled for an hour, it terminates and switches it out, but that is crazy to me - why would it allow us to fall into an all-workers-throttled situation? Also I'm not sure that really happens, to be honest
so I recommend maybe exploring the minimum-worker force scenario, b/c I ping the /health on my endpoint routinely
an ex of me pulling a minimum of 2 workers now
to forcefully get my workers back
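Roughly what I have in mind - a sketch, not official RunPod tooling; the /health response shape and the saveEndpoint input follow the graphql-spec / runpod-api scripts linked above, so treat the exact field names and the placeholder values as assumptions to verify:

```python
# force_min_workers.py - sketch of the "steal back workers" idea: poll the
# serverless /health endpoint and, if jobs are queued but nothing is running,
# temporarily raise workersMin via the GraphQL API.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "qie98s97wqvw4t"  # example endpoint id from this thread

def health():
    r = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()  # expected shape: {"workers": {...}, "jobs": {...}}

def set_min_workers(n: int):
    # saveEndpoint wants the rest of the endpoint config back as well (name,
    # gpuIds, templateId, ...); the placeholders must match your real endpoint.
    mutation = """
    mutation SetMin($input: EndpointInput!) {
      saveEndpoint(input: $input) { id workersMin workersMax }
    }"""
    variables = {"input": {
        "id": ENDPOINT_ID,
        "name": "my-endpoint",        # placeholder
        "gpuIds": "AMPERE_24",        # placeholder
        "templateId": "my-template",  # placeholder
        "workersMin": n,
        "workersMax": 12,
    }}
    r = requests.post(
        f"https://api.runpod.io/graphql?api_key={API_KEY}",
        json={"query": mutation, "variables": variables},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    h = health()
    queued = h.get("jobs", {}).get("inQueue", 0)
    running = h.get("workers", {}).get("running", 0)
    # If work is queued but nothing is running, assume we're throttled across
    # the board and force a couple of active workers to pull capacity back.
    if queued > 0 and running == 0:
        print(set_min_workers(2))
```

(and remember to set workersMin back to 0 afterwards, otherwise you keep paying for active workers)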
maybe make ur numbers look like this
4090s are always eaten up, so should prob be the #3
or whatever the lowest number is
tbh idk what the numbers even do, which i complained about too
are you mostly looking for A5000s and 24gb mostly?
yes
EU-SE-1 is the best for that, EU-CZ-1 always has low quantity of those, and 3090s are always taken, were you looking for 3090s?
are you able to move storage?
we look for 24gb GPUs, the GPU model does not matter. I guess I can make a new storage volume in a different data center
you can either make a new endpoint, or switch your current one to use EU-SE-1, currently that one has the biggest capacity for 48gb and 24gb and 16gb but they do not have 4090s
B/c his model is 3gb, I think it's better to just build it into the docker image in situations like that, right? Then he wouldn't be limited to a region?
and he can also just take out EU-CZ-1 from his region list
so he doesnt get assigned any there?
yes i would never use a network volume if you're running 1 static model
ty! will try the method above
yep pick global and it will automatically pick most available servers across all regions
EU-SE-1 has plenty of capacity but its also newer compared to most of other ones
do u guys plan to make a chart or something detailing this information at some point
or do we only have to get this anecdotally
tbh im not using anything special, i just go click EU-SE-1 and see they're all high
but yes we do need to get better at showing availability, we also have a bug with network storage tab showing you wrong availability, we are working on fixing that this week
i def understand the frustration, it causes us stress as well, but solving scale for GPUs is more complicated and requires big investment, we are trying to push in all directions to be better at this
yeah still thx u runpod for making gpu / ml saas businesses a whole lot easier lol
still many pain points as you can see, getting there by the day
By the way not using network storage doesn't even help, this endpoint of mine doesn't use any network storage and almost all my workers are throttled, this is a serious problem with 24GB GPU, basically zero availability anywhere.
Massive problem, we have a stand at the PBX Expo in Las Vegas and this is impacting our product demonstrations
CC: @JM
I don't understand, because if I edit my endpoint, it says "High Availability" for 24GB yet basically all my workers are throttled.
Not sure if this helps / u prob already did it, but I had to reset my max workers to 0 and then back to 12, and kick out EU-CZ-1 so I don't get assigned any GPUs from that region. I think the big problem with RunPod's workers right now is that they seem to only stay on the first assigned GPU, and I had the same experience where after editing my endpoints I was still fully throttled until I forcefully refreshed all the workers.
Edit: could setting minimum workers temporarily, while the stand is active, relieve the issue? x.x..
/ @JM / @flash-singh hopefully can chime in tho .-. i also am confused what the best steps are in these situations; if we edit the endpoint do we need to refresh all the workers? what is the expected procedure..
Wow thats a major fail, if all my workers end up in CZ and get throttled, it should pick workers from somewhere else
Good question, changing priority made zero difference, I had to scale workers down to zero and back up again which sucks
Totally agree extremely frustrating
I moved all my endpoints to kick cz-1 out so im not assigned a bad region cause the priority algorithm rlly is bad and seems to do nothing
I changed all my endpoints from 24GB to 48GB, 24GB tier is totally and utterly fucked up and completely unusable and nice how nobody from RunPod bothers to fucking respond when we have a fucking PRODUCTION ISSUE. THIS IS TOTALLY UNACCEPTABLE!!!!!!!!!!!!!!!!!!!!!!!!
I am looking for a new provider in the morning, RunPod is utter shit if you can't get support.
cc @Zeen
https://discord.com/channels/912829806415085598/1209973235387474002
I agree, you guys need to change the priority algorithm, to something similar to my feedback. It at least needs to be visibly proactive trying to find workers, and start shifting at least two-three workers immediately out of throttle after like 5-10 seconds rather than letting it sit. Again, I have zero clue how the priority algorithm works, but we can't optimize anything to Runpod's specification cause there is nothing for us to specify. Honestly I'd even write my own priority algorithm if I could.
can you share endpoint id?
that seems like a bug
Ill let @ashleyk ping his endpoint when he can, but b/c I experienced it too:
qie98s97wqvw4t
This one is mine. Ik ashleyk's is more production critical, but it seems like a bug with the priority algorithm then, if me / him are both able to get fully throttled. I mean it's fixed following the steps I said - reset max workers to 0, shift my priorities around, kick CZ out - but I just wonder why I need to do this manually and scale all my workers to 0 myself, rather than the priority algorithm handling it for me.
Also, if editing the workers is sensed and updated, it should really try to recalculate all the throttled workers and begin shifting them over if there is availability. I think that is why ashleyk / I were confused when editing our endpoint and nothing happened.
i see all 21 workers are idle, so whats likely happening is there is a huge spike of work which takes many gpus, and that slows down
U said the throttle is switched out every hour before, is it possible to move 2-3 of them actively before that hour is hit? Also I think its b/c he refreshed all his workers
https://discord.com/channels/912829806415085598/1209942179527663667/1209970269108707398
Where he had to scale them all to zero and back
we will have to optimize that further but right now a huge spike will cause throttle and that will wind down after few mins
this is showing all idle now
I think this is a bug then, its not a few mins
Yeah it is
b/c he changed it
but he obvs had the conversation going longer than 3 mins
maybe ashleyk can share his graph at a closer time scale but im sure he got fully throttled
got it, so he must have reset the workers
oh i do see a throttle spike, then an init spike, so he must have reset it
Yeah, I guess, then my question is this a bug with the priority algorithm?
What do u mean reset?
set max to 0
Okay, so there's no way to do this automatically?
it's not a bug as much as priority algo isn't good
we do it automatically but it occurs hourly, will need to optimize that
we're thinking to just allow users to set a quota per gpu type in addition to assigning launch priority
what happened in the past few days is that a few of our larger customers flexed up 600+ serverless workers
Is it possible to guarantee like a 2 worker minimum to do it immediately? I think that would even fix the current issues
And also, if someone manually changes it, could it start searching for new gpus if any are throttled?
I guess the problem is that ashleyk had to manually scale to 0 in a production env
if we could even scale down to half and scale back up
that be nice
yeah have to optimize that to take these conditions into account
I see, i guess my next question is it possible for me to terminate workers through the graphql endpoint?
https://graphql-spec.runpod.io/#definition-PodStatus
Cause I want to write a script on my server to force minimum workers or terminate throttled workers if I have jobs in the queue, and I need it to be more proactive
Do I treat it like a pod?
yes its similar, i plan to optimize this either way
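For anyone else reading, this is roughly what I'd try - a sketch only; podTerminate is in the public GraphQL spec, but treating a serverless worker id like a pod id is exactly the assumption being confirmed above, so verify it against your own workers:

```python
# terminate_worker.py - sketch of killing a stuck/throttled serverless worker
# via the GraphQL API, treating the worker id like a pod id.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]

def terminate_worker(worker_id: str):
    query = """
    mutation Terminate($input: PodTerminateInput!) {
      podTerminate(input: $input)
    }"""
    r = requests.post(
        f"https://api.runpod.io/graphql?api_key={API_KEY}",
        json={"query": query, "variables": {"input": {"podId": worker_id}}},
        timeout=10,
    )
    r.raise_for_status()
    return r.json()

# usage: terminate_worker("<worker id from the endpoint's workers tab>")
```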
Yeah, i guess do u have a rough estimate for when it will be optimized?
i guess im looking into it cause I want to start feeding more requests to my LLM / stuff soon, but depending on the time frame Ill just write the script to dynamically set minimum workers if i have to
thank u tho, appreciate that the priority algorithm can be looked into / optimized, and hopefully what it's doing can be shared at some point after it's reoptimized. I guess the fact that it sits throttled for an hour is not a very well known fact.
whats your endpoint id? let me check logs for it
I mean its not an issue for me,
qie98s97wqvw4t
b/c im not in a production env like ashleyk is, im just setting it up so that I can start testing and moving my whole pipeline through, cause I was relying on ChatGPT and it was costing too much. But I commented in b/c when this conversation started, I wanted to share how not using a network volume could give u better availability:
https://discord.com/channels/912829806415085598/1209942179527663667/1209946232131297320
I myself was throttled across the board in my to-be example of why you shouldn't rely on network storage - but honestly, ive posted about this multiple times in the past too, and i guess as zeen said u guys have experienced an insane uptick in the last 3 days
planning on releasing optimizations tomorrow, have to tweak the knobs carefully otherwise it causes network issues
Great, I'm glad. If those optimizations end up being released, do you think you can tell us what they end up being? So we know what the changes are?
Thank you
https://discord.com/channels/912829806415085598/1209973235387474002
Again, I think the biggest issue for @ashleyk (and honestly even anyone else who would be using runpod in production), and why it wouldn't be taken seriously, is b/c if u are fully throttled across the board and have no options to fix availability, that really is the worst nightmare.
ill share what i can here
thanks!
sorry for hammering u guys so much, know there is a lot behind the scenes
we are here to support, something we need to optimize regardless
So basically what you are saying is that money is more important to RunPod than providing a stable service to all customers, and that RunPod can increase the number of workers for larger customers to such an extent that it takes down the endpoints of all other customers?
@flash-singh my endpoint was idle because 24GB tier is unusable and I had to change it 48GB tier and scale it down and back up again because editing the endpoint is shit and can't update automatically.
Yeah, hopefully tho the coming changes that he proposes this week will fix it
https://discord.com/channels/912829806415085598/1209973235387474002/1210002895781625907
Definitely is an issue that I think they will work to address, and let's see where it goes. i am glad to see that the hour throttle will drop down to 4 mins to start swapping things around + allow movement with less restrictions so hopefully runpod's algorithm will be a heck lot more proactive
No we had an internal discussion and all agreed that the quota shouldn't have been increased in this case.
@flash-singh i just want to thank you for your work and your product. Despite some throttling problems our company really appreciates the desire to fix problems instead of ignoring customers as most support teams do
I have a few questions here. What exactly is best practice when availability runs low in the region where we have a network volume? Should we keep endpoints active in multiple regions?
On a similar note, is there a best practice regarding when to use a network volume and when to bundle models into our image? If we have 20gb of models, should that all just be bundled or should we be using a network volume?
I think this should be bundled, tbh. I find < 30gb for the compressed image shown on dockerhub quite safe, this is an example of my Mistral one.
https://hub.docker.com/layers/justinwlin/mistral7b_openllm/latest/images/sha256-47f901971ee95cd0d762fe244c4dd625a8bf7a0e0142e5bbd91ee76f61c8b6ef?context=repo
Haha, I saw you respond in the different thread, but Ill continue to answer here
The number just comes from trial and error anecdotally
If the image gets too big, the download time for serverless initialization becomes impossible. So I find that < 30gb gives a reasonable first initialization time. Once you start pushing that boundary, it personally just feels a bit weird to me.
Ok, I'll give it a shot. That implies that I could ditch the network volume and use the global region which should help tremendously with availability.
The runpod base image is what I tend to use, so there is some size cost there, but if you want to optimize it to the core, I saved maybe 1-2 gb by not using runpod-pytorch as a starting point.
https://github.com/justinwlin/runpodWhisperx/blob/master/Dockerfile
But tbh, nowadays i just end up building on it cause it saves me a lot of headache:
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/Dockerfile
Yeah, u wont get locked in per region. Also another thing is the priorities do matter. It tries to assign u a lot of whatever you put as (1) when you first initialize, I try to put a (1) priority on 24gb, or 48gb, but not on 24gb pro.
The 24gb pro and 48gb are very similar in cost, but the 24gb pro just isn't worth the headaches it gives
I also got rid of the EU-CZ-1 region, cause I dont want to get assigned any GPUs from there - that region has some availability issues around the 24gb pro it seems. Im sure the changes Flash is making will get throttled workers to move around way better, but I'd rather just not deal with it
example what i mean
This is helpful, thank you Justin!
Still encountering this issue trying to get 4090s as of this afternoon:
Yep, all my workers are throttled again too, RunPod serverless is pretty unusable at the moment
I even have 2 different endpoints in different regions and they are both throttled
4090s are too high in demand right now and more supply will be added in 1-2 weeks
48gbs were all throttled in CA today too.
Yes my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB without network storage and thus no region affinity also fully throttled. Serverless is a joke. I'm an enterprise customer but all my endpoints are fully throttled and cannot get support from RunPod so I'm taking my business elsewhere because this is totally unacceptable @flash-singh @Zeen @JM
Hey, I know not much I can say after the fact can fix past pain, but we have made a few platform releases to improve the throttling in the past day and added more capacity (way more coming next week). We've got a lot of customers using serverless and we've experienced a spike in consumption that is just enormous, and we're trying our best to handle it. We apologize for affecting your business and we are trying our best to find a balance between action and messaging.
Still getting throttled constantly. Serverless doesn't seem viable in its current state. Bummer. The tech is cool.
It's insane to me that I'm just getting throttled out of the blue without a heads-up
All of my workers just won't start and every previously working GPU is now unavailable
This happened yesterday in EU-SE-1 and now today in EU-NO-1
What's happening? @Zeen @flash-singh @JM
Looks like RunPod may have fixed something around 3.5 hours ago; all my endpoints' throttled workers seem to have recovered around the same time.
Looks like I spoke too soon, they looked better for a short while, now getting throttled again.
This sucks
Basically no GPUs available in NO, SE has some 16GB and 24GB
SE
NO
I don't understand whats going on though because in NO I have no throttled workers.
same issue with throttled workers... personally think RunPod has to scale up at this point ASAP
previously we could get by with just using A5000s and only 4090s were in throttling hell... but now, even that is throttled indefinitely
the issue has been happening for several days now, and the obvious solution of "just use 'active workers' " isn't really viable at our small scale, because doing that would be just like paying for the machines directly...
we are running a community supported project
The lack of communication is really concerning
Same here, running production site. This happened to me before (I moved from US to EU) for availability and now it happened in EU again.
we have tweaked the algos but at certain points in the day the spikes eat up all the capacity, we are adding more gpus this week for A5000 and 4090s
I think you need to add more network capacity too, too many machines on the same network seem to be causing issues where everyone is experiencing slow speeds, serverless getting connection timed out issues, people's pods disappearing etc etc.
I just had to terminate workers for an endpoint because they were getting stuck for 5 mins on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck, eating up all my credits, and the job doesn't even get processed, it gets stuck on IN_PROGRESS.
My manager has demanded a refund for this because its unacceptable.
This also happens to us... we were getting charged for 10+ minutes for a worker that kept "queueing image for pull" and the job was still IN_QUEUE... I was gonna report it but I didn't know if we were actually being charged or if it was just a UI thing
we chewed through $3 of credits in ~24 hours when we usually only spend $0.74/day as per our size... and our jobs only took 2-3s
it actually happened twice, and that was when I was there to see it... so it's definitely been doing that multiple times per hour
@marshall are you using the latest tag for your Docker image?
we have our own tagging system that tags images based on the commit message
I don't think it's very much relevant to the issue, but the tag was sm-q, hosted on our private docker registry
Is it possible to push a new image to the same tag?
I guess so-? but runpod caches the images per-datacenter, so that usually just happens in development... which is why we have semver for dev images
the image pulls just fine and we use it in prod, the issue is on the worker's side... infinitely "queuing image for pull" and us getting charged for a job that's not even in progress
the issue occurred again:
it's been doing that for 3 minutes.
and we're getting charged for it...
so far in the past 30 minutes, runpod has chewed through 10 cents
if we calculate how many requests that would've been: 0.1 / (0.00026 * 3), it means we should've been receiving ~128.21 requests in that past 30 minutes
this doesn't look like 128 requests to me:
not even close
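(Spelling that arithmetic out, under the assumption that ~$0.00026/s is this worker size's per-second rate and each job takes ~3 s: $0.00026 × 3 ≈ $0.00078 per job, so $0.10 / $0.00078 ≈ 128 jobs in those 30 minutes.)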
@flash-singh sorry for the direct ping but uh, it's actually chewing through our balance, another 8 cents has just been deducted.
what do we do?
2 cents deducted out of nowhere, there are no jobs running across all endpoints
its just 1 worker? terminate that for now, ill look into the bug
we tried setting max worker count to 8 to try and see if that will improve the delay time... it didn't
due to throttled workers?
Yupp
higher max workers can help but ideally much of the compute is saturated and expansion is already planned this week for some gpus
What we're also thinking is that it might be deducting from cancelled jobs
the timer goes up each refresh, and these jobs were previously cancelled due to them taking too long... and our systems just cancelled them to prevent too much usage...
the timeout is set to 120s (queueing included)
cancelled wont charge once triggered, we stop those workers running the job
holy crap
I think the best way to go for now is to shutdown our AI chatbot feature until this infrastructure issue is fixed
we can't have our contributors' money wasted over runpod's scaling issue
if this goes unwatched, who knows how much money it'll siphon out
and we aren't certain if we're going to get refunded for this
tried contacting sales... welp.
currently trying to run a smaller version of our model on 16GB temporarily
1 week of downtime is too big of an impact for us apparently
@marshall hey was your issue ever resolved? I looked through my logs and saw a sudden huge spike in credit consumption for just a couple jobs. It looks like the "delay" time it took to even run the job was counted into the actual gpu usage :T
I'd like to add it was also on the same dates as your issues. Feb 24/25
got in touch with sales, they gave back the burned credits based on our 30 day average
Right now we're running the model on 16GB which is a bit more expensive due to the longer inference time (despite being 30% cheaper, the model took 60% longer to produce output)
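(Rough check of that, assuming per-second billing: 0.70 × 1.60 ≈ 1.12, so the cheaper 16GB tier works out to roughly 12% more per job once the longer inference time is factored in.)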
so ideally we should go back to 24GB, but we'll have to wait for RunPod's announcement regarding GPU availability... According to sales:
"It's probably going to be a gradient over time rather than a binary state of being resolved/not resolved since we add more capacity on a weekly/biweekly basis; we do announce big supply adds on Discord when they come through so that's probably the best way to keep updated"which is their answer when I asked "if/when the issue would get resolved"
Thanks a ton for the response! I contacted them directly as well for now. Good to hear your side got (mostly? kinda?) resolved :]
Still not fully resolved but at least they refunded the credits xd
Job execution times are normal, but the delay time caused a huge spike in credit consumption :[
Good to hear they refunded your side. Hoping for the same
How do you contact sales? I need to contact them for a refund too..
I used their chat on their site. It's in the lower bottom right
Hey @marshall @HyS | The World of Ylvera @ashleyk
I onboarded a huge load of hardware. However, the minimum RunPod should be able to do, is provide high quality communication, which I see wasn't ideal.
Zhen, Pardeep, Justin and I have been pushing hard on at least 5 different features to make Serverless much better at managing huge loads. Secondly, we hired 3 support staff and 2 cloud engineers, and are looking for more support engineers as well. Communication must improve, and it will, trust me.
That being said, we value relationship above all else. All else. Hit me up in private and we will provide compensation for you.
That's a great resolution!
For now I dmed you. Thank you for the ping!
moved into DMs
Sure, thanks both!