RunPodβ€’5mo ago
Jidovenok

All 27 workers throttled

Our company needs stable availability of a minimum of 10 workers. Quite recently most or even all workers have been throttled. We have already spent more than $800-1000 on your service and would be very grateful if there were a stable number of requested workers. IDs: 6lxilvs3rj0fl7, 97atmaayuoyhls. Our customers have to wait for hours...
153 Replies
ashleyk
ashleykβ€’5mo ago
Does your endpoint use network storage in RO region?
Jidovenok
Jidovenokβ€’5mo ago
Network is in EU-CZ-1
Jidovenok
Jidovenokβ€’5mo ago
Our company would be very grateful for a solution. The availability has stayed the same for the last few days. Due to the huge waiting times we are losing money 😦 We were thinking of slowly increasing the worker count to 30+, but now we can't even get 5 stable workers 😦
No description
ashleyk
ashleykβ€’5mo ago
Yeah, looks like it's basically a no-go in that region; you may want to consider setting up a new endpoint in either the EU-SE-1 or EU-NO-1 region. I had this same issue with EU-RO-1 and had to create a new endpoint.
Jidovenok
Jidovenokβ€’5mo ago
The thing is that the network volume itself doesn't allow other regions, even if I deploy the endpoint to any location.
ashleyk
ashleykβ€’5mo ago
Yeah, I created a new network volume as well. It's very inconvenient, but better than having downtime and losing money.
justin
justinβ€’5mo ago
https://discord.com/channels/912829806415085598/1194711850223415348 You can refer to this for how to copy data over in case downloading it from some other source isn't an option. https://discord.com/channels/912829806415085598/1209602115262095420 This was also something we gave as feedback to @flash-singh. Sadly, the fact that serverless workers can get fully throttled across the board in a region is something I find frustrating / insane too.
ashleyk
ashleykβ€’5mo ago
Yeah it shouldn't happen that every single worker becomes throttled and brings down our production applications.
Jidovenok
Jidovenokβ€’5mo ago
How often does this problem happen? We recently moved to serverless from GPU Cloud, but the experience is quite sad so far.
justin
justinβ€’5mo ago
Just wondering, how big are your models?
Jidovenok
Jidovenokβ€’5mo ago
about 3gb, one model
ashleyk
ashleykβ€’5mo ago
Happens A LOT. Happened to me at least 3 or 4 times in the last 6 months.
Jidovenok
Jidovenokβ€’5mo ago
probably even smaller
justin
justinβ€’5mo ago
I think for the 4090s (the 24GB Pro tier) it happens a decent amount. I try to avoid it and go 24GB + 48GB GPUs. Also, if your model is only 3GB, build it into the image instead. You'll get way more flexibility and less of this issue; I don't have problems with those endpoints even with 10+ workers. Anything that is < 35GB I build into my image if it doesn't need dynamic switching.
Jidovenok
Jidovenokβ€’5mo ago
Already using 24 + 24 Pro. Where can I find more info about this method?
ashleyk
ashleykβ€’5mo ago
All 24GB PRO in RO are gone, that's why all my workers in RO are throttled. In a matter of WEEKS it went from high availability for 4090s to nothing, with all my workers throttled.
Jidovenok
Jidovenokβ€’5mo ago
And how long does it take to be resolved on average?
justin
justinβ€’5mo ago
When you select GPU priorities, set 1 on the 48GB Pros and 2 on the 24GB. Also, if you build the model into the image and get off network storage, you'll be able to use all data centers, not just the ones tied to your network volume.
ashleyk
ashleykβ€’5mo ago
Weeks, months, I move to a new endpoint
justin
justinβ€’5mo ago
I saw someone recently (@kopyl) who was throttled for an hour. So I suggest, in your situation, moving to building the model into the image, and it shouldn't be an issue.
ashleyk
ashleykβ€’5mo ago
48GB PRO is low availability, I don't recommend it.
Jidovenok
Jidovenokβ€’5mo ago
The thing is I am using Automatic1111 + a custom model + LoRAs
ashleyk
ashleykβ€’5mo ago
Same here
justin
justinβ€’5mo ago
I'm just sharing what I have: I get high availability on 16GB and 48GB Pro, at least for me with no network volume region.
No description
justin
justinβ€’5mo ago
Docker Hub lets you have one private repo, which is what I do for my private stuff, unless you have more. It's always the 4090s that bottleneck me.
ashleyk
ashleykβ€’5mo ago
WTF shows LOW for me without a network volume
No description
Jidovenok
Jidovenokβ€’5mo ago
So you manually push the volume contents to Docker Hub and build from the image directly?
justin
justinβ€’5mo ago
You could be right ashleyk, I just found out I'm throttled across the board.
No description
justin
justinβ€’5mo ago
No, not push volumes to Docker Hub. You can just do a function call in your Dockerfile to download the model at build time.
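Roughly like this (a minimal sketch, not my exact setup; the script name, model URL and target path are placeholders you'd swap for your own model and A1111 layout):

```python
# download_model.py - hypothetical build-time script; the URL and target path
# are placeholders, adjust them to wherever your handler expects the checkpoint.
import os
import urllib.request

MODEL_URL = "https://example.com/my-model.safetensors"  # placeholder
TARGET_DIR = "/stable-diffusion-webui/models/Stable-diffusion"  # assumed A1111 layout

def download_model(url: str = MODEL_URL, target_dir: str = TARGET_DIR) -> str:
    """Fetch the checkpoint once at image build time so workers never need network storage."""
    os.makedirs(target_dir, exist_ok=True)
    dest = os.path.join(target_dir, os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

if __name__ == "__main__":
    print(f"Model baked into image at: {download_model()}")

# In the Dockerfile you'd run it during the build, e.g.:
#   COPY download_model.py /download_model.py
#   RUN python /download_model.py
```

That way the weights live in an image layer and any data center can pull them.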
ashleyk
ashleykβ€’5mo ago
Maybe it became medium availability for a brief moment; workers are constantly moving around.
Jidovenok
Jidovenokβ€’5mo ago
this is so frustrating)))
No description
Jidovenok
Jidovenokβ€’5mo ago
Ok, I see what you mean. Thank you!
justin
justinβ€’5mo ago
Yeah, I asked flash about this before, and it's because someone can just eat up all the GPUs for their super big clients. Something I'm debating is, if I get fully throttled across the board, using their GraphQL endpoint to set a minimum of 2 active workers to steal back workers.
justin
justinβ€’5mo ago
GitHub
GitHub - justinwlin/runpod-api: A collection of Python scripts for...
A collection of Python scripts for calling the RunPod GraphQL API - justinwlin/runpod-api
justin
justinβ€’5mo ago
@ashleyk I've got a repo on that. It isn't an instant switch, but it's better than getting fully throttled; it seems to respect minimum workers and prioritize them.
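The gist is a small script that hits the GraphQL API and bumps workersMin on the endpoint. Rough sketch only; the saveEndpoint mutation and its field names are from memory, so double check them against the GraphQL spec / the repo above before relying on it:

```python
# Rough sketch: bump the minimum workers on an endpoint via the RunPod GraphQL API.
# Assumption: a saveEndpoint mutation taking an EndpointInput with workersMin;
# verify the exact names and required fields against https://graphql-spec.runpod.io.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
GRAPHQL_URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

def set_min_workers(endpoint_id: str, workers_min: int) -> dict:
    """Force a minimum number of active workers to 'steal back' capacity when throttled."""
    query = """
    mutation($input: EndpointInput!) {
        saveEndpoint(input: $input) { id workersMin workersMax }
    }
    """
    variables = {"input": {"id": endpoint_id, "workersMin": workers_min}}
    resp = requests.post(GRAPHQL_URL, json={"query": query, "variables": variables}, timeout=30)
    resp.raise_for_status()
    return resp.json()

# e.g. set_min_workers("qie98s97wqvw4t", 2)  # and back to 0 once workers recover
```

The same kind of call should also let you drop workersMax to 0 and back to refresh stuck workers, instead of clicking through the UI.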
Jidovenok
Jidovenokβ€’5mo ago
And will I be able to use all data centers? Will the problem be resolved, or does it still happen sometimes even with a bigger number of data centers?
justin
justinβ€’5mo ago
You'll be able to use all data centers and not be locked to a region, and I think the problem will happen more rarely. @flash-singh supposedly has said that if a worker is throttled for an hour, it gets terminated and switched out, but that is crazy to me; why would it allow us to fall into an all-workers-throttled situation? Also, I'm not sure that really happens, to be honest, so I recommend maybe exploring the minimum-worker force scenario, because I ping the /health route on my endpoint routinely.
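For the health check side, something like this (a sketch; I'm assuming the /health response includes a "workers" object with per-state counts, so check the actual payload for your endpoint before depending on specific keys):

```python
# Sketch of polling the serverless /health route to spot a fully throttled endpoint.
# Assumption: the JSON response has a "workers" object with counts like "idle"/"running";
# inspect the real payload for your endpoint before relying on exact keys.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]

def get_worker_stats(endpoint_id: str) -> dict:
    url = f"https://api.runpod.ai/v2/{endpoint_id}/health"
    resp = requests.get(url, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("workers", {})

stats = get_worker_stats("qie98s97wqvw4t")  # my endpoint id from above
if stats and stats.get("idle", 0) + stats.get("running", 0) == 0:
    print("No usable workers, probably throttled:", stats)
```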
justin
justinβ€’5mo ago
No description
justin
justinβ€’5mo ago
an example of me forcing a minimum of 2 workers now to get my workers back
justin
justinβ€’5mo ago
maybe make your numbers look like this
No description
justin
justinβ€’5mo ago
4090s are always eaten up, so they should probably be #3 or whatever the lowest priority is. TBH I don't know what the numbers even do 🤷🤷🤷 which I complained about too.
flash-singh
flash-singhβ€’5mo ago
are you mostly looking for A5000s and 24GB?
Jidovenok
Jidovenokβ€’5mo ago
yes
flash-singh
flash-singhβ€’5mo ago
EU-SE-1 is the best for that; EU-CZ-1 always has a low quantity of those, and 3090s are always taken. Were you looking for 3090s? Are you able to move storage?
Jidovenok
Jidovenokβ€’5mo ago
We're looking for a 24GB GPU; the exact GPU model does not matter. I guess I can make a new storage volume in a different data center.
flash-singh
flash-singhβ€’5mo ago
You can either make a new endpoint or switch your current one to use EU-SE-1; currently that one has the biggest capacity for 48GB, 24GB and 16GB, but it does not have 4090s.
justin
justinβ€’5mo ago
Because his model is 3GB, I think it's better to just build it into the Docker image in situations like that, right? Then he wouldn't be limited to a region, and he can also just take EU-CZ-1 out of his region list so he doesn't get assigned any workers there?
flash-singh
flash-singhβ€’5mo ago
Yes, I would never use a network volume if you're running 1 static model.
Jidovenok
Jidovenokβ€’5mo ago
ty! will try the method above
flash-singh
flash-singhβ€’5mo ago
Yep, pick global and it will automatically pick the most available servers across all regions. EU-SE-1 has plenty of capacity, but it's also newer compared to most of the other ones.
justin
justinβ€’5mo ago
Do you guys plan to make a chart or something detailing this information at some point 😦 😅 😔😔😔 or do we only get this anecdotally?
flash-singh
flash-singhβ€’5mo ago
TBH I'm not using anything special; I just click EU-SE-1 and see they're all high. But yes, we do need to get better at showing availability; we also have a bug with the network storage tab showing you the wrong availability, and we are working on fixing that this week. I definitely understand the frustration, it causes us stress as well, but solving scale for GPUs is more complicated and requires big investment. We are trying to push in all directions to be better at this.
justin
justinβ€’5mo ago
Yeah, still, thank you RunPod for making GPU / ML SaaS businesses a whole lot easier lol
flash-singh
flash-singhβ€’5mo ago
still many pain points as you can see, getting there by the day
ashleyk
ashleykβ€’5mo ago
By the way, not using network storage doesn't even help; this endpoint of mine doesn't use any network storage and almost all my workers are throttled. This is a serious problem with the 24GB tier, basically zero availability anywhere.
No description
ashleyk
ashleykβ€’5mo ago
Massive problem, we have a stand at the PBX Expo in Las Vegas and this is impacting our product demonstrations 😡 CC: @JM
No description
ashleyk
ashleykβ€’5mo ago
I don't understand, because if I edit my endpoint, it says "High Availability" for 24GB yet basically all my workers are throttled.
justin
justinβ€’5mo ago
Not sure if this helps / you probably already did it, but I had to reset my max workers to 0 and then back to 12, and kick out EU-CZ-1 so I don't get assigned any GPUs from that region. I think the big problem with RunPod's workers right now is that they seem to only stay on the first assigned GPU, because I had the same experience: after editing my endpoints I was also fully throttled until I forcefully refreshed all the workers. Edit: could setting minimum workers temporarily while the stand is active relieve the issue for now? x.x
No description
No description
justin
justinβ€’5mo ago
@JM / @flash-singh hopefully can chime in though .-. I'm also confused about what the best steps are in these situations; if we edit the endpoint, do we need to refresh all the workers? What is the expected procedure?
ashleyk
ashleykβ€’5mo ago
Wow, that's a major fail; if all my workers end up in CZ and get throttled, it should pick workers from somewhere else. Good question; changing priority made zero difference, I had to scale workers down to zero and back up again, which sucks.
justin
justinβ€’5mo ago
Totally agree, extremely frustrating. I moved all my endpoints to kick CZ-1 out so I'm not assigned a bad region, because the priority algorithm really is bad and seems to do nothing.
ashleyk
ashleykβ€’5mo ago
I changed all my endpoints from 24GB to 48GB; the 24GB tier is totally and utterly fucked up and completely unusable. And nice how nobody from RunPod bothers to fucking respond when we have a fucking PRODUCTION ISSUE. THIS IS TOTALLY UNACCEPTABLE!!! I am looking for a new provider in the morning; RunPod is utter shit if you can't get support. cc @Zeen
justin
justinβ€’5mo ago
https://discord.com/channels/912829806415085598/1209973235387474002 I agree, you guys need to change the priority algorithm to something similar to my feedback. It at least needs to be visibly proactive about finding workers, and start shifting two or three workers out of throttle immediately, after like 5-10 seconds, rather than letting them sit. Again, I have zero clue how the priority algorithm works, but we can't optimize anything to RunPod's specification because there is nothing for us to specify. Honestly I'd even write my own priority algorithm if I could.
flash-singh
flash-singhβ€’5mo ago
Can you share your endpoint ID? That seems like a bug.
justin
justinβ€’5mo ago
I'll let @ashleyk ping his endpoint when he can, but because I experienced it too: qie98s97wqvw4t. This one is mine. I know ashleyk's is more production critical, but it seems like a bug with the priority algorithm then, if both he and I are able to get fully throttled. I mean, it's fixed following the steps I said: reset max workers to 0, shift my priorities around, kick CZ out. But I just wonder why I need to manually do this and scale all my workers to 0 myself, rather than the priority algorithm handling this for me. Also, if editing the endpoint is sensed and updated, it should really try to recalculate all the throttled workers and begin to shift them over if there is availability; I think that is why ashleyk and I were confused when editing our endpoints and nothing happened.
flash-singh
flash-singhβ€’5mo ago
I see all 21 workers are idle, so what's likely happening is there is a huge spike of work which takes many GPUs, and that slows things down.
justin
justinβ€’5mo ago
You said before that throttled workers are switched out every hour; is it possible to move 2-3 of them actively before that hour is hit? Also, I think it's because he refreshed all his workers https://discord.com/channels/912829806415085598/1209942179527663667/1209970269108707398 where he had to scale them all to zero and back.
flash-singh
flash-singhβ€’5mo ago
We will have to optimize that further, but right now a huge spike will cause throttling and that will wind down after a few minutes. This is showing all idle now.
justin
justinβ€’5mo ago
I think this is a bug then; it's not a few minutes. Yeah, it is showing idle because he changed it, but he obviously had the conversation going for longer than 3 minutes. Maybe ashleyk can share his graph at a closer time scale, but I'm sure he got fully throttled.
flash-singh
flash-singhβ€’5mo ago
Got it, so he must have reset the workers. Oh, I do see a throttle spike, then an init spike, so he must have reset it.
justin
justinβ€’5mo ago
Yeah, I guess. Then my question is: is this a bug with the priority algorithm? What do you mean by reset?
flash-singh
flash-singhβ€’5mo ago
set max to 0
justin
justinβ€’5mo ago
Okay, so there's no way to do this automatically?
Zeen
Zeenβ€’5mo ago
It's not a bug so much as the priority algo isn't good.
flash-singh
flash-singhβ€’5mo ago
we do it automatically but it occurs hourly, will need to optimize that
Zeen
Zeenβ€’5mo ago
We're thinking of just allowing users to set a quota per GPU type, in addition to assigning launch priority. What happened in the past few days is that a few of our larger customers flexed up 600+ serverless workers.
justin
justinβ€’5mo ago
Is it possible to guarantee something like a 2-worker minimum that kicks in immediately? I think that would even fix the current issues. And also, if someone manually changes the endpoint, to start searching for new GPUs if any are throttled? I guess the problem is that ashleyk had to manually scale to 0 in a production env; if we could even scale down to half and scale back up, that would be nice.
flash-singh
flash-singhβ€’5mo ago
yeah have to optimize that to take these conditions into account
justin
justinβ€’5mo ago
I see. I guess my next question is: is it possible for me to terminate workers through the GraphQL endpoint? https://graphql-spec.runpod.io/#definition-PodStatus Because I want to write a script on my server to force minimum workers or terminate throttled workers if I have jobs in the queue, and I need it to be more proactive. Do I treat it like a pod?
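Basically what I have in mind is a little watchdog like this (rough sketch reusing the hypothetical get_worker_stats() / set_min_workers() helpers from earlier in the thread; the "inQueue" key, thresholds and poll interval are all assumptions):

```python
# Rough watchdog sketch: if jobs are queued but no workers are usable,
# temporarily force a worker minimum, then release it once things recover.
# Depends on get_worker_stats() and set_min_workers() sketched above.
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "qie98s97wqvw4t"

def queued_jobs(endpoint_id: str) -> int:
    # Assumption: /health also reports queued job counts under "jobs"/"inQueue".
    url = f"https://api.runpod.ai/v2/{endpoint_id}/health"
    resp = requests.get(url, headers={"Authorization": f"Bearer {API_KEY}"}, timeout=10)
    return resp.json().get("jobs", {}).get("inQueue", 0)

while True:
    workers = get_worker_stats(ENDPOINT_ID)
    usable = workers.get("idle", 0) + workers.get("running", 0)
    if queued_jobs(ENDPOINT_ID) > 0 and usable == 0:
        set_min_workers(ENDPOINT_ID, 2)   # steal back capacity while throttled
    elif usable > 0:
        set_min_workers(ENDPOINT_ID, 0)   # stop paying for active workers
    time.sleep(60)  # poll interval is arbitrary
```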
flash-singh
flash-singhβ€’5mo ago
Yes, it's similar. I plan to optimize this either way.
justin
justinβ€’5mo ago
Yeah, I guess. Do you know when it is estimated to be optimized? I'm looking into it because I want to start feeding more requests to my LLM / stuff soon, but I'll write the script to dynamically set minimum workers if I have to, depending on the time frame. Thank you though; I appreciate that the priority algorithm can be looked into / optimized / hopefully have what it's doing shared at some point too after it's reoptimized. I guess the fact that it sits in a throttled state for an hour is a very poorly known fact.
flash-singh
flash-singhβ€’5mo ago
whats your endpoint id? let me check logs for it
justin
justinβ€’5mo ago
I mean, it's not an issue for me (qie98s97wqvw4t) because I'm not in a production env like ashleyk is; I'm just setting it up so that I can start testing and moving my whole pipeline through, because I was relying on ChatGPT and it was costing too much. But I commented when this conversation started because I wanted to share how not using a network volume could give you better availability: https://discord.com/channels/912829806415085598/1209942179527663667/1209946232131297320 I myself was throttled across the board in my to-be example of why you shouldn't rely on network storage - but honestly, I've posted about this multiple times in the past too, and I guess as Zeen said you guys have experienced an insane uptick in the last 3 days.
flash-singh
flash-singhβ€’5mo ago
planning on releasing optimizations tomorrow, have to tweak the knobs carefully otherwise it causes network issues
justin
justinβ€’5mo ago
Great, I'm glad. If those optimizations end up being released, do you think you can tell us what they end up being, so we know what the changes are? Thank you. https://discord.com/channels/912829806415085598/1209973235387474002 Again, I think the biggest issue for @ashleyk (and honestly anyone else who would be using RunPod in production), and why it wouldn't be taken seriously, is that if you are fully throttled across the board and have no options to fix availability, that really is the worst nightmare.
flash-singh
flash-singhβ€’5mo ago
ill share what i can here
justin
justinβ€’5mo ago
Thanks! Sorry for hammering you guys so much, I know there is a lot behind the scenes.
flash-singh
flash-singhβ€’5mo ago
we are here to support, something we need to optimize regardless
ashleyk
ashleykβ€’5mo ago
So basically what you are saying is that money is more important to RunPod than providing a stable service to all customers, and that RunPod will increase the number of workers for larger customers to such an extent that it takes down the endpoints of all other customers? 😡 @flash-singh my endpoint was idle because the 24GB tier is unusable and I had to change it to the 48GB tier and scale it down and back up again, because editing the endpoint is shit and can't update automatically.
justin
justinβ€’5mo ago
Yeah, hopefully the coming changes he proposes this week will fix it. https://discord.com/channels/912829806415085598/1209973235387474002/1210002895781625907 It definitely is an issue that I think they will work to address, and let's see where it goes. I am glad to see that the hour-long throttle will drop down to 4 minutes before things start swapping around, plus allow movement with fewer restrictions, so hopefully RunPod's algorithm will be a heck of a lot more proactive.
Zeen
Zeenβ€’5mo ago
No we had an internal discussion and all agreed that the quota shouldn't have been increased in this case.
Jidovenok
Jidovenokβ€’5mo ago
@flash-singh I just want to thank you for your work and your product. Despite some throttling problems, our company really appreciates the desire to fix problems instead of ignoring customers as most support teams do.
teddycatsdomino
teddycatsdominoβ€’5mo ago
I have a few questions here. What exactly is best practice when availability runs low in the region where we have a network volume? Should we keep endpoints active in multiple regions? On a similar note, is there a best practice regarding when to use a network volume and when to bundle models into our image? If we have 20GB of models, should that all just be bundled, or should we be using a network volume?
justin
justinβ€’5mo ago
I think this should be bundled, TBH. I find < 30GB for the compressed image shown on Docker Hub quite safe; this is an example of my Mistral one: https://hub.docker.com/layers/justinwlin/mistral7b_openllm/latest/images/sha256-47f901971ee95cd0d762fe244c4dd625a8bf7a0e0142e5bbd91ee76f61c8b6ef?context=repo Haha, I saw you respond in the different thread, but I'll continue to answer here. The number just comes from trial and error, anecdotally. If you go too high, the download time for serverless initialization becomes impossible, so I find that < 30GB gives a reasonable first initialization time. Once you start pushing that boundary, I just find it personally a bit weird.
teddycatsdomino
teddycatsdominoβ€’5mo ago
Ok, I'll give it a shot. That implies that I could ditch the network volume and use the global region which should help tremendously with availability.
justin
justinβ€’5mo ago
The RunPod base image is what I tend to use, so there is some cost there, but if you want to optimize it to the core, I saved maybe 1-2 GB by not using runpod-pytorch as a starting point: https://github.com/justinwlin/runpodWhisperx/blob/master/Dockerfile But TBH, nowadays I just end up building on it because it saves me a lot of headache: https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/Dockerfile Yeah, you won't get locked in per region. Also, another thing is the priorities do matter. It tries to assign you a lot of whatever you put as (1) when you first initialize; I try to put a (1) priority on 24GB or 48GB, but not on 24GB Pro. The 24GB Pro and 48GB are very similar in cost, but the 24GB Pro just isn't worth the headaches it gives. I also got rid of the EU-CZ-1 region, because I don't want to get assigned any GPUs from there since that region seems to have some availability issues around the 24GB Pro. I'm sure the changes Flash is making will get throttled workers to move around much better, but I'd rather just not deal with it.
justin
justinβ€’5mo ago
No description
justin
justinβ€’5mo ago
example what i mean
justin
justinβ€’5mo ago
No description
teddycatsdomino
teddycatsdominoβ€’5mo ago
This is helpful, thank you Justin!
Dextro
Dextroβ€’5mo ago
Still encountering this issue trying to get 4090s as of this afternoon:
No description
ashleyk
ashleykβ€’5mo ago
Yep, all my workers are throttled again too, RunPod serverless is pretty unusable at the moment
ashleyk
ashleykβ€’5mo ago
I even have 2 different endpoints in different regions and they are both throttled
No description
flash-singh
flash-singhβ€’5mo ago
4090s are too high in demand right now and more supply will be added in 1-2 weeks
dudicious
dudiciousβ€’5mo ago
48gbs were all throttled in CA today too.
ashleyk
ashleykβ€’5mo ago
Yes, my endpoints are 48GB in SE and CA and both fully throttled. Also my 24GB endpoint without network storage, and thus no region affinity, is fully throttled too. Serverless is a joke. I'm an enterprise customer but all my endpoints are fully throttled and I cannot get support from RunPod, so I'm taking my business elsewhere because this is totally unacceptable @flash-singh @Zeen @JM
Zeen
Zeenβ€’5mo ago
Hey, I know not much I can say after the fact can fix past pain, but we have made a few platform releases to improve the throttling in the past day as well as added more capacity (way more coming next week). We've got a lot of customers using serverless and we've experienced a spike in consumption that is just enormous, and we're trying our best to handle it. We apologize for affecting your business and we are trying our best to find a balance between action and messaging.
dudicious
dudiciousβ€’5mo ago
Still getting throttled constantly. Serverless doesn't seem viable in its current state. Bummer. The tech is cool.
Baran
Baranβ€’5mo ago
It's insane to me that I'm just getting throttled out of the blue without a heads-up. All of my workers just won't start and every previously working GPU is now unavailable. This happened yesterday in EU-SE-1 and now today in EU-NO-1. What's happening? @Zeen @flash-singh @JM
ashleyk
ashleykβ€’5mo ago
Looks like RunPod may have fixed something around 3.5 hours ago; all my endpoints' throttled workers seem to have recovered around the same time.
No description
ashleyk
ashleykβ€’5mo ago
Looks like I spoke too soon, they looked better for a short while, now getting throttled again.
No description
Baran
Baranβ€’5mo ago
This sucks
ashleyk
ashleykβ€’5mo ago
Basically no GPUs available in NO, SE has some 16GB and 24GB
ashleyk
ashleykβ€’5mo ago
SE
No description
ashleyk
ashleykβ€’5mo ago
NO
No description
ashleyk
ashleykβ€’5mo ago
I don't understand whats going on though because in NO I have no throttled workers.
marshall
marshallβ€’5mo ago
Same issue with throttled workers... personally I think RunPod has to scale up ASAP at this point. Previously we could get by just using A5000s, and only 4090s were in throttling hell... but now even that is throttled indefinitely. The issue has been happening for several days now, and the obvious solution of "just use 'active workers'" isn't really viable at our small scale, because doing that would be just like paying for the machines directly... we are running a community-supported project.
Baran
Baranβ€’5mo ago
The lack of communication is really concerning
luckedup.
luckedup.β€’5mo ago
Same here, running production site. This happened to me before (I moved from US to EU) for availability and now it happened in EU again.
flash-singh
flash-singhβ€’5mo ago
We have tweaked the algos, but at certain points in the day the spikes eat up all the capacity. We are adding more GPUs this week for A5000s and 4090s.
ashleyk
ashleykβ€’5mo ago
I think you need to add more network capacity too; too many machines on the same network seems to be causing issues where everyone is experiencing slow speeds, serverless getting connection-timed-out issues, people's pods disappearing, etc. I just had to terminate workers for an endpoint because they were getting stuck for 5 minutes on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck, eating up all my credits, and the job doesn't even get processed, it gets stuck on IN_PROGRESS. My manager has demanded a refund for this because it's unacceptable.
marshall
marshallβ€’5mo ago
I just had to terminate workers for an endpoint because they were getting stuck for 5 minutes on a job that takes 14 seconds, due to network connectivity issues. Then a new worker spawned and also got stuck, eating up all my credits, and the job doesn't even get processed, it gets stuck on IN_PROGRESS. My manager has demanded a refund for this because it's unacceptable.
This also happens to us... we were getting charged for 10+ minutes for a worker that kept "queueing image for pull" while the job was still IN_QUEUE... I was gonna report it but I didn't know if we were actually being charged or if it was just a UI thing. We chewed through $3 of credits in ~24 hours when we usually only spend $0.74/day at our size... and our jobs only took 2-3s. It actually happened twice, and that was when I was there to see it... so it's definitely been doing that multiple times per hour.
ashleyk
ashleykβ€’5mo ago
@marshall are you using latest tag for your Docker image?
marshall
marshallβ€’5mo ago
We have our own tagging system that tags images based on the commit message. I don't think it's very relevant to the issue, but the tag was sm-q, hosted on our private Docker registry.
ashleyk
ashleykβ€’5mo ago
Is it possible to push a new image to the same tag?
marshall
marshallβ€’5mo ago
I guess so? But RunPod caches the images per data center, so that usually just happens in development... which is why we have semver for dev images. The image pulls just fine and we use it in prod; the issue is on the worker's side... infinitely "queueing image for pull" and us getting charged for a job that's not even in progress. The issue occurred again:
2024-02-25T18:28:18Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:34Z create container ***/serverless-llm:sm-q
2024-02-25T18:28:34Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:28:51Z create container ***:sm-q
2024-02-25T18:28:51Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
2024-02-25T18:29:07Z create container ***/serverless-llm:sm-q
2024-02-25T18:29:07Z error creating container: Error response from daemon: Conflict. The container name "/4e07zi5dmuki3w-0" is already in use by container "867bebfa2354330a40a65a3a3e53cda8a539d19326266fbea0cb3419bd1599d3". You have to remove (or rename) that container to be able to reuse that name.
It's been doing that for 3 minutes, and we're getting charged for it... so far in the past 30 minutes, RunPod has chewed through 10 cents. If we calculate how many requests that would've been: 0.1 / (0.00026 * 3) ≈ 128.21 requests in that past 30 minutes.
marshall
marshallβ€’5mo ago
this doesn't look like 128 requests to me:
No description
No description
marshall
marshallβ€’5mo ago
Not even close. @flash-singh sorry for the direct ping, but uh, it's actually chewing through our balance; another 8 cents has just been deducted. What do we do? 2 more cents deducted out of nowhere, and there are no jobs running across all endpoints.
flash-singh
flash-singhβ€’5mo ago
Is it just 1 worker? Terminate that for now, I'll look into the bug.
marshall
marshallβ€’5mo ago
we tried setting max worker count to 8 to try and see if that will improve the delay time... it didn't
flash-singh
flash-singhβ€’5mo ago
due to throttled workers?
marshall
marshallβ€’5mo ago
Yupp
flash-singh
flash-singhβ€’5mo ago
Higher max workers can help, but right now much of the compute is saturated, and expansion is already planned this week for some GPUs.
marshall
marshallβ€’5mo ago
What we're also thinking is that it might be deducting for cancelled jobs. The timer goes up on each refresh, and these jobs were previously cancelled due to them taking too long... our systems just cancelled them to prevent too much usage... the timeout is set to 120s (queueing included).
No description
No description
flash-singh
flash-singhβ€’5mo ago
Cancelled jobs won't charge once the cancel is triggered; we stop the workers running the job.
marshall
marshallβ€’5mo ago
holy crap
No description
marshall
marshallβ€’5mo ago
I think the best way to go for now is to shut down our AI chatbot feature until this infrastructure issue is fixed. We can't have our contributors' money wasted over RunPod's scaling issue; if this goes unwatched, who knows how much money it'll siphon out, and we aren't certain if we're going to get refunded for this.
marshall
marshallβ€’5mo ago
tried contacting sales... welp.
No description
marshall
marshallβ€’5mo ago
Currently trying to run a smaller version of our model on 16GB temporarily. 1 week of downtime is too big of an impact for us, apparently.
HyS | The World of Ylvera
@marshall hey was your issue ever resolved? I looked through my logs and saw a sudden huge spike in credit consumption for just a couple jobs. It looks like the "delay" time it took to even run the job was counted into the actual gpu usage :T I'd like to add it was also on the same dates as your issues. Feb 24/25
marshall
marshallβ€’5mo ago
Got in touch with sales; they gave back the burned credits based on our 30-day average. Right now we're running the model on 16GB, which is a bit more expensive due to the longer inference time (despite being 30% cheaper, the model took 60% longer to produce output), so ideally we should go back to 24GB, but we'll have to wait for RunPod's announcement regarding GPU availability... According to sales:
"It's probably going to be a gradient over time rather than a binary state of being resolved/not resolved since we add more capacity on a weekly/biweekly basis; we do announce big supply adds on Discord when they come through so that's probably the best way to keep updated"
which is their answer when I asked "if/when the issue would get resolved"
HyS | The World of Ylvera
Thanks a ton for the response! I contacted them directly as well for now. Good to hear your side got (mostly? kinda?) resolved :]
marshall
marshallβ€’5mo ago
Still not fully resolved but at least they refunded the credits xd
HyS | The World of Ylvera
Job execution times are normal, but the delay time caused a huge spike in credit consumption :[
No description
No description
HyS | The World of Ylvera
Good to hear they refunded your side. Hoping for the same
ashleyk
ashleykβ€’5mo ago
How do you contact sales? I need to contact them for a refund too..
HyS | The World of Ylvera
I used their chat on their site. It's in the lower bottom right
JM
JMβ€’5mo ago
Hey @marshall @HyS | The World of Ylvera @ashleyk I onboarded a huge load of hardware. However, the minimum RunPod should be able to do is provide high-quality communication, which I see wasn't ideal. Zhen, Pardeep, Justin and I have been pushing hard on at least 5 different features to make Serverless much better at managing huge loads. Secondly, we hired 3 support staff and 2 cloud engineers, and we are looking for more support engineers as well. Communication must improve; and it will, trust me. That being said, we value relationships above all else. All else. Hit me up in private and we will provide compensation for you.
marshall
marshallβ€’5mo ago
That's a great resolution!
HyS | The World of Ylvera
For now I dmed you. Thank you for the ping!
marshall
marshallβ€’5mo ago
moved into DMs
JM
JMβ€’5mo ago
Sure, thanks both!