RunPod6mo ago
Monster

not enough GPUs free

Hi there, wish you a good day. I have a serverless endpoint running on RunPod, created on top of a network volume in the US-OR-1 data center. It was running well for several days, but 20 minutes ago it started failing: no worker can be created because there are no free GPU resources. The system throws this log repeatedly:

2024-07-13T06:32:22Z create container USERNAME/ENDPOINT
2024-07-13T06:32:22Z error creating container: not enough GPUs free

How can I make sure GPU resources are available whenever a request comes in? Should I move the endpoint and the network volume to another region that has more GPU resources? How often will this kind of shortage happen? It poses a risk to the stability and quality of service, which is critical in most scenarios. Thank you.
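One way to tolerate short "not enough GPUs free" windows on the client side is to submit jobs asynchronously and retry with backoff until a worker picks them up. The sketch below is a minimal illustration, assuming RunPod's documented serverless HTTP routes (/run and /status); the endpoint ID, payload shape, and timing values are placeholders and not anything taken from this thread, so verify them against the current docs.

```python
import os
import time
import requests

# Placeholders -- substitute your own endpoint ID; the API key is read from the env.
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def run_with_retry(payload, max_attempts=5, poll_timeout=600, base_delay=30):
    """Submit a job asynchronously, poll it to completion, and retry with
    exponential backoff if it fails or stalls (e.g. during a GPU shortage)."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Asynchronous submit: returns a job ID immediately.
            resp = requests.post(f"{BASE}/run", json={"input": payload},
                                 headers=HEADERS, timeout=30)
            resp.raise_for_status()
            job_id = resp.json()["id"]

            # Poll until the job reaches a terminal state or the deadline passes.
            deadline = time.time() + poll_timeout
            while time.time() < deadline:
                status = requests.get(f"{BASE}/status/{job_id}",
                                      headers=HEADERS, timeout=30).json()
                if status.get("status") == "COMPLETED":
                    return status.get("output")
                if status.get("status") in ("FAILED", "CANCELLED", "TIMED_OUT"):
                    break  # fall through to the retry/backoff below
                time.sleep(5)
        except requests.RequestException:
            pass  # network or API error -- retry below

        time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError("job did not complete after retries")
```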
22 Replies
nerdylive
nerdylive6mo ago
Try reporting it to RunPod via the contact form, but for now, try changing your env variables (add any dummy one) or updating your template image to another tag/version.
Monster
MonsterOP6mo ago
I did not get it. The image version is my customized build on Docker Hub; it might be v1, v2, v3, anything. How is that related to competing for GPU resources? And which env file should I edit to add the dummy variable, and how? Thank you so much.
nerdylive
nerdylive6mo ago
Add a dummy env variable from the edit-template option.
nerdylive
nerdylive6mo ago
[image attachment]
nerdylive
nerdylive6mo ago
It re-creates the workers when you do that.
Monster
MonsterOP6mo ago
Anything? Like ENVDUMMY=anything?
nerdylive
nerdylive6mo ago
Add an env variable, anything: anything, yes. Then save and it'll redeploy.
Monster
MonsterOP6mo ago
OK, so you're suggesting it is not actually a resource shortage, it is a bug, and that is why I should add the dummy env variable and/or update the image version?
nerdylive
nerdylive6mo ago
Yeah, probably. It's for redeploying.
Monster
MonsterOP6mo ago
ok, thx
nerdylive
nerdylive6mo ago
You're welcome @Monster. Did that work?
Monster
MonsterOP6mo ago
I have added a dummy env variable. It did work, but that doesn't mean the problem is solved, since the error went away on its own after 2 or 3 minutes, after a couple of attempts to start new workers. So, any idea or suggestion on how to keep it from happening again? Thx.
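For spotting shortages before they hit users, a rough sketch like the one below could poll the endpoint's health route, assuming the documented /health response that reports per-endpoint worker and job counts; the exact field names (workers.idle, jobs.inQueue) are assumptions to check against the current API docs, and the endpoint ID is a placeholder.

```python
import os
import time
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder, not from this thread
API_KEY = os.environ["RUNPOD_API_KEY"]
HEALTH_URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health"


def watch_endpoint(interval=60):
    """Poll the endpoint's health route and warn whenever jobs are queued
    but no idle workers are available (a sign of a regional GPU shortage)."""
    while True:
        data = requests.get(HEALTH_URL,
                            headers={"Authorization": f"Bearer {API_KEY}"},
                            timeout=30).json()
        workers = data.get("workers", {})   # assumed shape, e.g. {"idle": 0, "running": 1}
        jobs = data.get("jobs", {})         # assumed shape, e.g. {"inQueue": 3, "inProgress": 1}
        if jobs.get("inQueue", 0) > 0 and workers.get("idle", 0) == 0:
            print("warning: jobs queued but no idle workers "
                  "(possible GPU shortage in this region)")
        time.sleep(interval)
```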
nerdylive
nerdylive6mo ago
I think it's more likely an internal bug on their side, so the best way is to report it to RunPod via a support ticket on the website so they can fix it. Try submitting one if you haven't 👍
Monster
MonsterOP6mo ago
Yes, I did report it. Thank you so much for the support.
nerdylive
nerdylive6mo ago
No problem, I'm happy it's solved for now. Alright, great, let's just wait for their response.
Monster
MonsterOP6mo ago
"they" what do you mean "they", I thought you are from runpod support team, isn't you
nerdylive
nerdylive6mo ago
No, I'm not an official support team member.
Monster
MonsterOP6mo ago
So are you hired by them, or are you a volunteer giving support based on your experience?
nerdylive
nerdylive6mo ago
Neither, but I've been invited here as a community helper. I can't really access RunPod's internal systems, nor your resources. What's up? Why are you asking? Haha
Monster
MonsterOP6mo ago
I feel you are very confident and familiar with all sorts of issues, platforms, and technologies, at least what I'd expect of a senior member of their support team. So it surprised me that you don't have access to RunPod's internals.
nerdylive
nerdylive6mo ago
Yes, I've been here quite a long time and I'm also an active developer (not at RunPod) 😆 I hope someday maybe I can join RunPod.
Monster
MonsterOP6mo ago
Yes, you will for sure. Thx anyway.