Serverless broke for me overnight, I can't get inference to run at all.
Hi, I was using runpod/worker-vllm:stable-cuda12.1.0 in my production app with the model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ. There appears to have been an update in the last 24 hours or so that broke my app completely. I have since spent the last six hours trying to get ANYTHING out of ANY endpoint, and I just can't get anything running. Prior to today, this was running uninterrupted for over a month. I have tried:
- Rolling back to runpod/worker-vllm:0.3.1-cuda12.1.0
- Swapping out models; tried easily 8 or 9 different ones, mostly mixtral variants. I have tried AWQ, GPTQ and unquantized models.
Logs and observations in thread (post was too long).
logs in attachment
And then just nothing in either log, ever again. No errors, nothing. Same result on the new vLLM stable version. Manual requests made using the tool on this page immediately go into the "IN_QUEUE" state and never return. Nothing in the logs indicates that a request was even made.
The GPU utilisation and memory usage never go up either, which implies to me that it's not even loading the model.
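For reference, my manual test is just the standard run-then-poll flow; a minimal sketch of it below, where the endpoint ID, API key and payload shape are placeholders rather than my real config:

```python
# Minimal sketch of submitting a job to a RunPod serverless endpoint and
# polling it. ENDPOINT_ID, API_KEY and the input payload are placeholders;
# the exact input schema worker-vllm expects may differ.
import time
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit an async job; the response includes a job id and a status like IN_QUEUE.
submit = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello"}},  # assumed payload shape
    timeout=30,
)
job_id = submit.json()["id"]

# A healthy worker should move the job from IN_QUEUE to IN_PROGRESS to COMPLETED.
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=HEADERS,
        timeout=30,
    ).json()
    print(status.get("status"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        print(status.get("output"))
        break
    time.sleep(5)
```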

my environment variables. I've been messing with these all day, but I'm fairly certain this is the state they were in before today
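for anyone reading along later, the kind of thing I mean is roughly this; the variable names are the ones I remember from the worker-vllm README and the values are illustrative, not my exact config:

```
# Hypothetical worker-vllm environment, for illustration only
MODEL_NAME=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ
QUANTIZATION=awq
TENSOR_PARALLEL_SIZE=2
MAX_MODEL_LEN=16384
TRUST_REMOTE_CODE=1
```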

my endpoint config, which is definitely the same as I had yesterday.

i've spent six hours on this so far today. Is there anything obvious that i'm missing?
I'm not even getting errors that I can action.
I just found that the "logs" panel in the middle of the page shows slightly different information than the logs on the worker at the top of the page.
Those messages aren't clear to me, and I don't know what action I can take to remedy them. Are they even errors? SIGTERM is a request to terminate a program. Maybe it's terminating and then not listening for requests?
Even if i turn the "execution timeout" on, it gets ignored.
just found this in the relevant inbox, i believe this was the original issue. i think i am still suffering from it.

(email arrived around 15 hours ago; I've been aware of the issue and trying to troubleshoot it for eight hours straight now)
Hahaha finally
I still haven't been able to solve the issue yet; I can't get any inference to run at all
So I've spent around 15 hours today troubleshooting this issue. It has been the single most frustrating day of my life. I still can't get any inference to run on serverless endpoints at all. Support's response was "I can't get it to work either, make an issue on GitHub". It's time for me to leave RunPod behind and go somewhere else.
@Alpay Ariyak any idea?
Hi, the stable image is still 0.3.2, which is the same image it was before yesterday. I had to reupload it because a GitHub Action tried to push the main branch as stable
Investigating this now
Thanks for the ping @digigoblin
I think support should assign vllm support issues to you if they can't figure out the problem rather than telling people to log an issue on Github.
Yeah good idea, will bring it up
I think I may have figured the issue out
Patching ASAP
i'm keen to know what you've found
I believe the wrong base image might've been used during the build somehow, rebuilding everything
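(If it helps to picture how that can happen: when the base image comes in as a build arg, a rebuild that passes a different --build-arg quietly produces an image with the same tag but different CUDA/toolchain contents. A hypothetical sketch, not the actual worker-vllm Dockerfile:)

```dockerfile
# Hypothetical sketch, not the actual worker-vllm Dockerfile.
# If CI passes a different --build-arg BASE_IMAGE=... (say, a CUDA 11.8 base),
# the resulting image keeps the same tag but ships different libraries.
ARG BASE_IMAGE=nvidia/cuda:12.1.0-base-ubuntu22.04
FROM ${BASE_IMAGE}
COPY . /app
WORKDIR /app
```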
thank you, i really appreciate it
Yeah ofc! Sorry for this experience and getting back to you late, I'm currently in the EST timezone
it's ok, it's almost 2am here. I have no idea about American timezones, but I'm used to waiting a day for support responses for basically any service. I've just been trying to solve it myself all day
not solve it, more like work around it i guess
Ahh got you, I see, it's 11am here currently
I'm testing runpod/worker-vllm:stable-cuda12.1.0 now with your endpoint configuration
Now that RunPod has received additional investment, there should be support staff across timezones. I also have to regularly wait several hours for responses to production issues. RunPod has customers all over the world, not just the US, so staff shouldn't all be based in the US.
There are people like @Papa Madiator who are available in other time zones, but his access is too restricted and he can't help with more complex issues.
That might be changing soon
Okay, what I think happened:
The requirements.txt files in the vllm build don't pin versions, so when I rebuilt stable yesterday, after the original was replaced by the automatic GitHub build, it installed newer versions of those packages and that broke something
Luckily, the original CUDA 11.8 build of 0.3.2 (stable) remained, so I was able to pull it and grab all of the package versions
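(In other words, the recovery amounts to freezing the surviving image's package set and pinning those versions; a rough sketch, with the output filename just an example:)

```bash
# Rough sketch: dump the package versions baked into the surviving image
# so they can be pinned in requirements.txt (output filename is an example).
docker pull runpod/worker-vllm:0.3.2-cuda11.8.0
docker run --rm --entrypoint pip runpod/worker-vllm:0.3.2-cuda11.8.0 freeze > pinned.txt
# requirements.txt then pins exact versions, e.g. "vllm==<version from pinned.txt>"
# instead of a floating "vllm".
```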
that's good news
Now rebuilding with hopefully correct versions
thank you
Ofc, thank you for your patience
there's always a silver lining; I learned an absolute ton about vLLM, AWQ, GPTQ and SkyPilot today
i see a push to the git repo just now
For sure, glad to hear that
Unfortunately, that fix didn't seem to help, so I'm trying to see if the new update will work with that config and setting it as stable
Even with the new update, it's also stuck on "started ray worker" and 1% memory
trying to enable enforce eager and trust remote code now
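(For anyone following along, those map to vLLM's engine arguments; roughly the equivalent direct vLLM call would be the sketch below, with the model and GPU count taken from the setup discussed above:)

```python
# Rough sketch of the equivalent direct vLLM invocation; the model and
# tensor_parallel_size are taken from the setup discussed in this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    quantization="awq",
    tensor_parallel_size=2,   # 2x A40 / A6000, as above
    enforce_eager=True,       # skip CUDA graph capture
    trust_remote_code=True,   # allow custom modeling code from the HF repo
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```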
one thing I've seen repeatedly today: even using worker-vllm:0.3.2-cuda12.1.0 didn't work, which, if my understanding is correct, hasn't been changed since March and should be the exact image that was worker-vllm:stable-cuda12.1.0 before yesterday, right?
worker-vllm:0.3.2-cuda12.1.0 was rebuilt and repushed
ahh, that makes sense
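one lesson for my side: assuming the endpoint config accepts it, pinning the worker by digest rather than tag (e.g. `runpod/worker-vllm@sha256:<digest>`, digest being a placeholder) would stop a repushed tag from silently changing what my workers pull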
worker-vllm:0.3.2-cuda11.8.0 wasn't, I'll try that
i tried 8 or 9 models across different sizes, quants and architectures(?) and every combination of settings, environment variables and versions i could think of. same pause at ray worker each time. i did NOT try cuda11.8.0
All of that was multi-gpu on A40s?
yeah, 2x a40 or a6000 every time, i never changed that variable
even on the tiny models i kept that the same
same issue on the unchanged stable 11.8.0
Gotta love the lack of logs in the ray initialization
makes me think of this one then, some kind of network issue that never fully resolved maybe
that's what kills me, it doesn't give me anything actionable at all
I feel your pain
in your testing today, did you use network storage volumes at all? I've been using EU-SE-1 exclusively; that's another variable that I haven't changed
In the past, what fixed this (specifically for multi-gpu) was using physical CPU count to initialize ray
It does that by default now, but I'm gonna try lowering the number of CPUs used; I set up an env var for it, VLLM_CPU_FRACTION
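(Roughly, the idea is something like the sketch below; this is not the worker's actual code, and the way VLLM_CPU_FRACTION is read here is an assumption:)

```python
# Sketch of initializing Ray with the physical core count scaled by an env var.
# Not the worker's actual code; how VLLM_CPU_FRACTION is applied is an assumption.
import os

import psutil
import ray

fraction = float(os.environ.get("VLLM_CPU_FRACTION", "1.0"))
physical_cores = psutil.cpu_count(logical=False) or 1

ray.init(num_cpus=max(1, int(physical_cores * fraction)))
```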
No
but if the cuda11.8.0 images haven't changed, and they're broken too, doesn't that effectively rule out basically everything in the images?
Likely so, but I'm not sure how the machines could have changed in a way that affects this either, so trying to exhaust all possible options on worker code level
thanks for your attention on this. It's after 3am on Saturday morning here now, and I'm too old to pull an all-nighter these days. I'm going to go grab a few hours' sleep. thanks again
btw, with yourself and the support person both able to replicate this so easily, are there no other customers with the same issue? if it's working for someone, maybe it's worth comparing notes to find out what's different with their setup.
Sounds good, will fix this by the time you're back up, thanks!
I'm guessing not that many people are doing multi-GPU; the issue is contained to that scenario
Yes, not sure if this is connected to serverless, but I have been doing dev work on vLLM in a pod on Secure Cloud, and within the last 1-2 days I have also been stuck on Ray initialization/worker creation
I am using the exact same commands and installation as just a few days ago, which worked fine
tried on multiple different GPUs
this is with multi-GPU setup on vllm
this is incredibly useful; I noticed the same on Secure Cloud yesterday
Before this, did it work?
yes, it did
I did notice these warnings from vLLM that are not present on bare-metal machines that had no problem starting Ray (however, I don't remember if they have always been there)

full output and cmd where I saw this:

@Alpay Ariyak More details I remember that may be helpful: I first started experiencing the hanging Ray init on EU-SE-1 A4000/A5000 instances. At the same time, Ray init was working fine on US-OR-1 A100 SXM instances
at some point yesterday(?) ray init stopped working on both
Thanks a lot @maple for confirming this, it indeed is a wider issue affecting all machines and unrelated to worker vLLM
Related to a machine agent release that was made yesterday, the team is working on rolling it back ASAP
It's absolutely terrible that production is broken as a result, but I'm glad to know now it wasn't anything I did with Worker vLLM. I was driving myself crazy trying to figure out what I did that could have caused it, as all leads led to dead ends haha. The timing of the repushed worker image was just too perfect, so it became the main suspect
Great, could you please let me know when this is rolled back?
Yes of course
it should be live in less than 30m
can you DM me your runpod email? We'll figure out some comp for this - really sorry for the issues this caused
no rush on that ofc
is there anything I might need to do at my end to get it running again? I just activated a worker on the endpoint, and it did actually load the model into memory, which is way further than I got at any point yesterday, but it's still not running inference; the requests are still stuck at IN_QUEUE.
I'm about to start playing with my environment variables again in case they're in an invalid state
YES! I finally got some inference output!
my app is back up and running! Only 26 hours of downtime and 186 new signups hit with "Sorry we're down"
@Alpay Ariyak thanks for your hard work with this
Hey, why do I get this on your site? Is it my phone only?

no idea, I'll see if I can replicate it. I haven't seen that issue myself; we have a Let's Encrypt SSL cert.
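if you get a chance, running `openssl s_client -connect <our-domain>:443 -servername <our-domain>` from that network (domain left as a placeholder here) would show whether the Let's Encrypt chain is actually what's being served to you; happy to compare output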
thanks for the heads up
Alright no problem
I'm not able to replicate it across OSes (iOS, Android, macOS), browsers (Chrome, FF, Safari), or networks (cell, wifi or ProtonVPN). Is it possibly your VPN? We don't have any third-party analytics or ads or anything, only Sentry for error tracking, and Sentry is only on the server side.
I will retry it later after clearing app cache
I think Sentry is fine on my VPN, even client side; maybe it's a cache problem
thanks for the heads up, i'll keep an eye out for anyone else having similar problems too
Oh wait it's my vpn? It's blocking your site hahah
It was actually
@Bumchat I was trying to help as much as I could, though I don't yet have access to debug hardware issues and haven't used the vLLM worker much. Thank you for your patience and for reporting the issue.
If you get more issues, feel free to ping me any time.