RunPod•2mo ago
Bumchat

Serverless broke for me overnight, I can't get inference to run at all.

Hi, I was using runpod/worker-vllm:stable-cuda12.1.0 in my production app with the model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ. There appears to have been an update in the last 24 hours or so that broke my app completely. I have since spent the last six hours trying to get ANYTHING out of ANY endpoint, and I just can't get anything running. Prior to today, this was running uninterrupted for over a month. I have tried:
- Rolling back to runpod/worker-vllm:0.3.1-cuda12.1.0
- Swapping out models; tried easily 8 or 9 different ones, mostly Mixtral variants. I have tried AWQ, GPTQ and unquantized models.
Logs and observations in thread (post was too long)
66 Replies
Bumchat
Bumchat•2mo ago
logs in attachment
Bumchat
Bumchat•2mo ago
And then just nothing in either log, ever again. No errors, nothing. Same result on the new vLLM stable version. Manual requests made using the tool on this page immediately go into the "IN_QUEUE" state and never return. Nothing is reflected in the logs to indicate that a request was even made.
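For reference, a minimal sketch of how a request gets submitted and polled against the endpoint programmatically (assuming the usual RunPod serverless /run and /status routes and the worker-vllm prompt/sampling_params input shape; the endpoint ID, API key and prompt below are placeholders, not my real values):
```python
# Minimal sketch: submit a job to the serverless endpoint and poll its status.
# ENDPOINT_ID / API_KEY are placeholders; the input schema is assumed to match worker-vllm.
import time
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"
API_KEY = "YOUR_RUNPOD_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit an async job.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello", "sampling_params": {"max_tokens": 64}}},
)
job_id = resp.json()["id"]

# Poll the job; in my case the status just stays IN_QUEUE and never progresses.
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=HEADERS,
    ).json()
    print(status.get("status"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
```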
Bumchat
Bumchat•2mo ago
The GPU utilisation and memory usage never go up either, which implies to me that it's not even loading the model.
(image attachment)
Bumchat
Bumchat•2mo ago
My environment variables. I've been messing with these all day, but I'm fairly certain this is the state they were in before today.
(image attachment)
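Roughly the kind of configuration in that screenshot — illustrative values only; the exact variable names the worker reads and the values in my endpoint may differ from this sketch:
```python
# Illustrative sketch of the endpoint's environment variables (assumed names/values,
# not a copy of the screenshot; check the worker-vllm README for the real ones).
env = {
    "MODEL_NAME": "TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",  # model from the original post
    "QUANTIZATION": "awq",                                   # AWQ weights
    "TENSOR_PARALLEL_SIZE": "2",                             # split across 2x A40 / A6000
    "MAX_MODEL_LEN": "16384",                                # hypothetical context length
    "TRUST_REMOTE_CODE": "0",
}
```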
Bumchat
Bumchat•2mo ago
My endpoint config, which is definitely the same as I had yesterday.
(image attachment)
Bumchat
Bumchat•2mo ago
I've spent six hours on this so far today. Is there anything obvious that I'm missing? I'm not even getting errors that I can act on.
Bumchat
Bumchat•2mo ago
I just found that the "logs" in the middle of the page have slightly different information than the logs on the worker at the top of the page.
Bumchat
Bumchat•2mo ago
Those messages aren't clear to me, and I don't know what action I can take to remedy them. Are they even errors? SIGTERM is a request to terminate a program. Maybe it's terminating and then not listening for requests? Even if I turn the "execution timeout" on, it gets ignored.
Bumchat
Bumchat•2mo ago
Just found this in the relevant inbox, I believe this was the original issue. I think I am still suffering from it.
(image attachment)
Bumchat
Bumchat•2mo ago
(email arrived around 15 hours ago; I've been aware of the issue and trying to troubleshoot it for eight hours straight now)
nerdylive
nerdylive•2mo ago
Hahaha finally
Bumchat
Bumchat•2mo ago
I still haven't been able to solve the issue; I can't get any inference to run at all. I've spent around 15 hours today troubleshooting this issue, and it has been the single most frustrating day of my life. I still can't get any inference to run on serverless endpoints at all. Support's response was "I can't get it to work either, make an issue on GitHub". It's time for me to leave RunPod behind and go somewhere else.
digigoblin
digigoblin•2mo ago
@Alpay Ariyak any idea?
Alpay Ariyak
Alpay Ariyak•2mo ago
Hi, the stable image is still 0.3.2, which is the same image it was before yesterday. I had to reupload it because a GitHub action tried to push the main branch as stable. Investigating this now. Thanks for the ping @digigoblin
digigoblin
digigoblin•2mo ago
I think support should assign vLLM support issues to you if they can't figure out the problem, rather than telling people to log an issue on GitHub.
Alpay Ariyak
Alpay Ariyak•2mo ago
Yeah, good idea, will bring it up. I think I may have figured the issue out. Patching ASAP
Bumchat
Bumchat•2mo ago
I'm keen to know what you've found
Alpay Ariyak
Alpay Ariyak•2mo ago
I believe the wrong base image might've been used during the build somehow, rebuilding everything
Bumchat
Bumchat•2mo ago
Thank you, I really appreciate it
Alpay Ariyak
Alpay Ariyak•2mo ago
Yeah ofc! Sorry for this experience and getting back to you late, I'm currently in the EST timezone
Bumchat
Bumchat•2mo ago
It's OK, it's almost 2am here. I have no idea about American timezones, but I'm used to waiting a day for support responses for basically any service. I've just been trying to solve it myself all day. Not solve it, more like work around it, I guess.
Alpay Ariyak
Alpay Ariyak•2mo ago
Ahh got you, I see, it's 11am here currently. I'm testing alpayariyakrunpod/worker-vllm:stable-cuda12.1.0 now with your endpoint configuration
digigoblin
digigoblin•2mo ago
Now that RunPod has received additional investment, there should be support staff across timezones. I also have to regularly wait several hours for a response to production issues. RunPod has customers all over the world, not just the US, so staff shouldn't all be based in the US. There are people like @Papa Madiator who are available within the other time zones, but his access is too restricted and he can't help with more complex issues.
Madiator2011
Madiator2011•2mo ago
That might be changing soon 🙂
Alpay Ariyak
Alpay Ariyak•2mo ago
Okay, here's what I think happened: the requirements.txt files in the vLLM build don't pin versions, so when I rebuilt stable yesterday (after the original was replaced by the automatic GitHub build), it installed newer versions of those packages and that broke something. Luckily, the original CUDA 11.8 version of 0.3.2 (stable) remained, so I was able to pull it and grab all of the package versions.
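For illustration, pinning would look something like this — the package names and versions below are examples only, not the actual list recovered from that image:
```
# Hypothetical pinned requirements.txt (illustrative versions, not the recovered ones)
vllm==0.3.2
ray==2.9.3
torch==2.1.2
transformers==4.38.2
runpod==1.6.2
```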
Bumchat
Bumchat•2mo ago
That's good news 😁
Alpay Ariyak
Alpay Ariyak•2mo ago
Now rebuilding with hopefully correct versions
Bumchat
Bumchat•2mo ago
thank you
Alpay Ariyak
Alpay Ariyak•2mo ago
Ofc, thank you for your patience
Bumchat
Bumchat•2mo ago
There's always a silver lining, I learned an absolute ton about vLLM, AWQ, GPTQ and SkyPilot today. I see a push to the git repo just now
Alpay Ariyak
Alpay Ariyak•2mo ago
For sure, glad to hear that. Unfortunately, that fix didn't seem to help, so I'm trying to see if the new update will work with that config and setting it as stable. Even with the new update, it's also stuck on "started ray worker" and 1% memory. Trying to enable enforce eager and trust remote code now
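For context, at the engine level those options map to something like the plain vLLM sketch below; this is not the worker's actual wiring, just a minimal illustration of the knobs being toggled:
```python
# Sketch of the vLLM engine options in question (plain vLLM usage, not the worker's code).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    quantization="awq",
    tensor_parallel_size=2,   # 2x A40 / A6000, where the Ray hang shows up
    enforce_eager=True,       # skip CUDA graph capture
    trust_remote_code=True,   # allow custom modeling code from the HF repo
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```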
Bumchat
Bumchat•2mo ago
One thing I've seen repeatedly today: even using worker-vllm:0.3.2-cuda12.1.0 didn't work, which, if my understanding is correct, hasn't been changed since March and should be the exact image that was worker-vllm:stable-cuda12.1.0 before yesterday, right?
Alpay Ariyak
Alpay Ariyak•2mo ago
worker-vllm:0.3.2-cuda12.1.0 was rebuilt and repushed
Bumchat
Bumchat•2mo ago
Ahh, that makes sense
Alpay Ariyak
Alpay Ariyak•2mo ago
worker-vllm:0.3.2-cuda11.8.0 wasn't, I'll try that
Bumchat
Bumchat•2mo ago
I tried 8 or 9 models across different sizes, quants and architectures(?), and every combination of settings, environment variables and versions I could think of. Same pause at the ray worker each time. I did NOT try cuda11.8.0
Alpay Ariyak
Alpay Ariyak•2mo ago
All of that was multi-gpu on A40s?
Bumchat
Bumchat•2mo ago
Yeah, 2x A40 or A6000 every time, I never changed that variable. Even on the tiny models I kept that the same
Alpay Ariyak
Alpay Ariyak•2mo ago
Same issue on the unchanged stable 11.8.0. Gotta love the lack of logs in the ray initialization
Bumchat
Bumchat•2mo ago
Makes me think of this one then, some kind of network issue that never fully resolved maybe. That's what kills me, it doesn't give me anything actionable at all
Alpay Ariyak
Alpay Ariyak•2mo ago
I feel your pain
Bumchat
Bumchat•2mo ago
In your testing today, did you use network storage volumes at all? I've been using EU-SE-1 exclusively, that's another variable that I haven't changed
Alpay Ariyak
Alpay Ariyak•2mo ago
In the past, what fixed this (specifically for multi-GPU) was using the physical CPU count to initialize ray. It does that by default now, but I'm gonna try lowering the number of CPUs used; I set up an env var for it, VLLM_CPU_FRACTION
No
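Roughly the idea, as a sketch (not the worker's actual code; the way VLLM_CPU_FRACTION is read and applied below is an assumption):
```python
# Sketch: initialize Ray with the physical CPU count, optionally scaled down by a fraction.
# Illustrative only; the real worker code and VLLM_CPU_FRACTION semantics may differ.
import os
import psutil
import ray

physical_cpus = psutil.cpu_count(logical=False) or 1
fraction = float(os.environ.get("VLLM_CPU_FRACTION", "1.0"))
num_cpus = max(1, int(physical_cpus * fraction))

# Give Ray fewer CPUs than it would auto-detect, which has unstuck multi-GPU init before.
ray.init(num_cpus=num_cpus)
```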
Bumchat
Bumchat•2mo ago
But if the cuda11.8.0 images haven't changed, and they're broken too, doesn't that effectively rule out basically everything in the images?
Alpay Ariyak
Alpay Ariyak•2mo ago
Likely so, but I'm not sure how the machines could have changed in a way that affects this either, so I'm trying to exhaust all possible options at the worker code level
Bumchat
Bumchat•2mo ago
Thanks for your attention on this. It's after 3am on Saturday morning here now, and I'm too old to pull an all-nighter these days. I'm going to go grab a few hours' sleep. Thanks again. Btw, with yourself and the support person both able to replicate this so easily, are there no other customers with the same issue? If it's working for someone, maybe it's worth comparing notes to find out what's different with their setup.
Alpay Ariyak
Alpay Ariyak•2mo ago
Sounds good, will fix this by the time you're back up, thanks! I'm guessing not that many people are doing multi-GPU; the issue is contained to that scenario
maple
maple•2mo ago
Yes, not sure if this is connected to serverless, but I have been doing dev work on vLLM in a pod on the secure cloud, and within the last 1-2 days I have also been stuck on Ray initialization/worker creation. I am using the exact same commands and installation as just a few days ago, which worked fine. Tried on multiple different GPUs. This is with a multi-GPU setup on vLLM
Alpay Ariyak
Alpay Ariyak•2mo ago
This is incredibly useful, I've noticed the same on secure cloud yesterday. Before this, did it work?
maple
maple•2mo ago
Yes, it did
maple
maple•2mo ago
I did notice these warnings from vLLM that are not present on bare-metal machines that had no problem starting ray (however, I don't remember if they have always been there)
(image attachment)
maple
maple•2mo ago
Full output and cmd where I saw this:
(image attachment)
maple
maple•2mo ago
@Alpay Ariyak More details I remember that may be helpful: I first started experiencing hanging ray init on EU-SE-1 A4000/A5000 instances. At the same time, ray init was working fine on US-OR-1 A100 SXM instances. At some point yesterday(?), ray init stopped working on both
Alpay Ariyak
Alpay Ariyak•2mo ago
Thanks a lot @maple for confirming this, it is indeed a wider issue affecting all machines and unrelated to worker vLLM. It's related to a machine agent release that was made yesterday; the team is working on rolling it back ASAP. It's absolutely terrible that production is broken as a result, but I'm glad to know now it wasn't anything I did with Worker vLLM. I was driving myself crazy trying to figure out what I did that could have caused it, as all leads led to dead ends haha - the timing with the repushed worker image was just too perfect for it not to be the main suspect
maple
maple•2mo ago
Great, could you please let me know when this is rolled back?
Alpay Ariyak
Alpay Ariyak•2mo ago
Yes of course
Zeen
Zeen•2mo ago
It should be live in less than 30m. Can you DM me your RunPod email? We'll figure out some comp for this - really sorry for the issues this caused. No rush on that ofc
Bumchat
Bumchat•2mo ago
Is there anything I might need to do at my end to get it running again? I just activated a worker on the endpoint, and it did actually load the model into memory, which is way further than I got at any point yesterday. But it's still not running inference; the requests are still stuck at IN_QUEUE. I'm about to start playing with my environment variables again in case they're in an invalid state. YES! I finally got some inference output! My app is back up and running! Only 26 hours of downtime and 186 new signups hit with "Sorry, we're down". @Alpay Ariyak thanks for your hard work with this 🎉
nerdylive
nerdylive•2mo ago
Hey, why do I get this on your site? Is it my phone only?
(image attachment)
Bumchat
Bumchat•2mo ago
No idea, I'll see if I can replicate it. I haven't seen that issue myself, we have a Let's Encrypt SSL cert. Thanks for the heads up
nerdylive
nerdylive•2mo ago
Alright no problem
Bumchat
Bumchat•2mo ago
I'm not able to replicate it across OSes (iOS, Android, macOS), browsers (Chrome, FF, Safari), or networks (cell, wifi or ProtonVPN). Is it possibly your VPN? We don't have any third-party analytics or ads or anything, only Sentry for error tracking, and Sentry is only on the server side.
(image attachment)
(image attachment)
nerdylive
nerdylive•2mo ago
I will retry it later after clearing the app cache. I think Sentry is fine on my VPN, even client side; maybe a cache problem
Bumchat
Bumchat•2mo ago
Thanks for the heads up, I'll keep an eye out for anyone else having similar problems too
nerdylive
nerdylive•2mo ago
Oh wait, it's my VPN? It's blocking your site hahah. It was, actually
Madiator2011
Madiator2011•2mo ago
@Bumchat I was trying to help as much as I could, though I do not yet have access to debug hardware stuff and haven't used the vLLM worker much. Thank you for having patience and also for reporting the issue 🙂. If you get more issues, feel free to ping me any time.