Serverless broke for me overnight, I can't get inference to run at all.
Hi, I was using runpod/worker-vllm:stable-cuda12.1.0 in my production app with the model TheBloke/dolphin-2.7-mixtral-8x7b-AWQ. There appears to have been an update in the last 24 hours or so that broke my app completely. I have since spent the last six hours trying to get ANYTHING out of ANY endpoint, and I just can't get anything running. Prior to today, this was running uninterrupted for over a month. I have tried:
- Rolling back to runpod/worker-vllm:0.3.1-cuda12.1.0
- Swapping out models; tried easily 8 or 9 different ones, mostly mixtral variants. I have tried AWQ, GPTQ and unquantized models.
Logs and observations in thread (post was too long).
logs in attachment
And then just nothing in either log, ever again. No errors, nothing. Same result on the new vLLM stable version. Manual requests made using the tool on this page immediately go into the "IN_QUEUE" state and never return. Nothing in the logs indicates that a request was even made.
The GPU utilisation and memory usage never go up either, which implies to me that it's not even loading the model.
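For reference, my manual test is just the standard run-then-poll flow; a minimal sketch of it below, where the endpoint ID, API key and payload shape are placeholders rather than my real config:

```python
# Minimal sketch of submitting a job to a RunPod serverless endpoint and
# polling it. ENDPOINT_ID, API_KEY and the input payload are placeholders;
# the exact input schema worker-vllm expects may differ.
import time
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"   # placeholder
API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Submit an async job; the response includes a job id and a status like IN_QUEUE.
submit = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS,
    json={"input": {"prompt": "Hello"}},  # assumed payload shape
    timeout=30,
)
job_id = submit.json()["id"]

# A healthy worker should move the job from IN_QUEUE to IN_PROGRESS to COMPLETED.
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
        headers=HEADERS,
        timeout=30,
    ).json()
    print(status.get("status"))
    if status.get("status") in ("COMPLETED", "FAILED", "CANCELLED"):
        print(status.get("output"))
        break
    time.sleep(5)
```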

my environment variables. I've been messing with these all day, but I'm fairly certain this is the state they were in before today
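for anyone reading along later, the kind of thing I mean is roughly this; the variable names are the ones I remember from the worker-vllm README and the values are illustrative, not my exact config:

```
# Hypothetical worker-vllm environment, for illustration only
MODEL_NAME=TheBloke/dolphin-2.7-mixtral-8x7b-AWQ
QUANTIZATION=awq
TENSOR_PARALLEL_SIZE=2
MAX_MODEL_LEN=16384
TRUST_REMOTE_CODE=1
```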

my endpoint config, which is definitely the same as I had yesterday.

i've spent six hours on this so far today. Is there anything obvious that i'm missing?
I'm not even getting errors that I can action.
I just found that the "logs" panel in the middle of the page shows slightly different information than the logs on the worker at the top of the page.
Those messages aren't clear to me, and I don't know what action I can take to remedy them. Are they even errors? SIGTERM is a request to terminate a program. Maybe it's terminating and then not listening for requests?
Even if i turn the "execution timeout" on, it gets ignored.
just found this in the relevant inbox, i believe this was the original issue. i think i am still suffering from it.

(email arrived around 15 hours ago; I've been aware of the issue and trying to troubleshoot it for eight hours straight now)
Hahaha finally
I still haven't been able to solve the issue yet; I can't get any inference to run at all
So I've spent around 15 hours today troubleshooting this issue. It has been the single most frustrating day of my life. I still can't get any inference to run on serverless endpoints at all. Support's response was "I can't get it to work either, make an issue on GitHub". It's time for me to leave RunPod behind and go somewhere else.
@Alpay Ariyak any idea?
Hi, the stable image is still 0.3.2, which is the same image it was before yesterday. I had to reupload it because a GitHub Action tried to push the main branch as stable
Investigating this now
Thanks for the ping @digigoblin
I think support should assign vllm support issues to you if they can't figure out the problem rather than telling people to log an issue on Github.
Yeah good idea, will bring it up
I think I may have figured the issue out
Patching ASAP
i'm keen to know what you've found
I believe the wrong base image might've been used during the build somehow, rebuilding everything
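(If it helps to picture how that can happen: when the base image comes in as a build arg, a rebuild that passes a different --build-arg quietly produces an image with the same tag but different CUDA/toolchain contents. A hypothetical sketch, not the actual worker-vllm Dockerfile:)

```dockerfile
# Hypothetical sketch, not the actual worker-vllm Dockerfile.
# If CI passes a different --build-arg BASE_IMAGE=... (say, a CUDA 11.8 base),
# the resulting image keeps the same tag but ships different libraries.
ARG BASE_IMAGE=nvidia/cuda:12.1.0-base-ubuntu22.04
FROM ${BASE_IMAGE}
COPY . /app
WORKDIR /app
```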
thank you, i really appreciate it
Yeah ofc! Sorry for this experience and getting back to you late, I'm currently in the EST timezone
it's ok, it's almost 2am here. I have no idea about American timezones, but I'm used to waiting a day for support responses for basically any service. I've just been trying to solve it myself all day
not solve it, more like work around it i guess
Ahh got you, I see, it's 11am here currently
I'm testing runpod/worker-vllm:stable-cuda12.1.0 now with your endpoint configuration
Now that RunPod has received additional investment, there should be support staff across timezones. I also have to regularly wait several hours for responses to production issues. RunPod has customers all over the world, not just the US, so staff shouldn't all be based in the US.
There are people like @Papa Madiator who are available in other time zones, but his access is too restricted and he can't help with more complex issues.
That might be changing soon
Okay, what I think happened:
The requirements.txt files in the vllm build don't pin versions, so when I rebuilt stable yesterday, after the original was replaced by the automatic GitHub build, it installed newer versions of those packages and that broke something
Luckily, the original CUDA 11.8 build of 0.3.2 (stable) remained, so I was able to pull it and grab all of the package versions
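(In other words, the recovery amounts to freezing the surviving image's package set and pinning those versions; a rough sketch, with the output filename just an example:)

```bash
# Rough sketch: dump the package versions baked into the surviving image
# so they can be pinned in requirements.txt (output filename is an example).
docker pull runpod/worker-vllm:0.3.2-cuda11.8.0
docker run --rm --entrypoint pip runpod/worker-vllm:0.3.2-cuda11.8.0 freeze > pinned.txt
# requirements.txt then pins exact versions, e.g. "vllm==<version from pinned.txt>"
# instead of a floating "vllm".
```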
that's good news
Now rebuilding with hopefully correct versions
thank you
Ofc, thank you for your patience
there's always a silver lining; I learned an absolute ton about vLLM, AWQ, GPTQ and SkyPilot today
i see a push to the git repo just now
For sure, glad to hear that
Unfortunately, that fix didn't seem to help, so I'm trying to see if the new update will work with that config and setting it as stable
Even with the new update, it's also stuck on "started ray worker" and 1% memory
trying to enable enforce eager and trust remote code now
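(For anyone following along, those map to vLLM's engine arguments; roughly the equivalent direct vLLM call would be the sketch below, with the model and GPU count taken from the setup discussed above:)

```python
# Rough sketch of the equivalent direct vLLM invocation; the model and
# tensor_parallel_size are taken from the setup discussed in this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    quantization="awq",
    tensor_parallel_size=2,   # 2x A40 / A6000, as above
    enforce_eager=True,       # skip CUDA graph capture
    trust_remote_code=True,   # allow custom modeling code from the HF repo
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```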
one thing I've seen repeatedly today: even using worker-vllm:0.3.2-cuda12.1.0 didn't work, which, if my understanding is correct, hasn't been changed since March and should be the exact image that was worker-vllm:stable-cuda12.1.0 before yesterday, right?
worker-vllm:0.3.2-cuda12.1.0 was rebuilt and repushed
ahh, that makes sense
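one lesson for my side: assuming the endpoint config accepts it, pinning the worker by digest rather than tag (e.g. `runpod/worker-vllm@sha256:<digest>`, digest being a placeholder) would stop a repushed tag from silently changing what my workers pull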
worker-vllm:0.3.2-cuda11.8.0 wasn't, I'll try that
i tried 8 or 9 models across different sizes, quants and architectures(?) and every combination of settings, environment variables and versions i could think of. same pause at ray worker each time. i did NOT try cuda11.8.0
All of that was multi-gpu on A40s?
yeah, 2x a40 or a6000 every time, i never changed that variable
even on the tiny models i kept that the same
same issue on the unchanged stable 11.8.0
Gotta love the lack of logs in the ray initialization
makes me think of this one then, some kind of network issue that never fully resolved maybe
that's what kills me, it doesn't give me anything actionable at all
I feel your pain
in your testing today, did you use network storage volumes at all? I've been using EU-SE-1 exclusively; that's another variable that I haven't changed
In the past, what fixed this (specifically for multi-gpu) was using physical CPU count to initialize ray
It does that by default now, but I'm gonna try lowering the number of CPUs used; I set up an env var for it, VLLM_CPU_FRACTION
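(Roughly, the idea is something like the sketch below; this is not the worker's actual code, and the way VLLM_CPU_FRACTION is read here is an assumption:)

```python
# Sketch of initializing Ray with the physical core count scaled by an env var.
# Not the worker's actual code; how VLLM_CPU_FRACTION is applied is an assumption.
import os

import psutil
import ray

fraction = float(os.environ.get("VLLM_CPU_FRACTION", "1.0"))
physical_cores = psutil.cpu_count(logical=False) or 1

ray.init(num_cpus=max(1, int(physical_cores * fraction)))
```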
No
but if the cuda11.8.0 images haven't changed, and they're broken too, doesn't that effectively rule out basically everything in the images?
Likely so, but I'm not sure how the machines could have changed in a way that affects this either, so trying to exhaust all possible options on worker code level
thanks for your attention on this. It's after 3am on Saturday morning here now, and I'm too old to pull an all-nighter these days. I'm going to go grab a few hours' sleep. thanks again
btw, with yourself and the support person both able to replicate this so easily, are there no other customers with the same issue? if it's working for someone, maybe it's worth comparing notes to find out what's different with their setup.
Sounds good, will fix this by the time you're back up, thanks!
I'm guessing not that many people are doing multi-GPU; the issue is contained to that scenario
Yes, not sure if this is connected to serverless, but I have been doing dev work on vLLM in a pod on Secure Cloud, and within the last 1-2 days I have also been stuck on Ray initialization/worker creation
I am using the exact same commands and installation as just a few days ago, which worked fine
tried on multiple different GPUs
this is with multi-GPU setup on vllm
this is incredibly useful; I noticed the same on Secure Cloud yesterday
Before this, did it work?
yes, it did
I did notice these warnings from vLLM that are not present on bare-metal machines that had no problem starting Ray (however, I don't remember if they have always been there)

full output and cmd where I saw this:

@Alpay Ariyak More details I remember that may be helpful: I first started experiencing the hanging Ray init on EU-SE-1 A4000/A5000 instances. At the same time, Ray init was working fine on US-OR-1 A100 SXM instances
at some point yesterday(?) ray init stopped working on both
Thanks a lot @maple for confirming this, it indeed is a wider issue affecting all machines and unrelated to worker vLLM
Related to a machine agent release that was made yesterday, the team is working on rolling it back ASAP
It's absolutely terrible that production is broken as a result, but I'm glad to know now it wasn't anything I did with Worker vLLM. I was driving myself crazy trying to figure out what I did that could have caused it, as all leads led to dead ends haha. The timing of the repushed worker image was just too perfect, so it became the main suspect
Great, could you please let me know when this is rolled back?
Yes of course
it should be live in less than 30m
can you DM me your runpod email? We'll figure out some comp for this - really sorry for the issues this caused
no rush on that ofc
is there anything I might need to do at my end to get it running again? I just activated a worker on the endpoint, and it did actually load the model into memory, which is way further than I got at any point yesterday, but it's still not running inference; the requests are still stuck at IN_QUEUE.
I'm about to start playing with my environment variables again in case they're in an invalid state
YES! I finally got some inference output!
my app is back up and running! Only 26 hours of downtime and 186 new signups hit with "Sorry we're down"
@Alpay Ariyak thanks for your hard work with this
Hey, why do I get this on your site? Is it my phone only?

no idea, I'll see if I can replicate it. I haven't seen that issue myself; we have a Let's Encrypt SSL cert.
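if you get a chance, running `openssl s_client -connect <our-domain>:443 -servername <our-domain>` from that network (domain left as a placeholder here) would show whether the Let's Encrypt chain is actually what's being served to you; happy to compare output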
thanks for the heads up
Alright no problem
I'm not able to replicate it across OSes (iOS, Android, macOS), browsers (Chrome, FF, Safari), or networks (cell, wifi or ProtonVPN). Is it possibly your VPN? We don't have any third-party analytics or ads or anything, only Sentry for error tracking, and Sentry is only on the server side.
I will retry it later after clearing app cache
I think Sentry is fine on my VPN, even client side; maybe it's a cache problem
thanks for the heads up, i'll keep an eye out for anyone else having similar problems too
Oh wait it's my vpn? It's blocking your site hahah
It was actually
@Bumchat I was trying to help as much as I could, though I don't yet have access to debug hardware issues and haven't used the vLLM worker much. Thank you for your patience and for reporting the issue.
If you get more issues, feel free to ping me any time.