Sarcagian

Serverless Requests Queuing Forever

Title says it all - I send a request to my serverless endpoint (just a test through the runpod website UI), and even though all of my workers are healthy, the request has just been sitting in the queue for over a minute. Am I being charged for time spent in queue as well as time spent on actual inference? If that's the case, then I'm burning a lot of money very fast lol. Am I doing something wrong?
riverfog7
riverfog7โ€ข3w ago
You need to specify more info, like the image and model you're using
Jason
Jasonโ€ข3w ago
You're charged only when a worker is running, including while it's loading the model (not specifically for time in queue, but queue time can overlap with that). Look at the worker tab: when it's green, it's running. Check a worker, then check its log
Sarcagian
SarcagianOPโ€ข3w ago
Understood, that clears that up at least. I'm running vLLM and attempting to use the Gemma 3 27B IT model from Google's repository
Jason
Jasonโ€ข3w ago
okay
Sarcagian
SarcagianOPโ€ข3w ago
The workers were all fully initialized and ready, but once I queued the request, they just didn't seem to do anything. I was getting a tokenizer error in the logs for the first model I tried running, but didn't see any errors on the second
Jason
Jasonโ€ข3w ago
and which gpu model are you using?
Sarcagian
SarcagianOPโ€ข3w ago
I believe I had selected an H100
Jason
Jasonโ€ข3w ago
I see, no logs at all? If you can, please export or download the logs and just send them here
Sarcagian
SarcagianOPโ€ข3w ago
I will on my next attempt, had to move onto another project for a little while
riverfog7
riverfog7โ€ข3w ago
And how long did you wait? If you didn't specify a network volume, downloading models can take a long time; a 27B model is about 54 GB
Sarcagian
SarcagianOPโ€ข3w ago
I watched the workers complete the download in their logs live
riverfog7
riverfog7โ€ข3w ago
How about model load? Can you upload the logs here?
Sarcagian
SarcagianOPโ€ข3w ago
Like I said, I will once I go to try again. I've already removed that endpoint, unfortunately.
riverfog7
riverfog7โ€ข3w ago
One possibility is an old vLLM version that doesn't support Gemma 3. The PR was merged 26 days ago
riverfog7
riverfog7โ€ข3w ago
GitHub
[Model] Add support for Gemma 3 by WoosukKwon · Pull Request #1466...
This PR adds the support for Gemma 3, an open-source vision-language model from Google. NOTE: The PR doesn't implement the pan-and-scan pre-processing algorithm. It will be implemented by ...
Sarcagian
SarcagianOPโ€ข3w ago
I was wondering about that - I just assumed the default vLLM container on the serverless option was up to date though. Can I use any container off of a registry, like you can with normal pods?
riverfog7
riverfog7โ€ข3w ago
From what I know, it needs a handler for the requests. But you can always build the vLLM container with the latest vLLM; the Dockerfile should be in RunPod's official repo
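For reference, a minimal sketch of such a handler, assuming the runpod Python SDK; run_inference here is a hypothetical stand-in for whatever engine call the worker image actually wraps:

# Minimal RunPod serverless handler sketch (runpod SDK assumed; run_inference is hypothetical).
import runpod

def run_inference(prompt: str) -> str:
    # Placeholder for the actual call into vLLM (or another engine) inside the worker.
    return f"echo: {prompt}"

def handler(job):
    # RunPod delivers the request payload under job["input"].
    prompt = job["input"].get("prompt", "")
    return {"output": run_inference(prompt)}

runpod.serverless.start({"handler": handler})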
Sarcagian
SarcagianOPโ€ข3w ago
haha yeah, I've actually been trying to do that so I can build it with support for a 5090, which also is not going well. I can build the image just fine, it's just not compiling with the right CUDA version no matter what modifications I make to the Dockerfile
riverfog7
riverfog7โ€ข3w ago
Wdym by support for 5090
Sarcagian
SarcagianOPโ€ข3w ago
anyway, that's unrelated. The default image doesn't use CUDA 12.8
riverfog7
riverfog7โ€ข3w ago
It should be fine though, because no one uses CUDA 12.8. It's too new. CUDA 12.4 should work fine with a 5090. @Sarcagian what's your max model len?
Jason
Jasonโ€ข3w ago
vLLM 0.8.2 supports Gemma 3 already. What do you mean it's not compiling with the right CUDA version? How did you check it? I thought you just choose a base image for that
riverfog7
riverfog7โ€ข3w ago
Yeah just looked that up
Sarcagian
SarcagianOPโ€ข3w ago
yeah, that's what I did. I'm not familiar enough yet with vLLM to give better info unfortunately. Probably going to need another few hours of trying to figure this out to get to that point lol
riverfog7
riverfog7โ€ข3w ago
CUDA-related stuff always causes headaches 😪
Jason
Jasonโ€ข3w ago
No really, how did you check it? I wanna know
Sarcagian
SarcagianOPโ€ข3w ago
Check what specifically? The logs say as much in the container after powering it on once it's built.
Jason
Jasonโ€ข3w ago
Check the cuda version "It is not compiling with the right cuda version"
Sarcagian
SarcagianOPโ€ข3w ago
Right - I changed the base image via the ARG for the CUDA version in the Dockerfile. I'm going to go back at it later tonight but just haven't had the chance yet.

ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.8.1
ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive

Then later on there's an ARG/ENV value called TORCH_CUDA_ARCH_LIST, which searches seem to indicate I should set to either 12.8 or 12.8.1. I believe this has something to do with how the flash-attn modules are compiled, or rather which versions of CUDA they compile for. But then, after building the image with all of those changes, I still get the following when starting the container:

NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

More info in this thread: https://github.com/vllm-project/vllm/issues/14452

Lots of other apps and projects out there where people are having the same issue with Blackwell compatibility. Anyway, I just haven't taken all the time needed to fully look into this, nor is this what the current thread is about lol
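A quick way to see what the installed PyTorch build was actually compiled for, versus what the card reports (standard torch calls, shown here only as a sketch):

# Sketch: compare the torch build's supported architectures with the GPU's capability.
import torch

print(torch.__version__, torch.version.cuda)        # torch build and the CUDA toolkit it was built against
print(torch.cuda.get_arch_list())                   # sm_120 must appear here for Blackwell to work
print(torch.cuda.get_device_capability(0))          # an RTX 5090 reports (12, 0)
print(torch.cuda.get_device_name(0))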
Jason
Jasonโ€ข3w ago
Try using 12.1-12.4
Sarcagian
SarcagianOPโ€ข3w ago
12.1-12.4 are not compatible with Blackwell cards though, it must be 12.8 or later
riverfog7
riverfog7โ€ข3w ago
Is this real?
Sarcagian
SarcagianOPโ€ข3w ago
As far as I can tell, yes. I'm not familiar enough with all of the intricate details, but they don't support the new sm_120 compute capability of the Blackwell cards just yet. I didn't know this until after I bought the card obviously, but tbh I'm still keeping it since I'm sure the support will come soon
Sarcagian
SarcagianOPโ€ข3w ago
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-12-8-update-1-release-notes
"This release adds compiler support for the following Nvidia Blackwell GPU architectures: SM_100 SM_101 SM_120"
So actually it appears you need CUDA 12.8.1 specifically for Blackwell
riverfog7
riverfog7โ€ข3w ago
Did you try other versions? Like 12.4. Cards with higher CUDA compute capabilities should support lower CUDA versions. It could be a version mismatch between your graphics driver and PyTorch
Sarcagian
SarcagianOPโ€ข3w ago
hmm interesting. I'm on 570.124 on Linux so could be something there. Haven't tried anything in Windows but maybe I'll give that a shot next
riverfog7
riverfog7โ€ข3w ago
I mean the CUDA toolkit. What does your nvcc -V print?
Sarcagian
SarcagianOPโ€ข3w ago
well my driver situation is pretty messed up, but I still don't think that's the issue exactly. CUDA applications that explicitly support the new compute capabilities and CUDA version seem to work just fine.
riverfog7
riverfog7โ€ข3w ago
Any results? It should print 12.8 something
Jason
Jasonโ€ข3w ago
If their container image is 12.8 then it will be that. Is the pod host on 12.8 (via the pod create filter)?
Sarcagian
SarcagianOPโ€ข3w ago
It's all a bit more complex than I thought after more research. For now I'm just sticking with Ollama locally until full explicit support for Blackwell is included in a vLLM release
riverfog7
riverfog7โ€ข3w ago
GitHub
[Feature]: Support for RTX 5090 (CUDA 12.8) · Issue #13306 · vllm...
🚀 The feature, motivation and pitch Currently only nightlies from torch targeting 12.8 support blackwell such as the rtx 5090. I tried using VLLM with a rtx 5090 and no dice. Vanilla vllm installat...
Jason
Jasonโ€ข3w ago
Huh, requires a custom vLLM build and nightly packages. Nice
Sarcagian
SarcagianOPโ€ข3w ago
yeah, I haven't gone back to it but I did try this once and it didn't quite work. Going to give it another go right now. I think so. Oh, this wasn't the issue/post I was following instructions from; this one is way better. Damn, thank you, this will probably work. Now I just need to find something similar for SGLang
riverfog7
riverfog7โ€ข3w ago
@Sarcagian the issue mentions torch version upgrades, so doing that may make SGLang work too
Sarcagian
SarcagianOPโ€ข3w ago
I think this is what I was missing. Thank you guys so much, seriously
riverfog7
riverfog7โ€ข3w ago
Btw, can you give us an update if it works? Just in case someone has to use a B200 or an RTX 5090 to deploy vLLM
Sarcagian
SarcagianOPโ€ข3w ago
Will do. Oh, you know what? SGLang's Dockerfile uses Triton Inference Server as its base image, so that's going to be inherently different. I have zero knowledge yet on Triton haha. Maybe it can be swapped for a different base image, or the same one that vLLM uses? Not too sure if there are specific dependencies there with Triton.
riverfog7
riverfog7โ€ข3w ago
If it's torch-based, won't that solution work? @Sarcagian why are you using SGLang though? I'm interested in building a Dockerfile because I may use it in the future
Sarcagian
SarcagianOPโ€ข3w ago
Honestly I'm not sure yet, but according to searches it's somehow better for tool usage, etc., whereas vLLM excels more at speeding up inference and serving simultaneous requests. SGLang is, I think, built from vLLM. Still pretty new to anything outside of Ollama though, so I'm probably not the best person to ask lol. If you do, please share haha
riverfog7
riverfog7โ€ข3w ago
What's your HW and model? Honestly Ollama is not bad for non-concurrent requests, but vLLM is way better (like literally 5+ times better) if the requests can be batched
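The batching gain is easiest to see with vLLM's offline API; a rough sketch (the model id and prompts are placeholders, and throughput will vary by hardware):

# Rough sketch of batched generation with vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # placeholder HF model id
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize request #{i} in one sentence." for i in range(64)]
outputs = llm.generate(prompts, params)                   # all 64 prompts get batched internally
for out in outputs:
    print(out.outputs[0].text[:80])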
Sarcagian
SarcagianOPโ€ข3w ago
yeah, I'm preparing to deploy in a high traffic prod environment though lol. So I need something a little more robust
riverfog7
riverfog7โ€ข3w ago
for example 2xA40 with a 70B llama
Sarcagian
SarcagianOPโ€ข3w ago
HW?
riverfog7
riverfog7โ€ข3w ago
will get about 200 tok/s with batched requests. By HW I mean the hardware you'll be deploying to
Sarcagian
SarcagianOPโ€ข3w ago
Not quite sure yet which model, but up to about 110B parameters as far as model sizes go. We'll be evaluating a bunch of different models initially before we decide on one. It'll likely be cloud hosted though
riverfog7
riverfog7โ€ข3w ago
lol that's big, are you a startup?
Sarcagian
SarcagianOPโ€ข3w ago
there are no more details on that I wish to share at the moment haha
riverfog7
riverfog7โ€ข3w ago
anyways 110b at fp8 or int4?
Sarcagian
SarcagianOPโ€ข3w ago
Very much looking forward to getting my hands on a DGX Spark and/or Station for this stuff soon though. Probably at least fp8. The smaller models I'm looking at are in the 24-32B range, and those I want to run at full fp16. I need to spend some time educating myself on the practical differences in accuracy between different quants
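Back-of-the-envelope weight memory for the sizes mentioned (weights only; KV cache and activations come on top):

# Rough weight-memory estimate: params * bytes-per-param.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * (bits / 8)   # 1e9 params * bytes / 1e9 = GB

for size in (27, 32, 110):
    print(f"{size}B: fp16={weight_gb(size, 16):.0f} GB, "
          f"fp8={weight_gb(size, 8):.0f} GB, int4={weight_gb(size, 4):.1f} GB")
# e.g. 27B: 54 / 27 / 13.5 GB; 110B: 220 / 110 / 55 GB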
riverfog7
riverfog7โ€ข3w ago
Isn't that not enough, considering it has 128 gigs of VRAM?
Sarcagian
SarcagianOPโ€ข3w ago
the spark?
riverfog7
riverfog7โ€ข3w ago
Yeah. You won't be able to batch requests so much
Sarcagian
SarcagianOPโ€ข3w ago
Yeah, I'm probably going to get a Spark for dev use; looking more at the Station for potential prod stuff. The memory bandwidth on the Spark is pretty low, but you can't beat the 128GB available for loading models. Not at that price, anyway
riverfog7
riverfog7โ€ข3w ago
If you have many users, you have to go cloud anyway, and you should have latency & throughput requirements, because more batched requests = more latency and more throughput. You have to find a middle ground there and determine the memory requirements based on that
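Once the weights fit, the memory requirement is mostly KV cache, which is what the batch size drives. A rough per-token estimate, using a hypothetical 70B-class config (80 layers, 8 KV heads, head dim 128, fp16 cache) rather than any specific deployment:

# Rough KV-cache sizing sketch; layer/head numbers are hypothetical (roughly 70B-class).
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2      # fp16 cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # factor 2 = K and V
print(per_token / 1024, "KiB per token")                       # ~320 KiB

batch, ctx = 32, 8192
print(per_token * batch * ctx / 1e9, "GB of KV cache for that batch")   # ~86 GB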
riverfog7
riverfog7โ€ข3w ago
GitHub
[Doc]: Steps to run vLLM on your RTX5080 or 5090! · Issue #14452 ...
📚 The doc issue Let's take a look at the steps required to run vLLM on your RTX5080/5090! Initial Setup: To start with, we need a container that has CUDA 12.8 and PyTorch 2.6 so that we have nv...
Sarcagian
SarcagianOPโ€ข3w ago
Same actually, about to run the build. Are you on Docker?
riverfog7
riverfog7โ€ข3w ago
Just ran the image on RunPod. I'll try to build vLLM there and, if it works, write the Dockerfile. But the terminal doesn't work 😦
Sarcagian
SarcagianOPโ€ข3w ago
ah bummer
riverfog7
riverfog7โ€ข3w ago
are you running it locally?
Sarcagian
SarcagianOPโ€ข3w ago
yeah, on build now I'm getting this:

------
Dockerfile:135
--------------------
134 |     ENV CCACHE_DIR=/root/.cache/ccache
135 | >>> RUN --mount=type=cache,target=/root/.cache/ccache \
136 | >>>     --mount=type=cache,target=/root/.cache/uv \
137 | >>>     --mount=type=bind,source=.git,target=.git \
138 | >>>     if [ "$USE_SCCACHE" != "1" ]; then \
139 | >>>         # Clean any existing CMake artifacts
140 | >>>         rm -rf .deps && \
141 | >>>         mkdir -p .deps && \
142 | >>>         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
143 | >>>     fi
144 |
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref

I never modified this section though, nor do I quite understand what it means lol
riverfog7
riverfog7โ€ข3w ago
apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    ccache

try this: installing ccache
Sarcagian
SarcagianOPโ€ข3w ago
ah nice ty
Sarcagian
SarcagianOPโ€ข3w ago
Still fairly new to Docker tbh, only been working with it for like 3-4 months now.

RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
    && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
    && apt-get update -y \
    && apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
    && apt-get install -y ffmpeg libsm6 libxext6 libgl1 \

ccache is installed right at the top of the Dockerfile though. Oh, my target was wrong; trying to eventually get to the openai server image from the base
riverfog7
riverfog7โ€ข3w ago
? Are you building the image yourself, or using the image from NVIDIA?
Sarcagian
SarcagianOPโ€ข3w ago
nvidia
riverfog7
riverfog7โ€ข3w ago
nvcr.io/nvidia/pytorch:25.02-py3 Is it this one?
Sarcagian
SarcagianOPโ€ข3w ago
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
You talking about this?
riverfog7
riverfog7โ€ข3w ago
Uhh
Sarcagian
SarcagianOPโ€ข3w ago
I must be lost lol
riverfog7
riverfog7โ€ข3w ago
Yeah, I mean this image probably has torch with Blackwell support. So the GH issue says to just install vLLM on top of it
Sarcagian
SarcagianOPโ€ข3w ago
I switched out the base image for the one you posted, but I'm still getting that ccache issue. I'm still trying to modify the official Dockerfile though
riverfog7
riverfog7โ€ข3w ago
I think you don't have to do that
riverfog7
riverfog7โ€ข3w ago
In here it says that image has the torch and python stuff
riverfog7
riverfog7โ€ข3w ago
So you have to clone vLLM and then build it with a compiler supporting Blackwell. And then it's done
Sarcagian
SarcagianOPโ€ข3w ago
gonna try it
riverfog7
riverfog7โ€ข3w ago
try this
FROM nvcr.io/nvidia/pytorch:25.02-py3 as base
WORKDIR /tmp
RUN apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
ccache \
&& apt-get clean && rm -rf /var/lib/apt/lists/*

ENV VLLM_FLASH_ATTN_VERSION=2

RUN git clone https://github.com/vllm-project/vllm.git && cd vllm
RUN python3 use_existing_torch.py && pip install -r requirements/build.txt && pip install setuptools_scm
RUN --mount=type=cache,target=/home/root/.cache/ccache MAX_JOBS=10 CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop && cd /tmp/ && rm -r vllm
RUN python3 -c "import vllm; print(vllm.__version__)"

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
Sarcagian
SarcagianOPโ€ข3w ago
kk. What I'm not quite clear on, and this is just my lack of knowledge on the subject, is that as I watch the flash-attn builds happen, it only appears to do sm80 and sm90?
riverfog7
riverfog7โ€ข3w ago
maybe it's not ready for Blackwell either
Sarcagian
SarcagianOPโ€ข3w ago
but those older compute versions should still work for flash-attn? well, on Blackwell
riverfog7
riverfog7โ€ข3w ago
hmm, I don't know CUDA well, soo
Sarcagian
SarcagianOPโ€ข3w ago
Building it now, but it takes forever once it gets to the flash-attn cmake steps. I did this once before and built a Dockerfile based on that issue page, but I was missing the ENTRYPOINT line. I think that might be all I was missing, so I think this will do it hopefully
riverfog7
riverfog7โ€ข3w ago
did u try this one?
Sarcagian
SarcagianOPโ€ข3w ago
Yeah, just started building that. MAX_JOBS=10 should speed up the cmake steps, I take it?
riverfog7
riverfog7โ€ข3w ago
if you have 10 cores yes
Sarcagian
SarcagianOPโ€ข3w ago
oh I do
riverfog7
riverfog7โ€ข3w ago
setting it much higher than the core count makes the machine sort of freeze
Sarcagian
SarcagianOPโ€ข3w ago
ah gotcha
riverfog7
riverfog7โ€ข3w ago
cuz it uses all da cores for building
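A tiny illustrative sketch for deriving MAX_JOBS from the machine instead of hard-coding it (each nvcc job can also eat several GB of RAM, so fewer jobs also caps memory use):

# Illustrative: pick MAX_JOBS from the core count so the build doesn't oversubscribe the machine.
import os
max_jobs = max(1, (os.cpu_count() or 1) - 2)   # leave a couple of cores free
print(f"MAX_JOBS={max_jobs}")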
Sarcagian
SarcagianOPโ€ข3w ago
# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git

# Change working directory to the cloned repository.
WORKDIR /tmp/vllm

had to modify it a bit to change the working dir after the clone
riverfog7
riverfog7โ€ข3w ago
oof my bad
Sarcagian
SarcagianOPโ€ข3w ago
# syntax=docker/dockerfile:1.4
FROM nvcr.io/nvidia/pytorch:25.02-py3 as base

WORKDIR /tmp

# Install required packages.
RUN apt-get update && apt-get install -y --no-install-recommends \
    kmod \
    git \
    python3-pip \
    ccache \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Set environment variable required by vLLM.
ENV VLLM_FLASH_ATTN_VERSION=2

# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git

# Change working directory to the cloned repository.
WORKDIR /tmp/vllm

# Run the preparatory script and install build dependencies.
RUN python3 use_existing_torch.py && \
    pip install -r requirements/build.txt && \
    pip install setuptools_scm

# Build vLLM from source in develop mode.
RUN --mount=type=cache,target=/root/.cache/ccache \
    MAX_JOBS=10 CCACHE_DIR=/root/.cache/ccache \
    python3 setup.py develop && \
    cd /tmp && rm -rf vllm

# Test the installation by printing the vLLM version.
RUN python3 -c "import vllm; print(vllm.__version__)"

# Set the entrypoint to start the vLLM OpenAI-compatible API server.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

now we're cookin
riverfog7
riverfog7โ€ข3w ago
Maybe if it works, building SGLang with that could work too
Sarcagian
SarcagianOPโ€ข3w ago
yeah gonna try it
riverfog7
riverfog7โ€ข3w ago
Also, I found NVIDIA's official (idk, but it says NVIDIA) image for Triton Inference Server, so that's a candidate for the SGLang base image
Sarcagian
SarcagianOPโ€ข3w ago
nicee
riverfog7
riverfog7โ€ข3w ago
just got ssh working
riverfog7
riverfog7โ€ข3w ago
it appears that torch is working with blackwell
Sarcagian
SarcagianOPโ€ข3w ago
Very nice. About halfway done building my image
riverfog7
riverfog7โ€ข3w ago
good sign on my side too
Sarcagian
SarcagianOPโ€ข3w ago
what build command and args did you use?
riverfog7
riverfog7โ€ข3w ago
It's the same (except the core count) as the Dockerfile: CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop, just this. Uh oh, it's setup.py develop, I shouldn't have deleted the code lol
Sarcagian
SarcagianOPโ€ข3w ago
not following "its setup.py develop shouldn't have deleted the code" What do you mean?
riverfog7
riverfog7โ€ข3w ago
Stack Overflow says develop links the code in the repo to site-packages. So if I delete the repo, it might break
Sarcagian
SarcagianOPโ€ข3w ago
oh the cloned repo in the container?
riverfog7
riverfog7โ€ข3w ago
yeah
Sarcagian
SarcagianOPโ€ข3w ago
Ah, gotcha. Somehow that last build locked my PC up haha, had to start over 😦
riverfog7
riverfog7โ€ข3w ago
And don't clone at /tmp if you are not gonna delete it
Sarcagian
SarcagianOPโ€ข3w ago
where should I clone to then?
riverfog7
riverfog7โ€ข3w ago
/tmp gets removed (cuz obviously its temporary)
Sarcagian
SarcagianOPโ€ข3w ago
ah yeah
riverfog7
riverfog7โ€ข3w ago
maybe the home folder or /workspace? home folder will be good
Sarcagian
SarcagianOPโ€ข3w ago
workspace will do or even /app
riverfog7
riverfog7โ€ข3w ago
Isn't that the network volume mount folder?
Sarcagian
SarcagianOPโ€ข3w ago
no idea, not that fluent in docker yet lol
riverfog7
riverfog7โ€ข3w ago
just go with /app or /vllm then
Sarcagian
SarcagianOPโ€ข3w ago
Yeah, I used /app. Just restarted the build
riverfog7
riverfog7โ€ข3w ago
If mine finishes faster, I'll give you the wheel file (I'm building with setup.py bdist_wheel). A wheel is just a prebuilt binary
Sarcagian
SarcagianOPโ€ข3w ago
nice
riverfog7
riverfog7โ€ข3w ago
It failed while building. @Sarcagian did you succeed? I think it needs a LOT of RAM
Sarcagian
SarcagianOPโ€ข3w ago
had to restart my build, only halfway through the cmake steps
riverfog7
riverfog7โ€ข3w ago
Me too. How much is your RAM? Mine failed with 96 gigs on RunPod's RTX 5090
Sarcagian
SarcagianOPโ€ข3w ago
I've got plenty haha, more than that
riverfog7
riverfog7โ€ข3w ago
why did it fail tho
Sarcagian
SarcagianOPโ€ข3w ago
No idea, haven't looked at the logs yet. Sec
riverfog7
riverfog7โ€ข3w ago
u r rich lol
Sarcagian
SarcagianOPโ€ข3w ago
no, just irresponsible with what I buy haha
riverfog7
riverfog7โ€ข3w ago
I just tried to run it on a MacBook. Failed miserably
Sarcagian
SarcagianOPโ€ข3w ago
hahaha
riverfog7
riverfog7โ€ข3w ago
had only 48 gigs 😦
Sarcagian
SarcagianOPโ€ข3w ago
Oof. I've only got like 25GB of RAM used up at the moment; that's odd it failed with that much system RAM
riverfog7
riverfog7โ€ข3w ago
idk either, maybe because it had to run with Rosetta
riverfog7
riverfog7โ€ข3w ago
It uses almost 100 gigs here XD. It's hella fast tho, the power of 32 vCPUs
riverfog7
riverfog7โ€ข3w ago
frick
riverfog7
riverfog7โ€ข3w ago
it OOMed
Sarcagian
SarcagianOPโ€ข3w ago
Oh wow lol, I wonder why the RAM usage is so high? I'm not seeing anything close to that building locally
riverfog7
riverfog7โ€ข3w ago
🥲
riverfog7
riverfog7โ€ข3w ago
this one: selected the wrong region
Sarcagian
SarcagianOPโ€ข3w ago
oh wow haha
riverfog7
riverfog7โ€ข3w ago
That type is not supported lol. Got one with 128 vCPUs and 1TB RAM
Sarcagian
SarcagianOPโ€ข3w ago
FYI don't build it in develop mode, it failed on the last step; starting over again lol
riverfog7
riverfog7โ€ข3w ago
I'm building in wheel mode. It OOMs cuz of many workers, so I'm just building with a single worker. Probably finishes building tomorrow
Jason
Jasonโ€ข3w ago
Some say SGLang is faster for certain models
Sarcagian
SarcagianOPโ€ข3w ago
Can vLLM serve multiple models simultaneously, or dynamically unload/load different models as needed?
Jason
Jasonโ€ข3w ago
I think so with the vLLM API, but it's not exposed in the serverless worker. You'll have to use pods and directly connect to vLLM, or use ports on serverless
Sarcagian
SarcagianOPโ€ข3w ago
Gotcha. Well, I made some serious progress on getting vLLM working and built for Blackwell, but after all that, it seems there's no way to compile xformers to work with torch 2.8.x dev builds, so I won't be able to use models like Gemma 3. I realize that it takes time to develop this stuff, but it's extremely frustrating that NVIDIA would release a new architecture, charge thousands of dollars for the GPU, and not support this part of the community, especially to help develop support for sm120 and CUDA 12.8. I'm beyond angry right now, I've been at this since this morning
riverfog7
riverfog7โ€ข3w ago
GitHub
Please support RTX 50XX GPUs · Issue #1856 · unslothai/unsloth
It is very challenging to run on RTX 50XX GPUs on Windows. Are there any good solutions? LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32. Has anyone encountered this error?
riverfog7
riverfog7โ€ข2w ago
This is like deep into the rabbit hole 😄 have to compile every fking thing. @Sarcagian

TORCH_CUDA_ARCH_LIST="12.0" pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

I can't test it cuz I don't have a Blackwell GPU. Ah, I can run it on RunPod maybe
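If that build goes through, a quick smoke test of the resulting xformers install might look like this (shapes are arbitrary; it should fail loudly if no kernel covers sm_120):

# Smoke-test sketch for an xformers build on a Blackwell card.
import torch
import xformers.ops as xops

q = torch.randn(1, 1024, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = xops.memory_efficient_attention(q, k, v)
print(out.shape, torch.cuda.get_device_capability(0))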
Jason
Jasonโ€ข2w ago
RTX 5090, I mean. You can use this GPU, it's Blackwell too, right?
riverfog7
riverfog7โ€ข2w ago
Yeah, his GPU is a 5090, so that should work. Can just upload the wheel instead of compiling from scratch
Jason
Jasonโ€ข2w ago
This is weird. I'm trying a 5090 on community cloud, no system logs after like 15 mins ish, and then suddenly the web terminal also disconnected, but container logs are there
riverfog7
riverfog7โ€ข2w ago
lol i used it yesterday and it was fine (with a custom image tho)
Jason
Jasonโ€ข2w ago
i used runpod's pytorch
riverfog7
riverfog7โ€ข2w ago
It's not supported with Blackwell, I mean it doesn't work for Blackwell cuz of CUDA capability issues
Jason
Jasonโ€ข2w ago
hmm? Oh, it's 12.8 tho, doesn't that mean it supports Blackwell too?
riverfog7
riverfog7โ€ข2w ago
there is an image with cuda 12.8? oh
Jason
Jasonโ€ข2w ago
Yeah maybe it's a new one
riverfog7
riverfog7โ€ข2w ago
When did that pop up? My reason for living just disappeared wtf. Why is there an image with a different torch version than what I've built the wheel for?
Jason
Jasonโ€ข2w ago
huh what do you mean? maybe its a dev build?
riverfog7
riverfog7โ€ข2w ago
I built the thing for torch 2.7.1, but the image is 2.8.0, so I probably can't use the wheel with that image. Have to stick with NVIDIA's bloated one
Sarcagian
SarcagianOPโ€ข2w ago
Oh man, I had just sworn off continuing to pursue this, deciding to just wait for official support. And then you throw this at me lol. Now I'm gonna have to go back at it at least a little today. I realized though, I've learned a ton of good info I didn't know two days ago throughout trying to solve this problem lol. So not all bad.
riverfog7
riverfog7โ€ข2w ago
nah ur not fully in that rabbit hole
Sarcagian
SarcagianOPโ€ข2w ago
Hahahaha
riverfog7
riverfog7โ€ข2w ago
you have to compile Triton and SGLang 😆
Sarcagian
SarcagianOPโ€ข7d ago
I think I just did it haha. I'll post details later. Need sleep.
