Serverless Requests Queuing Forever
Title says it all - I send a request to my serverless endpoint (just a test through the runpod website UI), and even though all of my workers are healthy, the request has just been sitting in the queue for over a minute.
Am I being charged for time spent in queue as well as time spent on actual inference? If that's the case, then I'm burning a lot of money very fast lol. Am I doing something wrong?
178 Replies
U need to specify more info like the image and model u r using
you're charged only when a worker is running, including while it loads the model (not time in queue specifically, but it can overlap). Look at the worker tab; when it's green it's running
Check a worker then check the log
understood, that clears that up at least. I'm running vLLM and attempting to use the gemma3 27b it model from google's repository
okay
the workers were all fully initialized and ready, but once I queued the request, they just didnt seem to do anything. I was getting a tokenizer error in the logs for the first model I tried running but didnt see any errors on the second
and which gpu model are you using?
I believe I had selected an H100
ic
no logs at all?
if you can please export or download the logs and just send it here
I will on my next attempt, had to move onto another project for a little while
And how long did u wait?
If you didn't specify a network volume, downloading models can take a long time
27b model is about 54GB
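e.g. you can pre-pull the weights onto the volume once so workers don't re-download 54GB on every cold start, roughly like this (paths are just an example; iirc serverless volumes mount at /runpod-volume):
# one-off download of the weights onto the network volume
pip install -U "huggingface_hub[cli]"
# huggingface-cli login first if the repo is gated (gemma is)
huggingface-cli download google/gemma-3-27b-it --local-dir /runpod-volume/models/gemma-3-27b-it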
I watched the workers complete the download in their logs live
How about model load?
Can you upload the logs here
Like I said, I will once I go to try again. I've already removed that endpoint unfortunately.
One possibility is an old vllm version that doesn't support gemma3
The PR was merged 26 days ago
GitHub
[Model] Add support for Gemma 3 by WoosukKwon · Pull Request #1466...
This PR adds the support for Gemma 3, an open-source vision-language model from Google.
NOTE:
The PR doesn't implement the pan-and-scan pre-processing algorithm. It will be implemented by ...
I was wondering about that - I just assumed that the default VLLM container on the serverless option was up to date though
Can I use any container off of a registry like you can with normal pods?
From what i know it needs a handler for the requests
But you can always build the vllm container with the latest vllm
Dockerfile should be in runpod's official repo
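iirc the serverless worker lives somewhere like github.com/runpod-workers/worker-vllm (name from memory, double-check), so roughly:
# clone runpod's vllm worker and rebuild it with whatever vllm you need
git clone https://github.com/runpod-workers/worker-vllm
cd worker-vllm
docker build -t my-vllm-worker .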
haha yeah I've actually been trying to do that so I can build it with support for a 5090
which also is not going well
I can build the image just fine, it's just not compiling with the right CUDA version no matter the modifications I make to the dockerfile
Wdym by support for 5090
anyway that's unrelated
default image doesnt use CUDA 12.8
It should be fine tho because no one uses cuda 12.8. It's too new
Cuda 12.4 should work fine with a 5090
@Sarcagian what's ur max model len?
vllm 0.8.2 supports gemma3 already
what do you mean its not compiling with the right cuda version
how did you check it, i thought you just choose a base image for that
Yeah just looked that up
yeah that's what I did. I'm not familiar enough yet with VLLM to give better info unfortunately. Probably going to need another few hours of trying to figure this out to get to that point lol
Cuda related stuff always causes headaches
no really, how did you check it, i wanna know
Check what specifically?
The logs say as much in the container after powering it on once it's built.
Check the cuda version
"It is not compiling with the right cuda version"
right - I changed the base image via the ARG for cuda version in the dockerfile. I'm going to go back at it later tonight but just havent had the chance yet.
ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.8.1
ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive
then later on there's an arg/env value called torch_cuda_arch_list, which searches seem to indicate I should set to either 12.8 or 12.8.1. I believe this has something to do with how the flash attn modules are compiled, or rather which versions of cuda they compile for
but then after building the image with all of those changes, I still get the following when starting the container:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
More info in this thread: https://github.com/vllm-project/vllm/issues/14452
Lots of other apps and projects out there where people are having the same issue with blackwell compatibility
anyway, I just havent taken all the time needed to fully look into this nor is this what the current thread is about lol
Try using 12.1-12.4
12.1-12.4 are not compatible with Blackwell cards though
it must be 12.8 or later
Is this real?
as far as I can tell yes
I'm not familiar enough with all of the intricate details but they don't support the new sm_120 compute capabilities of the blackwell cards just yet
I didn't know this until after I bought the card obviously, but tbh I'm still keeping it since I'm sure the support will come soon
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-12-8-update-1-release-notes
"This release adds compiler support for the following Nvidia Blackwell GPU architectures:
SM_100
SM_101
SM_120"
So actually it appears you need CUDA 12.8.1 specifically for blackwell
1. CUDA 12.8 Update 1 Release Notes - Release Notes 12.8 document...
The Release Notes for the CUDA Toolkit.
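you can check what the card/driver actually report with nvidia-smi (the compute_cap query needs a fairly recent driver, I think):
# driver version plus the compute capability the card reports (a 5090 should say 12.0)
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv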
Did you try other versions?
Like 12.4
Cards with higher cuda compute capabilities should support lower cuda versions
It could be a version mismatch between your graphics driver and pytorch
hmm interesting. I'm on 570.124 on Linux so could be something there. Havent tried anything in windows but maybe I'll give that a shot next
I mean cuda toolkit
What does your nvcc -V
Print?
well my driver situation is pretty messed up, but I still don't think that's the issue exactly. CUDA applications that explicitly support the new compute capabilities and CUDA version seem to work just fine.
Any results?
It should print 12.8 something
If their container image is 12.8 then it will be that. Is the pod host 12.8 (via the pod create filter)?
It's all a bit more complex than I thought after more research. For now I'm just sticking with ollama locally until full explicit support for Blackwell is included in a vLLM release
https://github.com/vllm-project/vllm/issues/13306
Okay i get it
GitHub
[Feature]: Support for RTX 5090 (CUDA 12.8) · Issue #13306 · vllm...
🚀 The feature, motivation and pitch Currently only nightlies from torch targeting 12.8 support blackwell such as the rtx 5090. I tried using VLLM with a rtx 5090 and no dice. Vanilla vllm installat...
huh requires custom vllm build and nightly packages nice
yeah, I haven't gone back to it but I did try this once and it didn't quite work. Going to give it another go right now I think.
oh, this wasn't the issue/post I was following instructions from, this one is way better
damn thank you
this will probably work, now just need to find something similar for SGLang
@Sarcagian the issue mentions torch version upgrades so doing that may make sglang work too
I think this is what I was missing
thank you guys so much, seriously
Btw can you give us an update if it works?
Just in case someone has to use a B200 or a RTX 5090 to deploy vllm
will do
oh you know what? SGLang's dockerfile uses triton server as its base image, so that's going to be inherently different. I have zero knowledge yet on Triton haha. Maybe it can be swapped for a different base image, or the same one that vLLM uses? Not too sure if there are specific dependencies there with triton.
If its torch based wont that solution work?
@Sarcagian why r u using sglang tho?
Im interested in building a dockerfile because i may use it in the future
honestly I'm not sure yet, but according to searches it's somehow better for tool usage, etc., whereas vLLM excels more at speeding up inference and serving simultaneous requests
SGLang is I think built from vLLM though
still pretty new to anything outside of Ollama so I'm probably not the best person to ask lol
if you do please share haha
what's your HW and model
honestly ollama is not bad for non-concurrent requests
but vllm is way better (like literally 5+ times better) if the requests can be batched
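rough way to see it yourself: fire a batch of concurrent requests at the OpenAI-compatible endpoint and compare wall time against doing them one at a time (model name is just a placeholder, assuming the server is on the default local port):
MODEL="your-model-here"   # whatever the server is actually serving
time (
  for i in $(seq 1 16); do
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"Hello\", \"max_tokens\": 64}" > /dev/null &
  done
  wait
)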
yeah, I'm preparing to deploy in a high traffic prod environment though lol. So I need something a little more robust
for example
2xA40 with a 70B llama
HW?
will get about 200tok/s with batched requests
hardware you'll be deploying to
not sure quite yet which model but up to about 110b parameters as far as model sizes go. We'll be evaluating a bunch of different models initially before we decide on one
it'll likely be cloud hosted though
lol its big
r u a startup?
no more details on that I wish to share at the moment haha
anyways
110b at fp8 or int4?
very much looking forward to getting my hands on a DGX spark and/or station for this stuff soon though
probably at least fp8
smaller models that I'm looking at are in the 24-32b range and those I want to run at full fp16
I need to spend some time educating myself on the practical differences in accuracy between different quants
isn't it not enough, considering it has 128 gigs of vram?
the spark?
you won't be able to batch requests that much
yeah
yeah I'm going to probably get a spark for dev use, looking more at the station for potential prod stuff
the memory bandwidth on the spark is pretty low, but can't beat the 128GB available for loading models
not at that price anyway
if you have many users you have to go cloud anyway
and you should have the latency & throughput requirements
cause more batched requests = more latency and more throughput
have to find a middle ground there
and determine the memory requirements based on that
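back-of-envelope for the weights alone, KV cache and activations come on top (just arithmetic, nothing vllm-specific):
# 110B parameters at different precisions, weights only
python3 -c "p = 110e9; print(f'fp16: {p*2/1e9:.0f} GB  fp8: {p/1e9:.0f} GB  int4: {p*0.5/1e9:.0f} GB')"
# so roughly 220 / 110 / 55 GB before you even get to the KV cache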
GitHub
[Doc]: Steps to run vLLM on your RTX5080 or 5090! · Issue #14452 ...
📚 The doc issue Let's take a look at the steps required to run vLLM on your RTX5080/5090! Initial Setup: To start with, we need a container that has CUDA 12.8 and PyTorch 2.6 so that we have nv...
same actually, about to run the build
are you on docker?
just ran the image on runpod
I'll try to build vllm there and if it works, write the dockerfile
but the terminal doesn't work
ah bummer
are you running it locally?
yeah
on build now I'm getting this:
------
Dockerfile:135
--------------------
134 | ENV CCACHE_DIR=/root/.cache/ccache
135 | >>> RUN --mount=type=cache,target=/root/.cache/ccache \
136 | >>> --mount=type=cache,target=/root/.cache/uv \
137 | >>> --mount=type=bind,source=.git,target=.git \
138 | >>> if [ "$USE_SCCACHE" != "1" ]; then \
139 | >>> # Clean any existing CMake artifacts
140 | >>> rm -rf .deps && \
141 | >>> mkdir -p .deps && \
142 | >>> python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
143 | >>> fi
144 |
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref
I never modified this section though, nor do I quite understand what it means lol
apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
ccache
try this
installing ccache
ah nice ty

still fairly new to docker tbh, only been working with it for like 3-4 months now
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
&& apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
ccache is installed right at the top of the dockerfile though
oh
my target was wrong
trying to eventually get to the openai server image from the base
?
Are you building the image yourself or using the image from nvidia
nvidia
nvcr.io/nvidia/pytorch:25.02-py3
Is it this one?
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
You talking about this?
Uhh
I must be lost lol
Yeah
I mean this image probably has torch with blackwell support
So the gh issue says just install vllm on top of it
I switched out the base image for the one you posted, but still getting that ccache issue
I'm still trying to modify the offical dockerfile though
I think you don't have to do that
In here it says that image has the torch and python stuff

So you have to clone vllm and then build it with the compiler supporting blackwell
And then its done
gonna try it
try this
kk
what I'm not quite clear on, and this is just my lack of knowledge on the subject, is that as I watch the flash attn builds happen, it only appears to do sm80 and sm90?
maybe its not ready for blackwell too
but those older compute versions should still work for flash attn?
well on blackwell
hmm i dont know cuda well soo
building it now but it takes forever once it gets to the flash attn cmake steps. I did this once before and built a dockerfile based on that issue page, but I was missing the entrypoint line
I think that might be all I was missing, so I think this will do it hopefully
did u try this one?
yeah just started building that
MAX_JOBS=10 should speed up the cmake steps I take it?
if you have 10 cores yes
oh I do
setting it much higher than the core count makes the machine sort of freeze
ah gotcha
cuz it uses all da cores for building
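yeah, pinning it to the actual core count is the usual trick, something like (sketch):
# cap the parallel compile jobs at the number of cores so the box stays responsive
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel --dist-dir=dist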
# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git
# Change working directory to the cloned repository.
WORKDIR /tmp/vllm
had to modify it a bit to change working dir after clone
oof my bad
# syntax=docker/dockerfile:1.4
FROM nvcr.io/nvidia/pytorch:25.02-py3 as base
WORKDIR /tmp
# Install required packages.
RUN apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
ccache \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Set environment variable required by vLLM.
ENV VLLM_FLASH_ATTN_VERSION=2
# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git
# Change working directory to the cloned repository.
WORKDIR /tmp/vllm
# Run the preparatory script and install build dependencies.
RUN python3 use_existing_torch.py && \
pip install -r requirements/build.txt && \
pip install setuptools_scm
# Build vLLM from source in develop mode.
RUN --mount=type=cache,target=/root/.cache/ccache \
MAX_JOBS=10 CCACHE_DIR=/root/.cache/ccache \
python3 setup.py develop && \
cd /tmp && rm -rf vllm
# Test the installation by printing the vLLM version.
RUN python3 -c "import vllm; print(vllm.__version__)"
# Set the entrypoint to start the vLLM OpenAI-compatible API server.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
now we're cookin
maybe if it works, building sglang with that could work
yeah gonna try it
also i found nvidia's official (idk but it says nvidia) image for triton inference server
so that's a candidate for sglang base image
nicee
just got ssh working
it appears that torch is working with blackwell

very nice
about halfway done building my image
good sign on my side too

what build command and args did you use?
its the same (except the core count) as the dockerfile
CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop
just this
uhoh
its setup.py develop
shouldn't have deleted the code
lol
not following
"its setup.py develop
shouldn't have deleted the code"
What do you mean?
Stackoverflow says develop links the code in the repo to site packages
So if i delete the repo it might break
oh the cloned repo in the container?
yeah
ah gotcha
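fwiw the clean fix is probably just to build a proper wheel instead of develop mode, then deleting the checkout is safe; the build step would look something like this (untested sketch):
# inside the image, from the cloned vllm repo:
python3 use_existing_torch.py
pip install -r requirements/build.txt setuptools_scm
MAX_JOBS=10 python3 setup.py bdist_wheel --dist-dir=dist
pip install dist/*.whl
# now deleting the cloned repo is fine, site-packages has its own copy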
somehow that last build locked my PC up haha, had to start over
and dont clone at /tmp if you are not gonna delete it
where should I clone to then?
/tmp gets removed (cuz obviously its temporary)
ah yeah
maybe the home folder or /workspace?
home folder will be good
workspace will do
or even /app
isnt it the network volume mount folder?
no idea, not that fluent in docker yet lol
just go with /app or /vllm then
yeah I used /app
just restarted the build
if mine finishes faster I'll give you the wheel file (I'm building with setup.py bdist_wheel)
a wheel is just a prebuilt binary
nice
it failed
while building
@Sarcagian did u succeed?
i think it needs a LOT of ram
had to restart my build, only halfway through the cmake steps
me too
how much is ur ram?
mine failed with 96gigs in runpod's RTX5090
I've got plenty haha, more than that
why did it fail tho
no idea, havent looked at the logs yet
sec
u r rich lol
no, just irresponsible with what I buy haha
i just tried to run it on a macbook
failed miserably
hahaha
had only 48gigs
oof
I've only got like 25GB of RAM used up at the moment, that's odd it failed with that much system RAM
idk either
maybe because it had to run with rosetta

it uses almost 100gigs here XD
its hella fast tho
power of 32 vCPUs
frick

it OOMed
oh wow lol, I wonder why the RAM usage is so high?
not having anything close to that building locally
this one selected wrong region

oh wow haha
that type is not supported lol
got one with 128vcpus and 1tb ram
FYI dont build it in develop mode, it failed on the last step, starting over again lol
I'm building in wheel mode
it OOMs cuz of many workers so I'm just building with a single worker
probably finishes building tomorrow
some says sglang is faster
in certain models
can vllm serve multiple models simultaneously or dynamically unload/load different models as needed?
I think so with the vllm api but it's not exposed in serverless, you'll have to use pods and directly connect to vllm
Or use ports on serverless
gotcha
well I made some serious progress on getting vLLM working and built for blackwell, but after all that, it seems there's no way to compile xformers to work with torch 2.8.x dev builds so I wont be able to use models like gemma3
I realize that it takes time to develop this stuff, but it's extremely frustrating that nvidia would release a new architecture, charge thousands of dollars for the GPU, and not support this part of the community, especially when it comes to helping develop support for sm120 and cuda 12.8
I'm beyond angry right now, I've been at this since this morning
GitHub
Please support RTX 50XX GPUs · Issue #1856 · unslothai/unsloth
It is very challenging to run on RTX 50XX GPUs on Windows. Are there any good solutions? LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32. Has anyone encountered this error?

this is like deep into the rabbit hole
have to compile every fking thing
@Sarcagian TORCH_CUDA_ARCH_LIST="12.0" pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
i cant test it cuz i dont have a blackwell gpu
ah i can run it on runpod maybe
rtx5090 i mean, you can use this gpu, it's blackwell too right?
yeah
his gpu is a 5090 so that should work
can just upload the wheel instead of compiling from scratch
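to get a reusable wheel out of it rather than installing in place, probably something like this (sketch, not tested on blackwell):
# build the xformers wheel once on a blackwell box, then just share/install the .whl
TORCH_CUDA_ARCH_LIST="12.0" pip wheel -v --no-deps \
  "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers" -w dist/
pip install dist/xformers-*.whl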
this is weird, I'm trying a 5090 on community cloud, no system logs after like 15 mins ish
and then suddenly web terminal also disconnected, but container logs are there
lol i used it yesterday and it was fine
(with a custom image tho)
i used runpod's pytorch
its not supported
with blackwell
i mean it doesnt work for blackwell cuz
cuda capability issues
hmm? oh
it's 12.8 tho, doesn't that mean it supports blackwell too?
there is an image with cuda 12.8?
oh
Yeah maybe it's a new one
when did it pop up
the reason of life just disappeared
wtf why is there an image with a different torch version than what I've built the wheel for
huh
what do you mean? maybe its a dev build?
i built the thing for torch 2.7.1 but the image is 2.8.0 so i probably cant use the wheel with that image
have to stick with nvidia's bloated image
Oh man, I had just sworn off continuing to pursue this and just waiting for official support. And then you throw this at me lol. Now I'm gonna have to go back at it at least a little today.
I realized though, I've learned a ton of good info I didn't know two days ago throughout trying to solve this problem lol. So not all bad.
nah ur not fully in that rabbit hole
Hahahaha
u have to compile triton and
sglang
I think I just did it haha. I'll post details later. Need sleep.