Serverless Requests Queuing Forever
Title says it all - I send a request to my serverless endpoint (just a test through the runpod website UI), and even though all of my workers are healthy, the request has just been sitting in the queue for over a minute.
Am I being charged for time spent in queue as well as time spent on actual inference? If that's the case, then I'm burning a lot of money very fast lol. Am I doing something wrong?
178 Replies
U need to specify more info like the image and model u r using
you're charged only when a worker is running, including while it loads the model (not time in queue specifically, but it can overlap). Look at the worker tab; when it's green it's running
Check a worker then check the log
understood, that clears that up at least. I'm running vLLM and attempting to use the gemma3 27b it model from google's repository
okay
the workers were all fully initialized and ready, but once I queued the request, they just didnt seem to do anything. I was getting a tokenizer error in the logs for the first model I tried running but didnt see any errors on the second
and which gpu model are you using?
I believe I had selected an H100
ic
no logs at all?
if you can please export or download the logs and just send it here
I will on my next attempt, had to move onto another project for a little while
And how long did u wait?
If you didn't specify a network volume, downloading models can take a long time
27b model is about 54GB
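e.g. you can pre-pull the weights onto the volume once so workers don't re-download 54GB on every cold start, roughly like this (paths are just an example; iirc serverless volumes mount at /runpod-volume):
# one-off download of the weights onto the network volume
pip install -U "huggingface_hub[cli]"
# huggingface-cli login first if the repo is gated (gemma is)
huggingface-cli download google/gemma-3-27b-it --local-dir /runpod-volume/models/gemma-3-27b-it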
I watched the workers complete the download in their logs live
How about model load?
Can you upload the logs here
Like I said, I will once I go to try again. I've already removed that endpoint unfortunately.
One possibility is an old vllm version that doesn't support gemma3
The PR was merged 26 days ago
GitHub
[Model] Add support for Gemma 3 by WoosukKwon · Pull Request #1466...
This PR adds the support for Gemma 3, an open-source vision-language model from Google.
NOTE:
The PR doesn't implement the pan-and-scan pre-processing algorithm. It will be implemented by ...
I was wondering about that - I just assumed that the default VLLM container on the serverless option was up to date though
Can I use any container off of a registry like you can with normal pods?
From what i know it needs a handler for the requests
But you can always build the vllm container with the latest vllm
Dockerfile should be in runpod's official repo
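iirc the serverless worker lives somewhere like github.com/runpod-workers/worker-vllm (name from memory, double-check), so roughly:
# clone runpod's vllm worker and rebuild it with whatever vllm you need
git clone https://github.com/runpod-workers/worker-vllm
cd worker-vllm
docker build -t my-vllm-worker .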
haha yeah I've actually been trying to do that so I can build it with support for a 5090
which also is not going well
I can build the image just fine, it's just not compiling with the right CUDA version no matter the modifications I make to the dockerfile
Wdym by support for 5090
anyway that's unrelated
default image doesnt use CUDA 12.8
It should be fine tho because no one uses cuda 12.8. It's too new
Cuda 12.4 should work fine with a 5090
@Sarcagian what's ur max model len?
vllm 0.8.2 supports gemma3 already
what do you mean its not compiling with the right cuda version
how did you check it, i thought you just choose a base image for that
Yeah just looked that up
yeah that's what I did. I'm not familiar enough yet with VLLM to give better info unfortunately. Probably going to need another few hours of trying to figure this out to get to that point lol
Cuda related stuff always causes headaches
no really, how did you check it, i wanna know
Check what specifically?
The logs say as much in the container after powering it on once it's built.
Check the cuda version
"It is not compiling with the right cuda version"
right - I changed the base image via the ARG for cuda version in the dockerfile. I'm going to go back at it later tonight but just havent had the chance yet.
ARG CUDA_VERSION=12.8.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
ARG CUDA_VERSION=12.8.1
ARG PYTHON_VERSION=3.12
ARG TARGETPLATFORM
ENV DEBIAN_FRONTEND=noninteractive
then later on there's an arg/env value called torch_cuda_arch_list, which searches seem to indicate I should set to either 12.8 or 12.8.1. I believe this has something to do with how the flash attn modules are compiled, or rather which versions of cuda they compile for
but then after building the image with all of those changes, I still get the following when starting the container:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
More info in this thread: https://github.com/vllm-project/vllm/issues/14452
Lots of other apps and projects out there where people are having the same issue with blackwell compatibility
anyway, I just havent taken all the time needed to fully look into this nor is this what the current thread is about lol
Try using 12.1-12.4
12.1-12.4 are not compatible with Blackwell cards though
it must be 12.8 or later
Is this real?
as far as I can tell yes
I'm not familiar enough with all of the intricate details but they don't support the new sm_120 compute capabilities of the blackwell cards just yet
I didn't know this until after I bought the card obviously, but tbh I'm still keeping it since I'm sure the support will come soon
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cuda-12-8-update-1-release-notes
"This release adds compiler support for the following Nvidia Blackwell GPU architectures:
SM_100
SM_101
SM_120"
So actually it appears you need CUDA 12.8.1 specifically for blackwell
1. CUDA 12.8 Update 1 Release Notes - Release Notes 12.8 document...
The Release Notes for the CUDA Toolkit.
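you can check what the card/driver actually report with nvidia-smi (the compute_cap query needs a fairly recent driver, I think):
# driver version plus the compute capability the card reports (a 5090 should say 12.0)
nvidia-smi --query-gpu=name,driver_version,compute_cap --format=csv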
Did you try other versions?
Like 12.4
Cards with higher cuda compute capabilities should support lower cuda versions
It could be a version mismatch between your graphics driver and pytorch
hmm interesting. I'm on 570.124 on Linux so could be something there. Havent tried anything in windows but maybe I'll give that a shot next
I mean cuda toolkit
What does your nvcc -V
Print?
well my driver situation is pretty messed up, but I still don't think that's the issue exactly. CUDA applications that explicitly support the new compute capabilities and CUDA version seem to work just fine.
Any results?
It should print 12.8 something
If their container image is 12.8 then it will be that. Is the pod host 12.8 (via the pod create filter)?
It's all a bit more complex than I thought after more research. For now I'm just sticking with ollama locally until full explicit support for Blackwell is included in a vLLM release
https://github.com/vllm-project/vllm/issues/13306
Okay i get it
GitHub
[Feature]: Support for RTX 5090 (CUDA 12.8) · Issue #13306 · vllm...
🚀 The feature, motivation and pitch Currently only nightlies from torch targeting 12.8 support blackwell such as the rtx 5090. I tried using VLLM with a rtx 5090 and no dice. Vanilla vllm installat...
huh requires custom vllm build and nightly packages nice
yeah, I haven't gone back to it but I did try this once and it didn't quite work. Going to give it another go right now I think.
oh, this wasn't the issue/post I was following instructions from, this one is way better
damn thank you
this will probably work, now just need to find something similar for SGLang
@Sarcagian the issue mentions torch version upgrades so doing that may make sglang work too
I think this is what I was missing
thank you guys so much, seriously
Btw can you give us an update if it works?
Just in case someone has to use a B200 or a RTX 5090 to deploy vllm
will do
oh you know what? SGLang's dockerfile uses triton server as its base image, so that's going to be inherently different. I have zero knowledge yet on Triton haha. Maybe it can be swapped for a different base image, or the same one that vLLM uses? Not too sure if there are specific dependencies there with triton.
If its torch based wont that solution work?
@Sarcagian why r u using sglang tho?
Im interested in building a dockerfile because i may use it in the future
honestly I'm not sure yet, but according to searches it's somehow better for tool usage, etc., whereas vLLM excels more at speeding up inference and serving simultaneous requests
SGLang is I think built from vLLM though
still pretty new to anything outside of Ollama so I'm probably not the best person to ask lol
if you do please share haha
what's your HW and model
honestly ollama is not bad for non-concurrent requests
but vllm is way better (like literally 5+ times better) if the requests can be batched
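rough way to see it yourself: fire a batch of concurrent requests at the OpenAI-compatible endpoint and compare wall time against doing them one at a time (model name is just a placeholder, assuming the server is on the default local port):
MODEL="your-model-here"   # whatever the server is actually serving
time (
  for i in $(seq 1 16); do
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d "{\"model\": \"$MODEL\", \"prompt\": \"Hello\", \"max_tokens\": 64}" > /dev/null &
  done
  wait
)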
yeah, I'm preparing to deploy in a high traffic prod environment though lol. So I need something a little more robust
for example
2xA40 with a 70B llama
HW?
will get about 200tok/s with batched requests
hardware you'll be deploying to
not sure quite yet which model but up to about 110b parameters as far as model sizes go. We'll be evaluating a bunch of different models initially before we decide on one
it'll likely be cloud hosted though
lol its big
r u a startup?
no more details on that I wish to share at the moment haha
anyways
110b at fp8 or int4?
very much looking forward to getting my hands on a DGX spark and/or station for this stuff soon though
probably at least fp8
smaller models that I'm looking at are in the 24-32b range and those I want to run at full fp16
I need to spend some time educating myself on the practical differences in accuracy between different quants
isn't it not enough, considering it has 128 gigs of vram?
the spark?
you won't be able to batch requests that much
yeah
yeah I'm going to probably get a spark for dev use, looking more at the station for potential prod stuff
the memory bandwidth on the spark is pretty low, but can't beat the 128GB available for loading models
not at that price anyway
if you have many users you have to go cloud anyway
and you should have the latency & throughput requirements
cause more batched requests = more latency and more throughput
have to find a middle ground there
and determine the memory requirements based on that
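back-of-envelope for the weights alone, KV cache and activations come on top (just arithmetic, nothing vllm-specific):
# 110B parameters at different precisions, weights only
python3 -c "p = 110e9; print(f'fp16: {p*2/1e9:.0f} GB  fp8: {p/1e9:.0f} GB  int4: {p*0.5/1e9:.0f} GB')"
# so roughly 220 / 110 / 55 GB before you even get to the KV cache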
GitHub
[Doc]: Steps to run vLLM on your RTX5080 or 5090! · Issue #14452 ...
📚 The doc issue Let's take a look at the steps required to run vLLM on your RTX5080/5090! Initial Setup: To start with, we need a container that has CUDA 12.8 and PyTorch 2.6 so that we have nv...
same actually, about to run the build
are you on docker?
just ran the image on runpod
I'll try to build vllm there and if it works, write the dockerfile
but the terminal doesn't work
ah bummer
are you running it locally?
yeah
on build now I'm getting this:
------
Dockerfile:135
--------------------
134 | ENV CCACHE_DIR=/root/.cache/ccache
135 | >>> RUN --mount=type=cache,target=/root/.cache/ccache \
136 | >>> --mount=type=cache,target=/root/.cache/uv \
137 | >>> --mount=type=bind,source=.git,target=.git \
138 | >>> if [ "$USE_SCCACHE" != "1" ]; then \
139 | >>> # Clean any existing CMake artifacts
140 | >>> rm -rf .deps && \
141 | >>> mkdir -p .deps && \
142 | >>> python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
143 | >>> fi
144 |
--------------------
ERROR: failed to solve: failed to compute cache key: failed to calculate checksum of ref
I never modified this section though, nor do I quite understand what it means lol
apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
ccache
try this
installing ccache
ah nice ty

still fairly new to docker tbh, only been working with it for like 3-4 months now
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
&& echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
&& apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim python3-pip \
&& apt-get install -y ffmpeg libsm6 libxext6 libgl1 \
ccache is installed right at the top of the dockerfile though
oh
my target was wrong
trying to eventually get to the openai server image from the base
?
Are you building the image yourself or using the image from nvidia
nvidia
nvcr.io/nvidia/pytorch:25.02-py3
Is it this one?
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu20.04 AS base
You talking about this?
Uhh
I must be lost lol
Yeah
I mean this image probably has torch with blackwell support
So the gh issue says just install vllm on top of it
I switched out the base image for the one you posted, but still getting that ccache issue
I'm still trying to modify the offical dockerfile though
I think you don't have to do that
In here it says that image has the torch and python stuff

So you have to clone vllm and then build it with the compiler supporting blackwell
And then its done
gonna try it
try this
kk
what I'm not quite clear on, and this is just my lack of knowledge on the subject, is that as I watch the flash attn builds happen, it only appears to do sm80 and sm90?
maybe its not ready for blackwell too
but those older compute versions should still work for flash attn?
well on blackwell
hmm i dont know cuda well soo
building it now but it takes forever once it gets to the flash attn cmake steps. I did this once before and built a dockerfile based on that issue page, but I was missing the entrypoint line
I think that might be all I was missing, so I think this will do it hopefully
did u try this one?
yeah just started building that
MAX_JOBS=10 should speed up the cmake steps I take it?
if you have 10 cores yes
oh I do
setting it much higher than the core count makes the machine sort of freeze
ah gotcha
cuz it uses all da cores for building
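yeah, pinning it to the actual core count is the usual trick, something like (sketch):
# cap the parallel compile jobs at the number of cores so the box stays responsive
MAX_JOBS=$(nproc) python3 setup.py bdist_wheel --dist-dir=dist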
# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git
# Change working directory to the cloned repository.
WORKDIR /tmp/vllm
had to modify it a bit to change working dir after clone
oof my bad
# syntax=docker/dockerfile:1.4
FROM nvcr.io/nvidia/pytorch:25.02-py3 as base
WORKDIR /tmp
# Install required packages.
RUN apt-get update && apt-get install -y --no-install-recommends \
kmod \
git \
python3-pip \
ccache \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
# Set environment variable required by vLLM.
ENV VLLM_FLASH_ATTN_VERSION=2
# Clone the vLLM repository.
RUN git clone https://github.com/vllm-project/vllm.git
# Change working directory to the cloned repository.
WORKDIR /tmp/vllm
# Run the preparatory script and install build dependencies.
RUN python3 use_existing_torch.py && \
pip install -r requirements/build.txt && \
pip install setuptools_scm
# Build vLLM from source in develop mode.
RUN --mount=type=cache,target=/root/.cache/ccache \
MAX_JOBS=10 CCACHE_DIR=/root/.cache/ccache \
python3 setup.py develop && \
cd /tmp && rm -rf vllm
# Test the installation by printing the vLLM version.
RUN python3 -c "import vllm; print(vllm.__version__)"
# Set the entrypoint to start the vLLM OpenAI-compatible API server.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
now we're cookin
maybe if it works, building sglang with that could work
yeah gonna try it
also i found nvidia's official (idk but it says nvidia) image for triton inference server
so that's a candidate for sglang base image
nicee
just got ssh working
it appears that torch is working with blackwell

very nice
about halfway done building my image
good sign on my side too

what build command and args did you use?
its the same (except the core count) as the dockerfile
CCACHE_DIR=/home/root/.cache/ccache python3 setup.py develop
just this
uhoh
its setup.py develop
shouldn't have deleted the code
lol
not following
"its setup.py develop
shouldn't have deleted the code"
What do you mean?
Stackoverflow says develop links the code in the repo to site packages
So if i delete the repo it might break
oh the cloned repo in the container?
yeah
ah gotcha
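fwiw the clean fix is probably just to build a proper wheel instead of develop mode, then deleting the checkout is safe; the build step would look something like this (untested sketch):
# inside the image, from the cloned vllm repo:
python3 use_existing_torch.py
pip install -r requirements/build.txt setuptools_scm
MAX_JOBS=10 python3 setup.py bdist_wheel --dist-dir=dist
pip install dist/*.whl
# now deleting the cloned repo is fine, site-packages has its own copy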
somehow that last build locked my PC up haha, had to start over
and dont clone at /tmp if you are not gonna delete it
where should I clone to then?
/tmp gets removed (cuz obviously its temporary)
ah yeah
maybe the home folder or /workspace?
home folder will be good
workspace will do
or even /app
isnt it the network volume mount folder?
no idea, not that fluent in docker yet lol
just go with /app or /vllm then
yeah I used /app
just restarted the build
if mine finishes faster I'll give you the wheel file (I'm building with setup.py bdist_wheel)
a wheel is just a prebuilt binary
nice
it failed
while building
@Sarcagian did u succeed?
i think it needs a LOT of ram
had to restart my build, only halfway through the cmake steps
me too
how much is ur ram?
mine failed with 96gigs in runpod's RTX5090
I've got plenty haha, more than that
why did it fail tho
no idea, havent looked at the logs yet
sec
u r rich lol
no, just irresponsible with what I buy haha
i just tried to run it on a macbook
failed miserably
hahaha
had only 48gigs
oof
I've only got like 25GB of RAM used up at the moment, that's odd it failed with that much system RAM
idk either
maybe because it had to run with rosetta

it uses almost 100gigs here XD
its hella fast tho
power of 32 vCPUs
frick

it OOMed
oh wow lol, I wonder why the RAM usage is so high?
not having anything close to that building locally
this one selected wrong region

oh wow haha
that type is not supported lol
got one with 128vcpus and 1tb ram
FYI dont build it in develop mode, it failed on the last step, starting over again lol
I'm building in wheel mode
it OOMs cuz of many workers so I'm just building with a single worker
probably finishes building tomorrow
some says sglang is faster
in certain models
can vllm serve multiple models simultaneously or dynamically unload/load different models as needed?
I think so with the vllm api but it's not exposed in serverless, you'll have to use pods and directly connect to vllm
Or use ports on serverless
gotcha
well I made some serious progress on getting vLLM working and built for blackwell, but after all that, it seems there's no way to compile xformers to work with torch 2.8.x dev builds so I wont be able to use models like gemma3
I realize that it takes time to develop this stuff, but it's extremely frustrating that nvidia would release a new architecture, charge thousands of dollars for the GPU, and not support this part of the community, especially when it comes to helping develop support for sm120 and cuda 12.8
I'm beyond angry right now, I've been at this since this morning
GitHub
Please support RTX 50XX GPUs · Issue #1856 · unslothai/unsloth
It is very challenging to run on RTX 50XX GPUs on Windows. Are there any good solutions? LLVM ERROR: Cannot select: intrinsic %llvm.nvvm.shfl.sync.bfly.i32. Has anyone encountered this error?

this is like deep into the rabbit hole
have to compile every fking thing
@Sarcagian TORCH_CUDA_ARCH_LIST="12.0" pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
i cant test it cuz i dont have a blackwell gpu
ah i can run it on runpod maybe
rtx5090 i mean, you can use this gpu, it's blackwell too right?
yeah
his gpu is a 5090 so that should work
can just upload the wheel instead of compiling from scratch
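to get a reusable wheel out of it rather than installing in place, probably something like this (sketch, not tested on blackwell):
# build the xformers wheel once on a blackwell box, then just share/install the .whl
TORCH_CUDA_ARCH_LIST="12.0" pip wheel -v --no-deps \
  "git+https://github.com/facebookresearch/xformers.git@main#egg=xformers" -w dist/
pip install dist/xformers-*.whl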
this is weird, I'm trying a 5090 on community cloud, no system logs after like 15 mins ish
and then suddenly web terminal also disconnected, but container logs are there
lol i used it yesterday and it was fine
(with a custom image tho)
i used runpod's pytorch
its not supported
with blackwell
i mean it doesnt work for blackwell cuz
cuda capability issues
hmm? oh
it's 12.8 tho, doesn't that mean it supports blackwell too?
there is an image with cuda 12.8?
oh
Yeah maybe it's a new one
when did it pop up
the reason of life just disappeared
wtf why is there an image with a different torch version than what I've built the wheel for
huh
what do you mean? maybe its a dev build?
i built the thing for torch 2.7.1 but the image is 2.8.0 so i probably cant use the wheel with that image
have to stick with nvidia's bloated image
Oh man, I had just sworn off continuing to pursue this and just waiting for official support. And then you throw this at me lol. Now I'm gonna have to go back at it at least a little today.
I realized though, I've learned a ton of good info I didn't know two days ago throughout trying to solve this problem lol. So not all bad.
nah ur not fully in that rabbit hole
Hahahaha
u have to compile triton and
sglang
I think I just did it haha. I'll post details later. Need sleep.