Casper.
RunPod
Created by Casper. on 6/21/2024 in #⛅|pods
PyTorch 2.3: Lacking image on RunPod
Hi RunPod, please add a new PyTorch image for PyTorch 2.3.1.
9 replies
RunPod
Created by Casper. on 6/12/2024 in #⚡|serverless
update worker-vllm to vllm 0.5.0
vLLM just got bumped to 0.5.0 with significant features ready for production. @Alpay Ariyak FP8 is very significant, but so are speculative decoding and prefix caching. From the release notes:
- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets roughly a 1.5x boost.
- OpenAI Vision API support has been added; currently only LLaVA and LLaVA-NeXT are supported.
- Speculative decoding and automatic prefix caching are also ready for testing, with plans to turn them on by default in upcoming releases.
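For context, this is roughly how those 0.5.0 features surface in vLLM's offline Python API (a minimal sketch, not the worker-vllm code path; the model name is a placeholder and the speculative-decoding arguments are left as comments because their exact names should be checked against the 0.5.0 docs):

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    quantization="fp8",          # FP8 weight quantization (testing-stage in 0.5.0)
    enable_prefix_caching=True,  # automatic prefix caching
    # Speculative decoding is configured through extra engine args such as
    # speculative_model=... and num_speculative_tokens=...; requirements may
    # differ per release, so they are omitted from this sketch.
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)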
4 replies
RunPod
Created by Casper. on 2/28/2024 in #⚡|serverless
worker-vllm build fails
I am getting the following error when building the new worker-vllm image with my model.
=> ERROR [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false if [ -f /run/secrets/HF_TOKEN ]; then export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); fi && if [ -n "Pate 10.5s
------
> [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false if [ -f /run/secrets/HF_TOKEN ]; then export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); fi && if [ -n "PatentPilotAI/mistral-7b-patent-instruct-v2" ]; then python3 /download_model.py; fi:
#10 9.713 Traceback (most recent call last):
#10 9.713 File "/download_model.py", line 4, in <module>
#10 9.715 from vllm.model_executor.weight_utils import prepare_hf_model_weights, Disabledtqdm
#10 9.715 File "/vllm-installation/vllm/model_executor/__init__.py", line 2, in <module>
#10 9.715 from vllm.model_executor.model_loader import get_model
#10 9.715 File "/vllm-installation/vllm/model_executor/model_loader.py", line 10, in <module>
#10 9.715 from vllm.model_executor.weight_utils import (get_quant_config,
#10 9.715 File "/vllm-installation/vllm/model_executor/weight_utils.py", line 18, in <module>
#10 9.715 from vllm.model_executor.layers.quantization import (get_quantization_config,
#10 9.715 File "/vllm-installation/vllm/model_executor/layers/quantization/__init__.py", line 4, in <module>
#10 9.716 from vllm.model_executor.layers.quantization.awq import AWQConfig
#10 9.716 File "/vllm-installation/vllm/model_executor/layers/quantization/awq.py", line 6, in <module>
#10 9.716 from vllm._C import ops
#10 9.716 ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
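The traceback shows the model-baking step importing vllm._C, which is linked against libcuda.so.1 from the NVIDIA driver, and that library is typically absent on a CPU-only Docker build host. One possible workaround (a sketch on my side, not the official worker-vllm fix) is to fetch the weights with huggingface_hub alone, so vLLM is never imported at build time:

import os
from huggingface_hub import snapshot_download

# Download the repo contents without importing vLLM; the token comes from the
# build secret exported earlier in the same RUN step.
snapshot_download(
    repo_id="PatentPilotAI/mistral-7b-patent-instruct-v2",
    token=os.environ.get("HF_TOKEN"),
)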
30 replies
RunPod
Created by Casper. on 2/6/2024 in #⚡|serverless
2x A100 / 3x 48 GB on Serverless
Hi @flash-singh, a while back we talked about having multiple GPUs on serverless, and then you introduced 2x 48 GB. Now there are larger models out, like Mixtral 8x7B, which requires a minimum of 100 GB, and ideally 120 GB, of VRAM to serve. Do you have any plans to expand capacity to allow for this in your serverless products? Perhaps an easier route is to allow 3x 48 GB GPUs, since that could serve models like Mixtral.
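The 100-120 GB figure follows from back-of-the-envelope math on Mixtral 8x7B's roughly 47B total parameters; a quick sanity check (the KV-cache and overhead numbers below are assumptions, not measurements):

params_b = 46.7            # Mixtral 8x7B total parameters, in billions (all experts)
weights_gb = params_b * 2  # ~2 bytes/param at fp16/bf16 -> ~93 GB of weights
kv_cache_gb = 15           # assumed KV-cache budget for serving
overhead_gb = 10           # assumed CUDA context / activation overhead
print(f"estimated total: ~{weights_gb + kv_cache_gb + overhead_gb:.0f} GB")  # ~118 GB
print(f"3 x 48 GB = {3 * 48} GB, 2 x 80 GB (A100) = {2 * 80} GB")            # both would fit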
4 replies
RunPod
Created by Casper. on 2/5/2024 in #⚡|serverless
SGLang worker (similar to worker-vllm)
Recently, some progress has been made on efficiently deploying LLMs and LMMs. SGLang is reportedly up to 5x faster than vLLM. @Alpay Ariyak, could we port the worker-vllm setup to SGLang? https://github.com/sgl-project/sglang https://lmsys.org/blog/2024-01-17-sglang/
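To illustrate what porting would involve, here is a rough sketch of an SGLang-backed worker mirroring worker-vllm's shape: a RunPod serverless handler that forwards prompts to a locally launched SGLang server. The /generate endpoint, payload fields, and port are assumptions based on SGLang's docs at the time and would need to be verified against the pinned SGLang version:

import requests
import runpod

# Assumed address of an SGLang server started at container boot, e.g. via
# `python -m sglang.launch_server --model-path <model> --port 30000`.
SGLANG_URL = "http://127.0.0.1:30000/generate"

def handler(job):
    prompt = job["input"]["prompt"]
    resp = requests.post(SGLANG_URL, json={
        "text": prompt,
        "sampling_params": {"max_new_tokens": 256, "temperature": 0.7},
    })
    return resp.json()

runpod.serverless.start({"handler": handler})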
11 replies
RunPod
Created by Casper. on 1/30/2024 in #⚡|serverless
worker-vllm cannot download private model
I built the image with my model successfully, and it downloaded the model during the build. However, when I deploy it on RunPod Serverless, it fails to start up when it receives a request because it cannot download the model.
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token"

docker build -t user/app:0.0.1 \
--secret id=HF_TOKEN \
--build-arg MODEL_NAME="my_model_path" \
./worker-vllm
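A hedged guess at the cause: --secret makes HF_TOKEN available only during the build, so the token is not baked into the image and the deployed worker has no credentials when it contacts the Hub again at startup. Setting HF_TOKEN as an environment variable on the serverless endpoint (in the endpoint settings, not the Dockerfile) and checking it from inside the worker is one way to confirm; the snippet below is illustrative only:

import os
from huggingface_hub import whoami

token = os.environ.get("HF_TOKEN")
# whoami() raises if the token is invalid; a missing token means the endpoint
# env var was never set.
print(whoami(token=token) if token else "HF_TOKEN is not set at runtime")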
57 replies
RunPod
Created by Casper. on 1/7/2024 in #⚡|serverless
Delay on startup: How long for low usage?
I am trying to gauge the actual cold start for a 7B LLM deployed with vLLM. My ideal configuration is something like this: 0 active workers, 5 requests/hour, and generation times of roughly 100-200 seconds per request. How long would a cold start take on RunPod, including delay time and everything? Essentially, what is the min, avg, and max time to the first token generated?
2 replies