Casper.
RunPod
Created by Casper. on 1/21/2025 in #⚡|serverless
worker-vllm: Always stops after 60 seconds of streaming
On Serverless, I am hitting an issue where the OpenAI stream stops after 60 seconds while the request keeps running in the deployed vLLM worker. As a result, I do not receive all of the output and the compute is wasted. I need streams longer than 60 seconds because my use case involves generating very long outputs. For now I have resorted to querying api.runpod.ai/v2 directly. That has the benefit of exposing the job_id, which lets me do more with the job, but I would prefer to do this through the OpenAI API.
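For readers who want to see what the direct api.runpod.ai/v2 workaround might look like, here is a minimal sketch. It assumes the endpoint exposes `/run` and `/stream/{job_id}` routes, that `/run` returns a JSON object with an `id` field, and that each `/stream` response carries a `status` string plus a `stream` list of `{"output": ...}` items. The names `stream_job`, `http_json`, and `TERMINAL`, the response shape, and the status names are assumptions for illustration, not something confirmed in this thread.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional

API_BASE = "https://api.runpod.ai/v2"  # host taken from the post above
# Assumed terminal job states; check the API documentation before relying on them.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def http_json(method: str, url: str, api_key: str,
              body: Optional[dict] = None) -> dict:
    """Send one JSON request and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def stream_job(endpoint_id: str, api_key: str, payload: dict,
               request: Callable[..., dict] = http_json,
               poll_interval: float = 1.0) -> Iterator:
    """Submit a job, then poll the stream route until the job finishes."""
    # Submit asynchronously so the job_id is available to the caller's side.
    job = request("POST", f"{API_BASE}/{endpoint_id}/run", api_key,
                  {"input": payload})
    job_id = job["id"]
    while True:
        chunk = request("GET", f"{API_BASE}/{endpoint_id}/stream/{job_id}",
                        api_key)
        # Yield whatever partial outputs arrived since the previous poll.
        for item in chunk.get("stream", []):
            yield item["output"]
        if chunk.get("status") in TERMINAL:
            return
        time.sleep(poll_interval)
```

Because each poll is a short, separate HTTP request, no single connection has to stay open for the whole generation. The `request` parameter is injectable so the loop can be exercised without network access.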
2 replies
RunPod
Created by Casper. on 7/23/2024 in #⛅|pods
Updated Torch templates
Hi RunPod team. I am writing again because the templates on RunPod are out of date once more. We are lacking a torch 2.3 template for ROCm and CUDA. Tomorrow, torch 2.4 is being released as well.
10 replies
RunPod
Created by Casper. on 6/21/2024 in #⛅|pods
PyTorch 2.3: Lacking image on RunPod
Hi RunPod, please add a new PyTorch image for PyTorch 2.3.1.
9 replies
RunPod
Created by Casper. on 6/12/2024 in #⚡|serverless
update worker-vllm to vllm 0.5.0
vLLM just got bumped to 0.5.0, with significant features now ready for production. @Alpay Ariyak FP8 is very significant, but so are speculative decoding and prefix caching.
- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets a 1.5x boost.
- Add OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative decoding and automatic prefix caching are also ready for testing; we plan to turn them on by default in upcoming releases.
4 replies
RunPod
Created by Casper. on 2/28/2024 in #⚡|serverless
worker-vllm build fails
I am getting the following error when building the new worker-vllm image with my model.
=> ERROR [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false if [ -f /run/secrets/HF_TOKEN ]; then export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); fi && if [ -n "Pate 10.5s
------
> [vllm-base 6/7] RUN --mount=type=secret,id=HF_TOKEN,required=false if [ -f /run/secrets/HF_TOKEN ]; then export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); fi && if [ -n "PatentPilotAI/mistral-7b-patent-instruct-v2" ]; then python3 /download_model.py; fi:
#10 9.713 Traceback (most recent call last):
#10 9.713 File "/download_model.py", line 4, in <module>
#10 9.715 from vllm.model_executor.weight_utils import prepare_hf_model_weights, Disabledtqdm
#10 9.715 File "/vllm-installation/vllm/model_executor/__init__.py", line 2, in <module>
#10 9.715 from vllm.model_executor.model_loader import get_model
#10 9.715 File "/vllm-installation/vllm/model_executor/model_loader.py", line 10, in <module>
#10 9.715 from vllm.model_executor.weight_utils import (get_quant_config,
#10 9.715 File "/vllm-installation/vllm/model_executor/weight_utils.py", line 18, in <module>
#10 9.715 from vllm.model_executor.layers.quantization import (get_quantization_config,
#10 9.715 File "/vllm-installation/vllm/model_executor/layers/quantization/__init__.py", line 4, in <module>
#10 9.716 from vllm.model_executor.layers.quantization.awq import AWQConfig
#10 9.716 File "/vllm-installation/vllm/model_executor/layers/quantization/awq.py", line 6, in <module>
#10 9.716 from vllm._C import ops
#10 9.716 ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
30 replies
RunPod
Created by Casper. on 2/6/2024 in #⚡|serverless
2x A100 / 3x 48 GB on Serverless
Hi @flash-singh, a while back we talked about having multiple GPUs on serverless, and then you introduced 2x 48 GB. Now there are larger models out, like Mixtral 8x7B, which requires a minimum of 100 GB, but ideally 120 GB, of VRAM to serve. Do you have any plans to expand capacity to allow for this in your serverless products? Perhaps an easier route is to allow 3x 48 GB GPUs, since that can serve models like Mixtral.
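The capacity argument above comes down to simple arithmetic, sketched below. The 100 GB and 120 GB thresholds and the 48 GB per-GPU figure come from the post; the 80 GB figure for an A100 is an assumption (the card also exists in a 40 GB variant), and the helper names are hypothetical.

```python
MIN_GB = 100    # minimum VRAM for Mixtral 8x7B, as stated in the post
IDEAL_GB = 120  # ideal VRAM for Mixtral 8x7B, as stated in the post

# (gpu_count, gb_per_gpu) for each configuration under discussion.
CONFIGS = {
    "2x 48 GB": (2, 48),
    "3x 48 GB": (3, 48),
    "2x A100": (2, 80),  # assumption: the 80 GB A100 variant
}


def total_vram_gb(gpu_count: int, gb_per_gpu: int) -> int:
    """Aggregate VRAM across GPUs, ignoring per-GPU runtime overhead."""
    return gpu_count * gb_per_gpu


def classify(total_gb: int) -> str:
    """Compare an aggregate VRAM figure against the two thresholds."""
    if total_gb >= IDEAL_GB:
        return "ideal"
    if total_gb >= MIN_GB:
        return "minimum"
    return "too small"


for name, (count, size) in CONFIGS.items():
    total = total_vram_gb(count, size)
    print(f"{name}: {total} GB -> {classify(total)}")
```

Under these assumptions, 2x 48 GB totals 96 GB and falls just short of the stated minimum, while 3x 48 GB (144 GB) and 2x A100 (160 GB) both clear the ideal threshold.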
4 replies
RunPod
Created by Casper. on 2/5/2024 in #⚡|serverless
SGLang worker (similar to worker-vllm)
Recently, some progress has been made on efficiently deploying LLMs and LMMs. SGLang is up to 5x faster than vLLM. @Alpay Ariyak, could we port the worker-vllm setup to SGLang?
https://github.com/sgl-project/sglang
https://lmsys.org/blog/2024-01-17-sglang/
11 replies
RunPod
Created by Casper. on 1/30/2024 in #⚡|serverless
worker-vllm cannot download private model
I built my image successfully, and the model was downloaded during the build. However, when I deploy it on RunPod Serverless, it fails to start up on the first request because it cannot download the model.
export DOCKER_BUILDKIT=1
export HF_TOKEN="your_token"

docker build -t user/app:0.0.1 \
--secret id=HF_TOKEN \
--build-arg MODEL_NAME="my_model_path" \
./worker-vllm
57 replies
RunPod
Created by Casper. on 1/7/2024 in #⚡|serverless
Delay on startup: How long for low usage?
I am trying to gauge the actual cold start time for a 7B LLM deployed with vLLM. My ideal configuration is something like this: 0 active workers, 5 requests/hour, and between 100 and 200 seconds of generation time per request. How long would a cold start on RunPod take, including delay time and everything else? Essentially, what are the min, avg, and max time to first generated token?
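One way to answer this empirically is to time the first streamed item over several cold requests and then summarize the measurements. The sketch below is a generic timing helper, not a RunPod API; `time_to_first_item` and `summarize` are hypothetical names, and the token stream is assumed to be any Python iterator that yields tokens as they arrive.

```python
import time
from statistics import mean
from typing import Dict, Iterable, List


def time_to_first_item(stream: Iterable) -> float:
    """Return the seconds elapsed until the stream yields its first item."""
    start = time.monotonic()
    for _ in stream:
        return time.monotonic() - start
    raise ValueError("stream ended without yielding any item")


def summarize(samples_s: List[float]) -> Dict[str, float]:
    """Reduce time-to-first-token samples (seconds) to min, avg, and max."""
    if not samples_s:
        raise ValueError("need at least one sample")
    return {"min": min(samples_s), "avg": mean(samples_s), "max": max(samples_s)}
```

Collecting one sample per request after the worker has scaled back down to zero would give the cold-start distribution, since the measured interval then includes queue delay, container start, model load, and prefill.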
2 replies