octopus
RunPod
Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
vLLM has a TENSOR_PARALLEL_SIZE parameter to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.
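For context, a minimal sketch of how that setting would be passed through to vLLM itself (the env var handling and the placeholder model below are assumptions, not taken from the worker source). Tensor parallelism shards the weight matrices across GPUs, so for a model that already fits on one GPU the extra communication can offset any speedup, which may explain the similar timings.

# Hypothetical mapping of TENSOR_PARALLEL_SIZE onto vLLM's offline LLM class.
import os
from vllm import LLM, SamplingParams

tp = int(os.environ.get("TENSOR_PARALLEL_SIZE", "1"))
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model, not from the question
    tensor_parallel_size=tp,                     # shards weight matrices across `tp` GPUs
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)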
10 replies
RunPod
Created by octopus on 6/11/2024 in #⚡|serverless
Cannot run Cmdr+ on serverless, CohereForCausalLM not supported
I'm getting this error for all Cmdr+ models on serverless:
Error initializing vLLM engine: Model architectures ['CohereForCausalLM'] are not supported for now.
However, the vLLM issue tracker indicates that CohereForCausalLM is supported.
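A quick way to check whether the vLLM build inside the worker image actually registers the architecture is sketched below (the ModelRegistry API is an assumption here and has moved between vLLM versions; CohereForCausalLM support only landed in relatively recent releases, so an older bundled vLLM would explain the error).

# Check the installed vLLM version and its registered model architectures.
import vllm
from vllm import ModelRegistry  # registry location/API may differ in older vLLM versions

print("vLLM version:", vllm.__version__)
print("CohereForCausalLM supported:", "CohereForCausalLM" in ModelRegistry.get_supported_archs())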
8 replies
RunPod
Created by octopus on 6/10/2024 in #⚡|serverless
What quantization for Cmdr+ using vLLM worker?
I'm trying to set up these Cmdr+ models on serverless using the vLLM worker, but the only quantization options I see are SqueezeLLM, AWQ and GPTQ. Which quantization should I set when starting these models? https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit and https://huggingface.co/turboderp/command-r-plus-103B-exl2
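One way to see which method a given checkpoint was actually quantized with is to read its quantization_config and then pick the matching worker setting, if vLLM supports that method at all. A small sketch, assuming transformers is available and using the two repos from the question:

# Print the quantization metadata stored in each checkpoint's config, if any.
from transformers import AutoConfig

for repo in ("CohereForAI/c4ai-command-r-plus-4bit", "turboderp/command-r-plus-103B-exl2"):
    cfg = AutoConfig.from_pretrained(repo, trust_remote_code=True)
    print(repo, "->", getattr(cfg, "quantization_config", "no quantization_config found"))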
12 replies
RunPod
Created by octopus on 5/21/2024 in #⚡|serverless
Plans to support 400B models like llama 3?
Is RunPod thinking about how it will support very large models, like the 400B Llama model that is expected to release later this year?
12 replies
RunPod
Created by octopus on 2/29/2024 in #⚡|serverless
Serverless calculating capacity & ideal request count vs. queue delay values
How do you calculate whether a serverless worker is reaching its capacity, and what values should be set for request count vs. queue delay? One of my serverless workers in production, running regular Oobabooga (not vLLM, so no concurrency), handled 110k requests yesterday without starting a new worker. From what I observe, requests are usually around 1000 input tokens and 10-70 output tokens and take 2-5 seconds each. Even at 1 second of execution time per request, a single worker should only be able to handle 86,400 requests per day. How is it handling more without increasing the worker count, especially when each request takes 2-5 seconds?
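The arithmetic in the question can be made explicit: a strictly sequential worker tops out at 86,400 one-second requests per day, so 110k requests at 2-5 seconds each implies some overlap (batching, pipelining, or more than one request in flight) somewhere.

# Back-of-the-envelope check: how much average concurrency would 110k requests/day imply?
seconds_per_day = 24 * 60 * 60          # 86,400: the sequential ceiling at 1 s per request
observed_requests = 110_000
for secs_per_request in (1, 2, 5):
    implied_concurrency = observed_requests * secs_per_request / seconds_per_day
    print(f"{secs_per_request} s/request -> ~{implied_concurrency:.1f} requests in flight on average")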
4 replies
RunPod
Created by octopus on 2/26/2024 in #⚡|serverless
Help: Serverless Mixtral OutOfMemory Error
I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither with the base model from Mistral nor with any of the quantized models.
Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ (also tried: casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for the GPTQ models)
What am I doing wrong? @Alpay Ariyak
ERROR Log:
WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
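The log itself points at the usual memory levers (max_seq_len=32768, enforce_eager=False, and CUDA graph capture right before the OOM). A minimal sketch at the vLLM engine level, using one of the models from the question; these are engine arguments, not necessarily the worker's exact environment variable names:

# Sketch: engine-level settings that reduce the memory the log warns about.
from vllm import LLM

llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    max_model_len=8192,            # the default 32768 context reserves a much larger KV cache
    gpu_memory_utilization=0.90,
    enforce_eager=True,            # skips CUDA graph capture, the step that OOMs in the log
    trust_remote_code=True,
)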
48 replies
RunPod
Created by octopus on 2/26/2024 in #⚡|serverless
Can we add the minimum GPU configs required for running popular models like Mistral and Mixtral?
I'm trying to find out what serverless GPU config is required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the main model from Mistral. It would be good to have this info in the README of the vLLM Worker repo. I run into OutOfMemory issues when trying it on a 48GB GPU.
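As a rough sizing guide, weights-only VRAM can be estimated from the parameter count; the figures below are approximations, and KV cache, activations, and CUDA graphs come on top, which is why a 48GB card can still run out of memory at a 32k context.

# Rough weights-only VRAM estimate for Mixtral-8x7B (~46.7B total parameters).
params_billion = 46.7
for label, bytes_per_param in (("fp16", 2.0), ("4-bit GPTQ/AWQ (incl. overhead)", 0.6)):
    print(f"{label}: ~{params_billion * bytes_per_param:.0f} GB of weights")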
15 replies