octopus
RunPod
•Created by octopus on 11/13/2024 in #⚡|serverless
What is the real Serverless price?
In Serverless I have 2 GPUs/worker and 1 active worker. The price shown on the main page is $0.00046/s, but the endpoint edit page shows $0.00152/s. Which is the actual price?
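A rough guess at how the two figures might relate, assuming the main page shows a discounted per-GPU active-worker rate while the endpoint edit page shows the per-GPU flex rate multiplied by the 2 GPUs per worker; the rates and discount below are illustrative assumptions, not confirmed pricing:
```python
# Hypothetical back-of-the-envelope check of the two displayed prices.
# All rates below are assumptions for illustration, not confirmed RunPod pricing.
flex_rate_per_gpu = 0.00076      # $/s per GPU, assumed flex (non-active) rate
active_discount = 0.40           # assumed discount applied to active workers
gpus_per_worker = 2

active_rate_per_gpu = flex_rate_per_gpu * (1 - active_discount)
worker_flex_rate = flex_rate_per_gpu * gpus_per_worker

print(f"per-GPU active rate:  ${active_rate_per_gpu:.5f}/s")   # ~0.00046
print(f"per-worker flex rate: ${worker_flex_rate:.5f}/s")      # ~0.00152
```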
16 replies
RunPod
•Created by octopus on 9/20/2024 in #⚡|serverless
Llama-70B 3.1 execution and queue delay time much larger than 3.0. Why?
I deployed these two models, which seem to use the same techniques. I'm using the same machine (2x80GB), but the execution and queue delay times differ massively (a timing sketch follows the model links below):
Queue delay:
Llama70B 3.0: 0.02 secs
Llama70B 3.1: 0.15 secs
Execution time:
Llama70B 3.0: 0.65 secs
Llama70B 3.1: 3 secs
Models:
Llama 70B 3.0: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
Llama 70B 3.1: https://huggingface.co/mlabonne/Llama-3.1-70B-Instruct-lorablated
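A minimal sketch of comparing the two endpoints side by side, assuming hypothetical endpoint IDs and that the serverless /runsync response carries delayTime and executionTime fields (in milliseconds); field names are based on the responses I see and may differ:
```python
# Hedged sketch: send the same prompt to both endpoints and compare the
# delayTime / executionTime fields returned by the /runsync serverless API.
# ENDPOINT_ID_* are placeholders; the field names are assumptions.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINTS = {
    "llama-3.0-70b": "ENDPOINT_ID_30",   # placeholder IDs
    "llama-3.1-70b": "ENDPOINT_ID_31",
}
payload = {"input": {"prompt": "Hello", "max_tokens": 32}}

for name, endpoint_id in ENDPOINTS.items():
    resp = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=120,
    )
    data = resp.json()
    print(name, "delay(ms):", data.get("delayTime"),
          "exec(ms):", data.get("executionTime"))
```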
2 replies
RunPod
•Created by octopus on 7/24/2024 in #⚡|serverless
Guide to deploy Llama 405B on Serverless?
Hi, can any Serverless experts advise on how to deploy Llama 405B on Serverless?
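Not a full guide, but some back-of-the-envelope VRAM arithmetic shows why 405B needs either an FP8/4-bit checkpoint or a very large multi-GPU worker; the numbers are approximate and cover weights only:
```python
# Rough VRAM arithmetic for Llama 3.1 405B (weights only; KV cache and
# activation overhead come on top). Numbers are approximate.
PARAMS = 405e9

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1e9

print(f"fp16/bf16 weights: ~{weight_gb(2):.0f} GB")    # ~810 GB
print(f"fp8 weights:       ~{weight_gb(1):.0f} GB")    # ~405 GB
print(f"int4 weights:      ~{weight_gb(0.5):.0f} GB")  # ~202 GB

# So even 8 x 80 GB (640 GB total) only fits the weights if you use an
# FP8 or 4-bit checkpoint, leaving the remainder for the KV cache.
```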
51 replies
RunPod
•Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
vLLM has a TENSOR_PARALLEL_SIZE parameter to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.
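For comparison, this is how tensor parallelism is set when driving vLLM directly in Python (the model name is a placeholder); in the serverless worker the equivalent is the TENSOR_PARALLEL_SIZE env var, and whether a given image version honours it is exactly the question here:
```python
# Plain vLLM usage with tensor parallelism across 2 GPUs (placeholder model).
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/your-70b-model",   # placeholder
    tensor_parallel_size=2,            # shard the weights across 2 GPUs
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```
Worth noting: tensor parallelism mainly lets a model that doesn't fit on one GPU run at all; for a model that already fits, single-request latency often improves only modestly, which could also explain seeing similar inference times.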
10 replies
RunPod
•Created by octopus on 6/11/2024 in #⚡|serverless
Cannot run Cmdr+ on serverless, CohereForCausalLM not supported
I'm getting this error for all Cmdr+ models on serverless:
However, the vLLM issue tracker indicates that CohereForCausalLM is supported.
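A quick, hedged way to narrow this down: Command-R+ support landed in newer vLLM releases, so the vLLM version pinned inside the worker image is the usual suspect. The snippet below just prints the installed vLLM version and the architecture the model declares (it assumes a transformers version recent enough to read the Cohere config):
```python
# Print the vLLM version inside the container and the architecture name
# the model declares, to see whether the two can match up.
import importlib.metadata
from transformers import AutoConfig

print("vllm version:", importlib.metadata.version("vllm"))

cfg = AutoConfig.from_pretrained(
    "CohereForAI/c4ai-command-r-plus", trust_remote_code=True
)
print("declared architectures:", cfg.architectures)  # e.g. ['CohereForCausalLM']
```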
8 replies
RunPod
•Created by octopus on 6/10/2024 in #⚡|serverless
What quantization for Cmdr+ using vLLM worker?
I'm trying to set up these Cmdr+ models on serverless using the vLLM worker, but the only quantization options I see are SqueezeLLM, AWQ, and GPTQ. Which quantization should I set when starting these models (a config check is sketched after the links)?:
https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit
and
https://huggingface.co/turboderp/command-r-plus-103B-exl2
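A small sketch for checking what quantization a repo actually declares before picking a QUANTIZATION value; as far as I can tell the CohereForAI 4-bit repo is bitsandbytes and the turboderp repo is EXL2 format, neither of which matches the SqueezeLLM/AWQ/GPTQ options:
```python
# Hedged helper: read a repo's config.json from the Hub and report its
# declared quantization method (if any).
import json
from huggingface_hub import hf_hub_download

def quant_method(repo_id: str) -> str:
    path = hf_hub_download(repo_id, "config.json")
    with open(path) as f:
        cfg = json.load(f)
    return cfg.get("quantization_config", {}).get("quant_method", "none declared")

for repo in ["CohereForAI/c4ai-command-r-plus-4bit",
             "turboderp/command-r-plus-103B-exl2"]:
    print(repo, "->", quant_method(repo))
```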
12 replies
RunPod
•Created by octopus on 5/21/2024 in #⚡|serverless
Plans to support 400B models like llama 3?
Is RunPod thinking about how it will support very large LLMs like the 400B Llama model that is expected to release later this year?
12 replies
RunPod
•Created by octopus on 2/29/2024 in #⚡|serverless
Serverless calculating capacity & ideal request count vs. queue delay values
How do you calculate whether a serverless worker is reaching its capacity, and what values should be set for request count? One of my serverless workers in production, running regular Oobabooga (not vLLM, so no concurrency), reached 110k requests per day yesterday without starting a new worker.
From my observations, my context is usually about 1000 input tokens and 10-70 output tokens, which usually takes 2-5 secs per request. Even at 1 sec execution time per request, it should only have been able to handle 86,400 requests per day.
How is it handling more without increasing the worker count, especially when each request takes 2-5 secs?
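The arithmetic for a single worker with no concurrency, for reference:
```python
# Back-of-the-envelope throughput check for a single worker with no
# concurrency: requests/day it could serve at a given per-request latency.
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400

for latency_s in (1.0, 2.0, 5.0):
    max_requests = SECONDS_PER_DAY / latency_s
    print(f"{latency_s:.0f}s per request -> ~{max_requests:,.0f} requests/day")

# ~86,400 at 1s, ~43,200 at 2s, ~17,280 at 5s -- so 110k/day on one worker
# only adds up if requests overlap (concurrency) or the measured 2-5s
# includes time that is not actually exclusive worker execution.
```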
4 replies
RunPod
•Created by octopus on 2/26/2024 in #⚡|serverless
Help: Serverless Mixtral OutOfMemory Error
I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod worker, neither with the base model from Mistral nor with any of the quantized models.
Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
(also tried casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for the GPTQ models)
What am I doing wrong?? @Alpay Ariyak
ERROR Log:
48 replies
RunPod
•Created by octopus on 2/26/2024 in #⚡|serverless
Can we add the minimum GPU configs required for running popular models like Mistral and Mixtral?
I'm trying to find out what serverless GPU configs are required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the original from Mistral. It would be good to have this info in the README of the vLLM worker repo.
I run into OutOfMemory issues when trying it on a 48GB GPU.
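For what it's worth, rough weight-only arithmetic (ignoring KV cache and runtime overhead) shows why fp16 Mixtral cannot fit a single 48 GB or 80 GB GPU while a 4-bit quant can:
```python
# Rough weight-memory arithmetic for Mixtral 8x7B (~46.7B params total),
# ignoring KV cache and runtime overhead, which also need several GB.
PARAMS = 46.7e9

for name, bytes_per_param in [("fp16/bf16", 2.0), ("4-bit GPTQ/AWQ", 0.5)]:
    print(f"{name}: ~{PARAMS * bytes_per_param / 1e9:.0f} GB of weights")

# fp16/bf16: ~93 GB -> does not fit on a single 48 GB or 80 GB GPU
# 4-bit:     ~23 GB -> fits on 48 GB, but only with headroom left for the
#                      KV cache (context length matters here)
```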
15 replies