octopus
RunPod
Created by octopus on 12/31/2024 in #⚡|serverless
Settings to reduce delay time using sglang for 4bit quantized models?
I'm deploying a 4-bit AWQ quantized model: casperhansen/llama-3.3-70b-instruct-awq. The delay time for parallel requests increases exponentially when using the sglang template. What settings do I need to use to keep the delay time manageable?
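A minimal measurement sketch (not from the post; it assumes a RunPod API key, an endpoint ID, and a prompt-style input schema) for checking how the delay and execution times reported by the /runsync API grow as the number of parallel requests increases:

# Hypothetical benchmark: fire N requests at a RunPod serverless endpoint in
# parallel and read back the reported timings. ENDPOINT_ID, API_KEY, and the
# "input" payload shape are placeholders for your own deployment.
import concurrent.futures
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def one_request(i: int) -> dict:
    payload = {"input": {"prompt": f"Test prompt {i}", "max_tokens": 64}}
    resp = requests.post(URL, headers=HEADERS, json=payload, timeout=300)
    body = resp.json()
    # delayTime / executionTime (milliseconds) are reported with the job status;
    # field names may differ slightly depending on the API version.
    return {"delay_ms": body.get("delayTime"), "exec_ms": body.get("executionTime")}

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(one_request, range(16)))

for r in results:
    print(r)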
2 replies
RunPod
Created by octopus on 12/25/2024 in #⚡|serverless
Huggingface space on Serverless. How to get the Gradio API string which is the same as Worker ID?
I deployed a Huggingface Space which uses Gradio. If I have a worker ID, I can usually connect to the worker at https://${workerID}-proxy.runpod.net/. How can I either get the available worker IDs, or forward my request from the serverless endpoint to the Gradio API, which is called like this:
import { Client } from "@gradio/client";

// Fetch an example image to send to the Space
const response_0 = await fetch("https://raw.githubusercontent.com/gradio-app/gradio/main/test/test_files/bus.png");
const exampleImage = await response_0.blob();

// Note the backticks: a template literal is needed so ${workerID} is interpolated
const client = await Client.connect(`https://${workerID}-7860.proxy.runpod.net/`);
const result = await client.predict("/stream_chat", {
  input_image: exampleImage,
});

console.log(result.data);
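One possible pattern (a sketch, not a confirmed solution): since the serverless handler runs in the same container as the Gradio app, it can forward requests to Gradio on localhost and avoid needing the worker ID at all. The /stream_chat route comes from the snippet above; the image_url input field is an assumption.

# Sketch of a handler that forwards serverless requests to the local Gradio app.
import runpod
from gradio_client import Client, handle_file

_gradio = None

def handler(event):
    global _gradio
    if _gradio is None:
        # The Gradio app started by the Space runs in the same worker container,
        # so it is reachable on localhost rather than through the proxy URL.
        _gradio = Client("http://127.0.0.1:7860/")
    image_url = event["input"]["image_url"]   # assumed input schema
    result = _gradio.predict(
        handle_file(image_url),               # let gradio_client handle the file transfer
        api_name="/stream_chat",
    )
    return {"output": result}

runpod.serverless.start({"handler": handler})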
8 replies
RunPod
Created by octopus on 12/3/2024 in #⚡|serverless
Not getting 100s of req/sec serving for Llama 3 70B models with default vLLM serverless template
I'm deploying Llama 70B models without quantization on 2x80GB workers, but after about 10 parallel requests the execution and delay times increase to 10-50 sec. I'm not sure if I'm doing something wrong with my setup. I pretty much use the default vLLM template setup, just setting MAX_MODEL_LEN to 4096 and ENFORCE_EAGER to true.
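For reference, a rough offline sanity check (a sketch under the assumption that the template's MAX_MODEL_LEN and ENFORCE_EAGER env vars map to the vLLM engine arguments of the same name; the model name is a stand-in):

# Build the engine with the described settings on a pod with the same 2x80GB GPUs
# and run a batch of prompts in one pass, to separate engine behaviour from the
# serverless queueing layer.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # stand-in model name
    max_model_len=4096,          # what MAX_MODEL_LEN is assumed to set
    enforce_eager=True,          # what ENFORCE_EAGER is assumed to set
    tensor_parallel_size=2,      # spread the 70B weights over both GPUs
)

prompts = [f"Question {i}: summarize vLLM in one sentence." for i in range(32)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text[:80])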
1 reply
RunPod
Created by octopus on 11/13/2024 in #⚡|serverless
What is the real Serverless price?
In Serverless I have 2 GPUs per worker and 1 active worker. The price shown on the main page is $0.00046/s, but the endpoint edit page shows $0.00152/s. What is the actual price?
17 replies
RunPod
Created by octopus on 9/20/2024 in #⚡|serverless
Llama-70B 3.1 execution and queue delay time much larger than 3.0. Why?
I deployed these two models, which seem to use the same techniques, on the same machine (2x80GB), but the execution time and queue delay differ massively:
Queue delay: Llama 70B 3.0: 0.02 secs; Llama 70B 3.1: 0.15 secs
Execution time: Llama 70B 3.0: 0.65 secs; Llama 70B 3.1: 3 secs
Models:
Llama 70B 3.0: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5
Llama 70B 3.1: https://huggingface.co/mlabonne/Llama-3.1-70B-Instruct-lorablated
2 replies
RunPod
Created by octopus on 7/24/2024 in #⚡|serverless
Guide to deploy Llama 405B on Serverless?
Hi, can any experts on Serverless advise on how to deploy Llama 405B on Serverless?
51 replies
RunPod
Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
vLLM has the parameter TENSOR_PARALLEL_SIZE to distribute a model across multiple GPUs, but is this parameter supported in the serverless vLLM template? I tried setting it, but the inference time was the same for the model running on a single GPU vs. multiple GPUs.
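A hedged sketch of what the worker side would look like, assuming the template passes TENSOR_PARALLEL_SIZE through to the engine: the setting only helps if the worker actually has that many GPUs attached, which is worth asserting before the engine is built.

# Read the env var, confirm enough GPUs are visible, then build the engine.
import os
import torch
from vllm import LLM

tp_size = int(os.environ.get("TENSOR_PARALLEL_SIZE", "1"))
visible = torch.cuda.device_count()
assert visible >= tp_size, f"requested TP={tp_size} but only {visible} GPU(s) visible"

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # stand-in model name
    tensor_parallel_size=tp_size,
)
print(f"engine built with tensor_parallel_size={tp_size} across {visible} GPU(s)")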
10 replies
RunPod
Created by octopus on 6/11/2024 in #⚡|serverless
Cannot run Cmdr+ on serverless, CohereForCausalLM not supported
I'm getting this error for all Cmdr+ models on serverless:
Error initializing vLLM engine: Model architectures ['CohereForCausalLM'] are not supported for now.
Although the vLLM issues indicate that CohereForCausalLM is supported.
8 replies
RunPod
Created by octopus on 6/10/2024 in #⚡|serverless
What quantization for Cmdr+ using vLLM worker?
I'm trying to set up these Cmdr+ models on serverless using the vLLM worker, but the only quantization options I see are SqueezeLLM, AWQ, and GPTQ. Which quantization should I set when starting these models? https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit and https://huggingface.co/turboderp/command-r-plus-103B-exl2
12 replies
RunPod
Created by octopus on 5/21/2024 in #⚡|serverless
Plans to support 400B models like llama 3?
Is RunPod thinking about how they will support very large LLMs like the 400B Llama model that is expected to be released later this year?
12 replies
RunPod
Created by octopus on 2/29/2024 in #⚡|serverless
Serverless calculating capacity & ideal request count vs. queue delay values
How do you calculate whether a serverless worker is reaching its capacity, and what values should I set for request count? One of my serverless workers in production, which runs regular Oobabooga (not vLLM, so no concurrency), reached 110k requests yesterday without starting a new worker. From my observations, my context is usually around 1000 input tokens and 10-70 output tokens, which usually takes 2-5 secs per request. Even at 1 sec execution time per request, it should only be able to handle 86,400 requests per day. How is it handling more without increasing the worker count, especially when each request takes 2-5 secs?
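The arithmetic in the question, spelled out: a strictly serial worker at 2-5 secs per request tops out at roughly 17k-43k requests per day, so 110k per day implies either overlapping requests or a much shorter effective per-request time.

# Worked version of the numbers from the post.
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
requests_per_day = 110_000

# Serial ceiling for one worker handling one request at a time:
for per_request_s in (1, 2, 5):
    print(f"{per_request_s}s/request -> max {SECONDS_PER_DAY // per_request_s:,} req/day")

# Average rate actually observed, and the per-request time it implies if the
# worker really were strictly serial:
avg_rate = requests_per_day / SECONDS_PER_DAY              # ~1.27 req/s
implied_serial_time = SECONDS_PER_DAY / requests_per_day   # ~0.79 s/request
print(f"observed: {avg_rate:.2f} req/s, i.e. {implied_serial_time:.2f} s/request if serial")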
4 replies
RunPod
Created by octopus on 2/26/2024 in #⚡|serverless
Help: Serverless Mixtral OutOfMemory Error
I can't get Mixtral-8x7B-Instruct to run on Serverless using the vLLM RunPod Worker, neither the model from Mistral nor any of the quantized models.
Settings I'm using:
GPU: 48GB (also tried 80GB)
Container Image: runpod/worker-vllm:0.3.0-cuda11.8.0
Env 1: MODEL_NAME=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ (also tried: casperhansen/mixtral-instruct-awq, TheBloke/firefly-mixtral-8x7b-GPTQ, and mistralai/Mixtral-8x7B-Instruct-v0.1)
Env 2: TRUST_REMOTE_CODE=1
Env 3: QUANTIZATION=awq (or gptq for GPTQ models)
What am I doing wrong?? @Alpay Ariyak
ERROR Log:
WARNING 02-26 18:08:57 config.py:186] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
llm_engine.py:79] Initializing an LLM engine with config: model='casperhansen/mixtral-instruct-awq', tokenizer='casperhansen/mixtral-instruct-awq', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Using model weights format ['*.safetensors']
llm_engine.py:337] # GPU blocks: 7488, # CPU blocks: 2048
Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Error initializing vLLM engine: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
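Illustrative only (the values below are guesses, not a verified fix): the log itself points at the usual vLLM memory levers, which in engine-argument form look like the sketch below; the worker exposes some of these via env vars such as MAX_MODEL_LEN and ENFORCE_EAGER.

# Sketch of the memory-related engine arguments suggested by the log.
from vllm import LLM

llm = LLM(
    model="casperhansen/mixtral-instruct-awq",
    quantization="awq",
    max_model_len=8192,           # Mixtral defaults to 32768; a shorter limit eases memory pressure
    enforce_eager=True,           # skip CUDA graph capture (the log notes this can cost ~1-3 GiB/GPU)
    max_num_seqs=32,              # fewer concurrent sequences -> smaller KV cache demand
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
)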
48 replies
RunPod
Created by octopus on 2/26/2024 in #⚡|serverless
Can we add minimum GPU configs required for running the popular models like Mistral, Mixtral?
I'm trying to find out what serverless GPU configs are required to run Mixtral 8x7B-Instruct, either quantized (https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ) or the original from Mistral. It would be good to have this info in the README of the vLLM Worker repo. I run into OutOfMemory issues when trying it on a 48GB GPU.
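A back-of-the-envelope estimate using Mixtral 8x7B's published architecture (about 46.7B parameters, 32 layers, 8 KV heads, head dim 128); treat it as a ballpark, not an official sizing guide.

# Rough lower bound on memory: quantized vs. fp16 weights plus per-sequence KV cache.
params_b = 46.7e9                 # total parameters in Mixtral 8x7B

weights_4bit_gb = params_b * 0.5 / 1e9      # ~23 GB for 4-bit (GPTQ/AWQ) weights
weights_fp16_gb = params_b * 2.0 / 1e9      # ~93 GB for fp16 weights

# KV cache per token (fp16): 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
layers, kv_heads, head_dim = 32, 8, 128
kv_per_token_kb = 2 * layers * kv_heads * head_dim * 2 / 1024      # 128 KiB/token
kv_32k_gb = kv_per_token_kb * 32768 / (1024 * 1024)                # ~4 GiB per full-context sequence

print(f"4-bit weights ~{weights_4bit_gb:.0f} GB, fp16 weights ~{weights_fp16_gb:.0f} GB")
print(f"KV cache at 32k context ~{kv_32k_gb:.1f} GiB per sequence")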
15 replies