richterscale9
RunPod
•Created by richterscale9 on 5/13/2024 in #⚡|serverless
Serverless vLLM doesn't work and gives no error message
GPTQ is 4-bit quantized and the same model fits in 12GB VRAM in oobabooga
RunPod
•Created by richterscale9 on 5/13/2024 in #⚡|serverless
Serverless vLLM doesn't work and gives no error message
Regarding 3., I finally got the "bad cold start" I'd been trying to trigger in my experiments. It was 45 seconds. I assume this means my container and weights were already on disk on more than one host, and the 45 seconds went into spinning up the container and loading the model weights into VRAM. 45 seconds is too fast to have downloaded the weights, at least.
RunPod
•Created by richterscale9 on 5/13/2024 in #⚡|serverless
Serverless vLLM doesn't work and gives no error message
Thanks!
1. I'm trying to run
TheBloke/airoboros-l2-13B-gpt4-1.4.1-GPTQ
on vLLM and it goes OOM when I have the 24GB VRAM GPU selected. It seems like 48GB VRAM is the smallest it runs on. The same model fits in under 12GB when I run it in oobabooga rather than vLLM. Is it expected behavior that vLLM requires so much more VRAM for the same model? (See the memory-settings sketch after this message.)
2. The output quality of the model seems degraded (comparing the same model on vLLM vs. oobabooga). For example, asking "What is the capital of France?" sometimes yields "", sometimes "<<SYS>>", sometimes "< Paris\n\n". I suspect that vLLM is not using the correct prompt template. The recommended API has abstractions that keep me from seeing what actually goes into the LLM. I was hoping to find a way to feed the entire prompt, special tokens included, into vLLM, but I haven't found one. I don't know if that's possible without modifying the vLLM source code? (See the raw-prompt sketch after this message.)
3. I'm trying to understand how FlashBoot works. It's so fast that it must be holding the model weights in VRAM, and it's obviously not possible for RunPod to hold everybody's stuff in VRAM all the time. What happens when I get a "bad cold start" and my workload is executed on a new instance? Is it going to take 15 minutes to spin up, like the first instance did? Or am I going to get a much more reasonable cold start, maybe 20 seconds to load the weights from disk into VRAM?
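On item 1: vLLM preallocates a large fraction of GPU memory up front for its KV cache (controlled by gpu_memory_utilization, default 0.9), which is one reason it can appear to need far more VRAM than oobabooga for the same weights. Below is a minimal sketch of the engine arguments usually tuned to fit a GPTQ model on a 24GB card; the parameter names are vLLM's, but the values are illustrative assumptions, not tested settings.

```python
# Sketch: vLLM engine arguments commonly trimmed to fit a 13B GPTQ model
# on a 24GB GPU. Values are illustrative, not tested.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/airoboros-l2-13B-gpt4-1.4.1-GPTQ",
    quantization="gptq",          # use the 4-bit GPTQ weights instead of fp16
    dtype="float16",
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM preallocates (weights + KV cache)
    max_model_len=4096,           # shorter max context -> smaller preallocated KV cache
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```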
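On item 2: one way to control exactly what reaches the model is a plain completions-style call rather than chat completions, so the prompt string, special tokens and all, is passed through as written. A sketch, assuming the RunPod vLLM worker's OpenAI-compatible route; the base_url pattern, the placeholders, and the example template are assumptions, so check the worker's README and the model card for the real path and prompt format.

```python
# Sketch: send a hand-written prompt (special tokens included) through an
# OpenAI-compatible *completions* endpoint so no chat template is applied.
# The base_url pattern and placeholders are assumptions, not verified values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder
    api_key="<RUNPOD_API_KEY>",                                   # placeholder
)

# Example prompt written out by hand; substitute whatever template the
# model card actually recommends.
prompt = (
    "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
    "What is the capital of France? [/INST]"
)

resp = client.completions.create(
    model="TheBloke/airoboros-l2-13B-gpt4-1.4.1-GPTQ",
    prompt=prompt,
    max_tokens=32,
)
print(resp.choices[0].text)
```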
RunPod
•Created by richterscale9 on 5/13/2024 in #⚡|serverless
Serverless vLLM doesn't work and gives no error message
Edit: I'm still having problems, but I think my current issues are related to vLLM rather than RunPod, so maybe I shouldn't bug you with them.
RunPod
•Created by richterscale9 on 5/13/2024 in #⚡|serverless
Serverless vLLM doesn't work and gives no error message
It looks like the problem was the default timeout value of 600. Initialization appeared to take longer than 600 seconds, so the instance would get rebooted every 600 seconds and never finish initializing. After I increased the timeout, initialization completed and I got a different error (GPU VRAM exceeded; I'll try a different-sized model).
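For reference, a sketch of raising the timeout for a single job via the serverless HTTP API, assuming RunPod's execution-policy fields (executionTimeout in milliseconds) work the way the docs describe; the endpoint-level default can also be raised in the endpoint settings.

```python
# Sketch: per-request execution policy with a longer timeout. Field names
# follow RunPod's execution-policy docs as I understand them; the endpoint
# ID and API key are placeholders.
import requests

ENDPOINT_ID = "<ENDPOINT_ID>"
API_KEY = "<RUNPOD_API_KEY>"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {"prompt": "What is the capital of France?"},
        "policy": {"executionTimeout": 1_800_000},  # 30 minutes, in milliseconds
    },
    timeout=30,
)
print(resp.json())
```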
RunPod
•Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
Yeah, reading the docs right now to figure out everything I need to do to try it... I currently have a Docker image that spins up a fork of the oobabooga web UI; I'm thinking about setting that up for the serverless experiment.
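For context, the core of a custom serverless worker is a small handler that the runpod SDK runs inside the container. A minimal sketch, where generate is a hypothetical stand-in for whatever inference call the oobabooga fork exposes:

```python
# Minimal sketch of a RunPod serverless handler. `generate` is a hypothetical
# placeholder for the real inference call inside the image.
import runpod

def generate(prompt: str) -> str:
    # placeholder for the actual model call
    return f"echo: {prompt}"

def handler(job):
    # RunPod delivers the request payload under job["input"]
    prompt = job["input"].get("prompt", "")
    return {"output": generate(prompt)}

# Start the worker loop; RunPod calls `handler` for each queued job.
runpod.serverless.start({"handler": handler})
```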
RunPod
•Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
That's just insane if it really works
RunPod
•Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
Does this 250ms cold boot time really include everything? Or does it only cover some of the steps, so that the actual cold boot might be more like 30 seconds? For example, just loading LLM weights into memory typically takes more than 10 seconds.
RunPod
•Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
Hey, sorry to hijack the thread. I'm also looking into deploying vLLM on RunPod serverless. The landing page indicates it should be possible to bring your own container, not pay for any idle time, and get a <250ms cold boot. Is this true? It sounds too good to be true.