RunPod vLLM CUDA Out of Memory
Hi, I've been using the default RunPod vLLM template with the Mixtral model loaded on a network volume. I'm encountering CUDA out of memory errors on cold starts.
Here is the error log.
2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
57 Replies
Which Mixtral model?
mistralai/Mixtral-8x7B-v0.1
That's too big to fit into 48GB; you need 2x A100 for it. You should look at using a quantized version instead, such as TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ
This version is also uncensored
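For rough context on why the full-precision model can't fit: Mixtral-8x7B has about 46.7B total parameters, so the fp16 weights alone are far larger than 48GB of VRAM. Below is a minimal Python sketch, assuming vLLM's standard LLM API and the GPTQ repo suggested above; the RunPod worker wires the same thing up through env vars rather than code.

# Back-of-the-envelope memory estimate (parameter count is approximate)
params = 46.7e9            # ~46.7B total parameters for Mixtral-8x7B
print(params * 2 / 1e9)    # ~93 GB of fp16 weights alone, far above 48 GB

# Hedged sketch of loading the quantized build directly with vLLM
from vllm import LLM
llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ",
    quantization="gptq",              # 4-bit weights fit within 48 GB
    gpu_memory_utilization=0.90,      # leave some VRAM headroom for the CUDA context
)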
Does this look correct?
Mixtral is just too much of a memory hog
Yeah looks fine
2024-01-15T20:52:23.809750811Z File "/handler.py", line 7, in <module>
2024-01-15T20:52:23.809891157Z vllm_engine = VLLMEngine()
2024-01-15T20:52:23.810037390Z ^^^^^^^^^^^^
2024-01-15T20:52:23.810098653Z File "/engine.py", line 38, in __init__
2024-01-15T20:52:23.810218453Z self.llm = self._initialize_llm()
2024-01-15T20:52:23.810380946Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.810389493Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T20:52:23.810576492Z raise e
2024-01-15T20:52:23.810592982Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T20:52:23.810735102Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(self.config))
2024-01-15T20:52:23.811013662Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811046045Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T20:52:23.811394368Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T20:52:23.811521064Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811549097Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T20:52:23.811785870Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T20:52:23.811983677Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812010660Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T20:52:23.812252599Z return engine_class(*args, **kwargs)
2024-01-15T20:52:23.812439826Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812447822Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T20:52:23.812621912Z self._init_workers(distributed_init_method)
2024-01-15T20:52:23.812687615Z File "/src/vllm/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-01-15T20:52:23.812863558Z self._run_workers(
Looks like I'm now getting a "disk quota exceeded" error.
Is your network volume full? Or did you not add the other environment variables for the Hugging Face cache, etc.?
prob increase ur container volume too - 5GB is tiny
Not necessary if the environment variables are set correctly
5GB is enough
I increased my network volume and got rid of that problem. I'll probably wipe my network volume so it doesn't have the old model on there anymore.
My jobs are getting stuck here, but the model is loading fine.
Use GPTQ not AWQ
I sent you this one
TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ
not sure why you changed it to AWQ
Changed it because of this lol oops
Oh, I don't know why the README says that, because your screenshot says AWQ quantization is not fully optimized yet 🤷‍♂️
It says the same with GPTQ
Oh okay, my bad sorry, AWQ is probably better then.
AWQ seems to be faulty too. CUDA seems to be breaking.
trust_remote_code
needs to be set to TRUE for Mixtral; not sure whether that's causing the issue. I don't see that as an environment var in the README.
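For reference, trust_remote_code is a standard vLLM engine argument; whether this worker template exposes it as an env var is not confirmed here. A minimal sketch of what it controls when calling vLLM directly:

# Hypothetical direct-API call (not the worker's handler code)
from vllm import LLM
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    trust_remote_code=True,   # allow the HF repo's custom modeling code to run
)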
Might need to fork it and add it yourself. Are you still using the 48GB GPU tier?
Yes.
2024-01-15T21:32:29.206204204Z INFO 01-15 21:32:29 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-01-15T21:32:29.587369692Z engine.py :56 2024-01-15 21:32:29,586 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:32:29.587459992Z Traceback (most recent call last):
2024-01-15T21:32:29.587477182Z File "/handler.py", line 7, in <module>
2024-01-15T21:32:29.587597751Z vllm_engine = VLLMEngine()
2024-01-15T21:32:29.587676721Z ^^^^^^^^^^^^
2024-01-15T21:32:29.587687568Z File "/engine.py", line 38, in __init__
2024-01-15T21:32:29.587846707Z self.llm = self._initialize_llm()
2024-01-15T21:32:29.587920123Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.587927757Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:32:29.588049626Z raise e
2024-01-15T21:32:29.588066343Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:32:29.588169416Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(self.config))
2024-01-15T21:32:29.588340362Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588362955Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:32:29.588594264Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:32:29.588675837Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588704403Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:32:29.588857283Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:32:29.588974769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589017799Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:32:29.589179321Z return engine_class(*args, kwargs)
2024-01-15T21:32:29.589276287Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589306141Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:32:29.589436730Z self._init_workers(distributed_init_method)
2024-01-15T21:32:29.589445070Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:32:29.589570340Z self._run_workers(
2024-01-15T21:32:29.589578206Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:32:29.589964835Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:32:29.589993004Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589998521Z File "/src/vllm/vllm/engine/llm_engine.py", line 737, in _run_workers_in_batch
2024-01-15T21:32:29.590353319Z output = executor(*args, **kwargs)
2024-01-15T21:32:29.590387619Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.590392816Z File "/src/vllm/vllm/worker/worker.py", line 67, in init_model
2024-01-15T21:32:29.590540725Z torch.cuda.set_device(self.device)
2024-01-15T21:32:29.590547462Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 404, in set_device
2024-01-15T21:32:29.590728911Z torch._C._cuda_setDevice(device)
2024-01-15T21:32:29.590792281Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
2024-01-15T21:32:29.590940554Z torch._C._cuda_init()
2024-01-15T21:32:29.590948904Z RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Tried using the default base Mistral and I'm still getting CUDA errors lol.
Probably related to
trust_remote_code
, it has to be true for Mixtral. It worked before, which is super weird.
The CUDA error was also for mistral
Oh yeah, that's strange then, didn't realise it was working
Which mistral model?
mistralai/Mistral-7B-v0.1
That's a pretty small model (~7B params, roughly 14GB in fp16), so there shouldn't be issues
Yeah not sure how this is giving me CUDA errors.
Probably need to log a GitHub issue for it.
https://github.com/runpod-workers/worker-vllm/issues
Still just keep getting stuck at this stage.
Hey! I had a similar issue loading AWQ models with this worker. I resolved it by setting the
GPU_MEMORY_UTILIZATION
variable to 0.90.
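Presumably that env var maps onto vLLM's gpu_memory_utilization engine argument, which caps how much VRAM the engine claims for weights plus KV cache. A sketch of the equivalent direct call, with the model name purely illustrative:

# Equivalent engine args if you were constructing the engine yourself (sketch)
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",  # illustrative model name
    quantization="awq",
    gpu_memory_utilization=0.90,   # claim 90% of VRAM, leaving headroom for the CUDA context
    download_dir="/runpod-volume/",
)
engine = AsyncLLMEngine.from_engine_args(engine_args)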
One more thing: it's recommended to use CUDA version 12.1. Try to change it by setting the env variable WORKER_CUDA_VERSION
to 12.1
I'm not sure, but you should probably change it in the Dockerfile. Setting it as an env variable probably won't work. (I may be wrong.) Yeah, we need the CUDA version filter for serverless like GPU Cloud has.
Is that an environment variable?
Just kidding, found it. @antoniog Also wondering if you baked your model into the Docker image? The spin-up time while using a network volume is quite slow.
@Alpay Ariyak, if you get a chance, could you glance over this?
Can't seem to use 12.1
It can't find the branch for . Should I just have it use 11.8?
Sorry, it's not that intuitive, but if you build from the main branch with --build-arg WORKER_CUDA_VERSION=12.1, it will correctly install everything for 12.1, so you don't need to modify the Dockerfile
Okay thank you!
What are some missing features and issues that made you build your own images rather than use the pre-built one? @here
I'm baking my model into the image to test whether it will make my workers faster. When using the default image, I'm getting delay times of up to 700s for it to load the model.
With the long delay time, it spins up other workers, thus increasing my cost.
I'm also trying the solution that @antoniog gave of changing the GPU_MEMORY_UTILIZATION value
Error log of a fresh fork from the repo.
No module named 'numpy'.
I'm trying to use the prebuilt now with 0.1.0. Will report back if it works
This is with the base image of 0.1.0 using OpenChat with no other env variables. It also spun up 3 other workers to accomplish this job.
Ran it again and it worked much faster hmm.
Different worker
Yes, it will always download the model during the first request
So the requirement to build an image yourself is Linux/Ubuntu and an NVIDIA GPU, which seems to be the issue here
How long does that stay loaded in the worker?
It will depend on whether you have FlashBoot on - with it, load times have been under 2s for me over relatively long periods of time
I do have FlashBoot on. I'm just worried about those first-request load times, which was why I looked into baking the model into the image.
This is with FlashBoot on
Also, I don't want to rely on network storage since it reduces the number of available GPUs I can use.
I think for Llama 7B our load times were around 22 seconds on the machine directly
We're working on improving the speed of model loading in vLLM
what context length are you running at?
not only do you need space for the fp16 weights, you also need space for context, which should be about 2 to 3GB for 4k to 8k context
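Rough arithmetic behind that estimate, assuming a 7B-class model without grouped-query attention (32 layers, 32 KV heads, head dim 128, fp16 cache); models like Mistral/Mixtral with only 8 KV heads need proportionally less:

# KV-cache size per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
layers, kv_heads, head_dim, fp16_bytes = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes       # 512 KiB per token
for ctx in (4096, 8192):
    print(f"{ctx} tokens -> {per_token * ctx / 2**30:.1f} GiB")  # ~2 GiB and ~4 GiB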
So baking in the model doesn't help speed up model loading?
Option 2.