RunPod vLLM CUDA Out of Memory
Hi, I've been using the default RunPod vLLM template with the Mixtral model loaded on a network volume. I'm encountering CUDA out of memory errors on cold starts.
Here is the error log.
2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
57 Replies
Which Mixtral model?
mistralai/Mixtral-8x7B-v0.1
That's too big to fit into 48GB; you need 2x A100 for it. You should look at using a quantized version instead, such as TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ
This version is also uncensored
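For rough context on why the full-precision model can't fit: Mixtral-8x7B has about 46.7B total parameters, so the fp16 weights alone are far larger than 48GB of VRAM. Below is a minimal Python sketch, assuming vLLM's standard LLM API and the GPTQ repo suggested above; the RunPod worker wires the same thing up through env vars rather than code.

# Back-of-the-envelope memory estimate (parameter count is approximate)
params = 46.7e9            # ~46.7B total parameters for Mixtral-8x7B
print(params * 2 / 1e9)    # ~93 GB of fp16 weights alone, far above 48 GB

# Hedged sketch of loading the quantized build directly with vLLM
from vllm import LLM
llm = LLM(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ",
    quantization="gptq",              # 4-bit weights fit within 48 GB
    gpu_memory_utilization=0.90,      # leave some VRAM headroom for the CUDA context
)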
Does this look correct?
Mixtral is just too much of a memory hog
Yeah looks fine
2024-01-15T20:52:23.809750811Z File "/handler.py", line 7, in <module>
2024-01-15T20:52:23.809891157Z vllm_engine = VLLMEngine()
2024-01-15T20:52:23.810037390Z ^^^^^^^^^^^^
2024-01-15T20:52:23.810098653Z File "/engine.py", line 38, in __init__
2024-01-15T20:52:23.810218453Z self.llm = self._initialize_llm()
2024-01-15T20:52:23.810380946Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.810389493Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T20:52:23.810576492Z raise e
2024-01-15T20:52:23.810592982Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T20:52:23.810735102Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(self.config))
2024-01-15T20:52:23.811013662Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811046045Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T20:52:23.811394368Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T20:52:23.811521064Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811549097Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T20:52:23.811785870Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T20:52:23.811983677Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812010660Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T20:52:23.812252599Z return engine_class(*args, **kwargs)
2024-01-15T20:52:23.812439826Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812447822Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T20:52:23.812621912Z self._init_workers(distributed_init_method)
2024-01-15T20:52:23.812687615Z File "/src/vllm/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-01-15T20:52:23.812863558Z self._run_workers(
Looks like I'm now getting a "disk quota exceeded" error.
Is your network volume full? Or did you not add the other environment variables for the Hugging Face cache, etc.?
prob increase ur container volume too - 5GB is tiny
Not necessary if the environment variables are set correctly
5GB is enough
I increased my network volume and got rid of that problem. I'll probably wipe my network volume so it doesn't have the old model on there anymore.
My jobs are getting stuck here, but the model is loading fine.
Use GPTQ not AWQ
I sent you this one
TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ
not sure why you changed it to AWQ
Changed it because of this lol oops
Oh, I don't know why the README says that, because your screenshot says AWQ quantization is not fully optimized yet 🤷‍♂️
It says the same with GPTQ
Oh okay, my bad sorry, AWQ is probably better then.
AWQ seems to be faulty too. CUDA seems to be breaking.
trust_remote_code
needs to be set to TRUE for Mixtral; not sure whether that's causing the issue. I don't see that as an environment var in the README.
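For reference, trust_remote_code is a standard vLLM engine argument; whether this worker template exposes it as an env var is not confirmed here. A minimal sketch of what it controls when calling vLLM directly:

# Hypothetical direct-API call (not the worker's handler code)
from vllm import LLM
llm = LLM(
    model="mistralai/Mixtral-8x7B-v0.1",
    trust_remote_code=True,   # allow the HF repo's custom modeling code to run
)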
Might need to fork it and add it yourself. Are you still using the 48GB GPU tier?
Yes.
2024-01-15T21:32:29.206204204Z INFO 01-15 21:32:29 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-01-15T21:32:29.587369692Z engine.py :56 2024-01-15 21:32:29,586 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:32:29.587459992Z Traceback (most recent call last):
2024-01-15T21:32:29.587477182Z File "/handler.py", line 7, in <module>
2024-01-15T21:32:29.587597751Z vllm_engine = VLLMEngine()
2024-01-15T21:32:29.587676721Z ^^^^^^^^^^^^
2024-01-15T21:32:29.587687568Z File "/engine.py", line 38, in __init__
2024-01-15T21:32:29.587846707Z self.llm = self._initialize_llm()
2024-01-15T21:32:29.587920123Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.587927757Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:32:29.588049626Z raise e
2024-01-15T21:32:29.588066343Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:32:29.588169416Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(self.config))
2024-01-15T21:32:29.588340362Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588362955Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:32:29.588594264Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:32:29.588675837Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588704403Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:32:29.588857283Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:32:29.588974769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589017799Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:32:29.589179321Z return engine_class(*args, kwargs)
2024-01-15T21:32:29.589276287Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589306141Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:32:29.589436730Z self._init_workers(distributed_init_method)
2024-01-15T21:32:29.589445070Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:32:29.589570340Z self._run_workers(
2024-01-15T21:32:29.589578206Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:32:29.589964835Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:32:29.589993004Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589998521Z File "/src/vllm/vllm/engine/llm_engine.py", line 737, in _run_workers_in_batch
2024-01-15T21:32:29.590353319Z output = executor(*args, **kwargs)
2024-01-15T21:32:29.590387619Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.590392816Z File "/src/vllm/vllm/worker/worker.py", line 67, in init_model
2024-01-15T21:32:29.590540725Z torch.cuda.set_device(self.device)
2024-01-15T21:32:29.590547462Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 404, in set_device
2024-01-15T21:32:29.590728911Z torch._C._cuda_setDevice(device)
2024-01-15T21:32:29.590792281Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
2024-01-15T21:32:29.590940554Z torch._C._cuda_init()
2024-01-15T21:32:29.590948904Z RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Tried using the default base Mistral and I'm still getting CUDA errors lol.
Probably related to
trust_remote_code
, it has to be true for Mixtral. It worked before, which is super weird.
The CUDA error was also for mistral
Oh yeah, that's strange then, didn't realise it was working
Which mistral model?
mistralai/Mistral-7B-v0.1
That's a pretty small model (~7B params, roughly 14GB in fp16), so there shouldn't be issues
Yeah not sure how this is giving me CUDA errors.
Probably need to log a GitHub issue for it.
https://github.com/runpod-workers/worker-vllm/issues
Still just keep getting stuck at this stage.
Hey! I had a similar issue loading AWQ models with this worker. I resolved it by setting the
GPU_MEMORY_UTILIZATION
variable to 0.90.
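Presumably that env var maps onto vLLM's gpu_memory_utilization engine argument, which caps how much VRAM the engine claims for weights plus KV cache. A sketch of the equivalent direct call, with the model name purely illustrative:

# Equivalent engine args if you were constructing the engine yourself (sketch)
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",  # illustrative model name
    quantization="awq",
    gpu_memory_utilization=0.90,   # claim 90% of VRAM, leaving headroom for the CUDA context
    download_dir="/runpod-volume/",
)
engine = AsyncLLMEngine.from_engine_args(engine_args)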
One more thing: it's recommended to use CUDA version 12.1. Try to change it by setting the env variable WORKER_CUDA_VERSION
to 12.1
I'm not sure, but you should probably change it in the Dockerfile. Setting it as an env variable probably won't work. (I may be wrong.) Yeah, we need the CUDA version filter for serverless like GPU Cloud has.
Is that an environment variable?
Just kidding, found it. @antoniog Also wondering if you baked your model into the Docker image? The spin-up time while using a network volume is quite slow.
@Alpay Ariyak, if you get a chance, could you glance over this?
Can't seem to use 12.1
It can't find the branch for . Should I just have it use 11.8?
Sorry, it's not that intuitive, but if you build from the main branch with --build-arg WORKER_CUDA_VERSION=12.1, it will correctly install everything for 12.1, so you don't need to modify the Dockerfile
Okay thank you!
What are some missing features and issues that made you build your own images rather than use the pre-built one? @here
I'm baking my model into the image to test whether it will make my workers faster. When using the default image, I'm getting delay times of up to 700s for it to load the model.
With the long delay time, it spins up other workers, thus increasing my cost.
I'm also trying the solution that @antoniog gave of changing the GPU_MEMORY_UTILIZATION value
Error log of a fresh fork from the repo.
No module named 'numpy'.
I'm trying to use the prebuilt now with 0.1.0. Will report back if it works
This is with the base image of 0.1.0 using OpenChat with no other env variables. It also spun up 3 other workers to accomplish this job.
Ran it again and it worked much faster hmm.
Different worker
Yes, it will always download the model during the first request
So the requirement to build an image yourself is Linux/Ubuntu and an NVIDIA GPU, which seems to be the issue here
How long does that stay loaded in the worker?
It will depend on whether you have FlashBoot on - with it, load times have been under 2s for me over relatively long periods of time
I do have FlashBoot on. I'm just worried about those first-request load times, which was why I looked into baking the model into the image.
This is with FlashBoot on
Also, I don't want to rely on network storage since it reduces the number of available GPUs I can use.
I think for Llama 7B our load times were around 22 seconds on the machine directly
We're working on improving the speed of model loading in vLLM
what context length are you running at?
not only do you need space for the fp16 weights, you also need space for context, which should be about 2 to 3GB for 4k to 8k context
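Rough arithmetic behind that estimate, assuming a 7B-class model without grouped-query attention (32 layers, 32 KV heads, head dim 128, fp16 cache); models like Mistral/Mixtral with only 8 KV heads need proportionally less:

# KV-cache size per token = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes
layers, kv_heads, head_dim, fp16_bytes = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes       # 512 KiB per token
for ctx in (4096, 8192):
    print(f"{ctx} tokens -> {per_token * ctx / 2**30:.1f} GiB")  # ~2 GiB and ~4 GiB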
So baking in the model doesn't help speed up model loading?
Option 2.