RunPod • 11mo ago
Concept

RunPod vLLM CUDA Out of Memory

Hi, I've been using the default RunPod vLLM template with the Mixtral model loaded on a network volume. I'm encountering CUDA out of memory on cold starts. Here is the error log:
2024-01-15T20:32:13.726720287Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 47.54 GiB of which 16.75 MiB is free. Process 422202 has 47.51 GiB memory in use. Of the allocated memory 47.05 GiB is allocated by PyTorch, and 12.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
57 Replies
ashleyk • 11mo ago
Which Mixtral model?
Concept (OP) • 11mo ago
mistralai/Mixtral-8x7B-v0.1
ashleyk • 11mo ago
That's too big to fit into 48GB; you need 2 x A100s for it. You should look at using a quantized version instead, such as TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ. This version is also uncensored.
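(Rough arithmetic behind that sizing, as a sketch in Python; the parameter count is approximate:)

# Mixtral-8x7B has roughly 46.7B parameters in total (only ~13B are active per token,
# but all of them must sit in VRAM).
params = 46.7e9
fp16_bytes = 2
print(f"fp16 weights: ~{params * fp16_bytes / 1e9:.0f} GB")   # ~93 GB, versus the 47.5 GB card in the log above
print(f"4-bit (GPTQ/AWQ): ~{params * 0.5 / 1e9:.0f} GB")      # ~23 GB, which fits in 48 GB with room for KV cache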
Concept (OP) • 11mo ago
Does this look correct?
(screenshot attached)
justin • 11mo ago
Mixtral is just too much of a memory hog
ashleyk • 11mo ago
Yeah looks fine
Concept (OP) • 11mo ago
2024-01-15T20:52:23.809750811Z File "/handler.py", line 7, in <module>
2024-01-15T20:52:23.809891157Z vllm_engine = VLLMEngine()
2024-01-15T20:52:23.810037390Z ^^^^^^^^^^^^
2024-01-15T20:52:23.810098653Z File "/engine.py", line 38, in __init__
2024-01-15T20:52:23.810218453Z self.llm = self._initialize_llm()
2024-01-15T20:52:23.810380946Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.810389493Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T20:52:23.810576492Z raise e
2024-01-15T20:52:23.810592982Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T20:52:23.810735102Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T20:52:23.811013662Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811046045Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T20:52:23.811394368Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T20:52:23.811521064Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.811549097Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T20:52:23.811785870Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T20:52:23.811983677Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812010660Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T20:52:23.812252599Z return engine_class(*args, **kwargs)
2024-01-15T20:52:23.812439826Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T20:52:23.812447822Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T20:52:23.812621912Z self._init_workers(distributed_init_method)
2024-01-15T20:52:23.812687615Z File "/src/vllm/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-01-15T20:52:23.812863558Z self._run_workers(
Looks like I'm getting a disk quota exceeded.
ashleyk • 11mo ago
Is your network volume full? Or did you not add the other environment variables for the Hugging Face cache, etc.?
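(A quick way to check both of those from inside the worker; a sketch only. The variable names come from the worker Dockerfile posted later in this thread, and /runpod-volume is where the network volume is mounted:)

import os
import shutil

# Do the Hugging Face caches actually point at the network volume?
for var in ("HF_DATASETS_CACHE", "HUGGINGFACE_HUB_CACHE", "TRANSFORMERS_CACHE"):
    print(var, "=", os.environ.get(var, "<unset>"))

# And is the network volume itself full?
total, used, free = shutil.disk_usage("/runpod-volume")
print(f"network volume: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")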
justin • 11mo ago
Probably increase your container volume too; 5GB is tiny.
ashleyk • 11mo ago
Not necessary if the environment variables are set correctly; 5GB is enough.
Concept (OP) • 11mo ago
I increased my network volume and got rid of that problem. I'll probably wipe my network volume so it doesn't have the old model on there anymore.
Concept (OP) • 11mo ago
(screenshot attached)
Concept (OP) • 11mo ago
My jobs are getting stuck here but the model is loading in fine.
ashleyk • 11mo ago
Use GPTQ, not AWQ. I sent you this one: TheBloke/dolphin-2.7-mixtral-8x7b-GPTQ. Not sure why you changed it to AWQ.
Concept (OP) • 11mo ago
Changed it because of this lol, oops.
(screenshot attached)
Concept (OP) • 11mo ago
(screenshot attached)
ashleyk • 11mo ago
Oh, don't know why the README says that, because your screenshot says AWQ quantization is not fully optimized yet 🤷‍♂️
Concept (OP) • 11mo ago
It says the same with GPTQ.
Concept (OP) • 11mo ago
(screenshot attached)
ashleyk • 11mo ago
Oh okay, my bad, sorry. AWQ is probably better then.
Concept (OP) • 11mo ago
config.json: 0%| | 0.00/1.06k [00:00<?, ?B/s]
config.json: 100%|██████████| 1.06k/1.06k [00:00<00:00, 3.12MB/s]
2024-01-15T21:25:12.631302229Z WARNING 01-15 21:25:12 config.py:175] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-01-15T21:25:12.631674354Z INFO 01-15 21:25:12 llm_engine.py:73] Initializing an LLM engine with config: model='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer='TheBloke/dolphin-2.7-mixtral-8x7b-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
2024-01-15T21:25:12.878504399Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-01-15T21:25:12.968988083Z engine.py :56 2024-01-15 21:25:12,968 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:25:12.969011647Z Traceback (most recent call last):
2024-01-15T21:25:12.969017540Z File "/handler.py", line 7, in <module>
2024-01-15T21:25:12.969079606Z vllm_engine = VLLMEngine()
2024-01-15T21:25:12.969193802Z ^^^^^^^^^^^^
2024-01-15T21:25:12.969200856Z File "/engine.py", line 38, in __init__
2024-01-15T21:25:12.969316102Z self.llm = self._initialize_llm()
2024-01-15T21:25:12.969405051Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969414475Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:25:12.969528071Z raise e
2024-01-15T21:25:12.969535724Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:25:12.969631284Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:25:12.969861773Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.969879906Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:25:12.970089032Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:25:12.970165425Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970203878Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:25:12.970334541Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:25:12.970462593Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970470800Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:25:12.970637632Z return engine_class(*args, **kwargs)
2024-01-15T21:25:12.970733855Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:25:12.970746229Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:25:12.970891958Z self._init_workers(distributed_init_method)
2024-01-15T21:25:12.970907978Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:25:12.970999331Z self._run_workers(
2024-01-15T21:25:12.971006647Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:25:12.971260200Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:25:12.971432982Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AWQ seems to be faulty too. CUDA seems to be breaking.
ashleyk • 11mo ago
trust_remote_code needs to be set to TRUE for Mixtral; not sure whether that's causing the issue.
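(At the vLLM level this is just an engine argument. A minimal sketch of forcing it on directly, assuming the worker builds its AsyncEngineArgs from env vars as the tracebacks suggest; how the template actually exposes it may differ:)

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Hypothetical direct call; the worker's engine.py does the equivalent from its config.
engine_args = AsyncEngineArgs(
    model="mistralai/Mixtral-8x7B-v0.1",
    download_dir="/runpod-volume/",
    trust_remote_code=True,  # the logs above show trust_remote_code=False
)
engine = AsyncLLMEngine.from_engine_args(engine_args)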
Concept (OP) • 11mo ago
I don't see that as an environment var in the README.
ashleyk • 11mo ago
Might need to fork it and add it yourself 🙈. Are you still using the 48GB GPU tier?
Concept (OP) • 11mo ago
Yes.
2024-01-15T21:32:29.206204204Z INFO 01-15 21:32:29 llm_engine.py:73] Initializing an LLM engine with config: model='mistralai/Mistral-7B-v0.1', tokenizer='mistralai/Mistral-7B-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir='/runpod-volume/', load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-01-15T21:32:29.587369692Z engine.py :56 2024-01-15 21:32:29,586 Error initializing vLLM engine: CUDA driver initialization failed, you might not have a CUDA gpu.
2024-01-15T21:32:29.587459992Z Traceback (most recent call last):
2024-01-15T21:32:29.587477182Z File "/handler.py", line 7, in <module>
2024-01-15T21:32:29.587597751Z vllm_engine = VLLMEngine()
2024-01-15T21:32:29.587676721Z ^^^^^^^^^^^^
2024-01-15T21:32:29.587687568Z File "/engine.py", line 38, in __init__
2024-01-15T21:32:29.587846707Z self.llm = self._initialize_llm()
2024-01-15T21:32:29.587920123Z ^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.587927757Z File "/engine.py", line 57, in _initialize_llm
2024-01-15T21:32:29.588049626Z raise e
2024-01-15T21:32:29.588066343Z File "/engine.py", line 54, in _initialize_llm
2024-01-15T21:32:29.588169416Z return AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**self.config))
2024-01-15T21:32:29.588340362Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588362955Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
2024-01-15T21:32:29.588594264Z engine = cls(parallel_config.worker_use_ray,
2024-01-15T21:32:29.588675837Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.588704403Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 269, in __init__
2024-01-15T21:32:29.588857283Z self.engine = self._init_engine(*args, **kwargs)
2024-01-15T21:32:29.588974769Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589017799Z File "/src/vllm/vllm/engine/async_llm_engine.py", line 314, in _init_engine
2024-01-15T21:32:29.589179321Z return engine_class(*args, **kwargs)
2024-01-15T21:32:29.589276287Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589306141Z File "/src/vllm/vllm/engine/llm_engine.py", line 110, in __init__
2024-01-15T21:32:29.589436730Z self._init_workers(distributed_init_method)
2024-01-15T21:32:29.589445070Z File "/src/vllm/vllm/engine/llm_engine.py", line 142, in _init_workers
2024-01-15T21:32:29.589570340Z self._run_workers(
2024-01-15T21:32:29.589578206Z File "/src/vllm/vllm/engine/llm_engine.py", line 763, in _run_workers
2024-01-15T21:32:29.589964835Z self._run_workers_in_batch(workers, method, *args, **kwargs))
2024-01-15T21:32:29.589993004Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.589998521Z File "/src/vllm/vllm/engine/llm_engine.py", line 737, in _run_workers_in_batch
2024-01-15T21:32:29.590353319Z output = executor(*args, **kwargs)
2024-01-15T21:32:29.590387619Z ^^^^^^^^^^^^^^^^^^^^^^^^^
2024-01-15T21:32:29.590392816Z File "/src/vllm/vllm/worker/worker.py", line 67, in init_model
2024-01-15T21:32:29.590540725Z torch.cuda.set_device(self.device)
2024-01-15T21:32:29.590547462Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 404, in set_device
2024-01-15T21:32:29.590728911Z torch._C._cuda_setDevice(device)
2024-01-15T21:32:29.590792281Z File "/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py", line 298, in _lazy_init
2024-01-15T21:32:29.590940554Z torch._C._cuda_init()
2024-01-15T21:32:29.590948904Z RuntimeError: CUDA driver initialization failed, you might not have a CUDA gpu.
Tried using the base default Mistral and still getting CUDA errors lol.
ashleyk • 11mo ago
Probably related to trust_remote_code; it has to be true for Mixtral.
Concept (OP) • 11mo ago
It worked before, which is super weird. The CUDA error was also for Mistral.
ashleyk • 11mo ago
Oh yeah, that's strange then, didn't realise it was working. Which Mistral model?
Concept (OP) • 11mo ago
mistralai/Mistral-7B-v0.1
ashleyk • 11mo ago
That's a pretty small model, so there shouldn't be issues.
Concept (OP) • 11mo ago
(screenshot attached)
Concept (OP) • 11mo ago
Yeah not sure how this is giving me CUDA errors.
ashleyk • 11mo ago
Probably need to log a GitHub issue for it: https://github.com/runpod-workers/worker-vllm/issues
Concept (OP) • 11mo ago
Still just keep getting stuck at this stage.
(screenshot attached)
antoniog • 11mo ago
Hey! I had a similar issue with loading AWQ models with this worker. I resolved it by setting the GPU_MEMORY_UTILIZATION variable to 0.90. One more thing: it's recommended to use CUDA version 12.1. Try to change it by setting the env variable WORKER_CUDA_VERSION to 12.1. I'm not sure, but you should probably change it in the Dockerfile; setting it as an env variable probably won't work. (I may be wrong.)
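(For context, GPU_MEMORY_UTILIZATION corresponds to vLLM's gpu_memory_utilization engine argument; a sketch, assuming the worker passes it straight through:)

from vllm import AsyncEngineArgs, AsyncLLMEngine

# gpu_memory_utilization is the fraction of VRAM vLLM claims up front for
# weights + KV cache; 0.90 leaves some headroom for other allocations.
engine_args = AsyncEngineArgs(
    model="TheBloke/dolphin-2.7-mixtral-8x7b-AWQ",
    quantization="awq",
    download_dir="/runpod-volume/",
    gpu_memory_utilization=0.90,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)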
ashleyk • 11mo ago
Yeah we need the CUDA version filter for serverless like GPU cloud has.
Concept (OP) • 11mo ago
Is that an environment variable? Just kidding, found it. @antoniog Also wondering if you baked your model into the Docker image? The spin-up time while using a network volume is quite slow.
Justin Merrell • 11mo ago
@Alpay Ariyak if you get a chance to glance this over.
Concept (OP) • 11mo ago
Can't seem to use 12.1
# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 12.1* ]]; then \
python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121; \
python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
else \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
fi && \
rm -rf /root/.cache/pip
It can't find the branch for https://github.com/runpod/[email protected]#egg=vllm. Should I just have it use 11.8?
Alpay Ariyak • 11mo ago
Sorry, it's not that intuitive, but if you build from the main branch with --build-arg WORKER_CUDA_VERSION=12.1, it will correctly install everything for 12.1, so you don't need to modify the Dockerfile.
Concept (OP) • 11mo ago
Okay thank you!
Alpay Ariyak • 11mo ago
What are some missing features and issues that made you build your own images rather than use the pre-built one? @here
Concept (OP) • 11mo ago
I'm baking my model into the image to test whether it will make my workers faster. When using the default image, I'm getting delay times of up to 700s for it to load the model. With the long delay time, it spins up other workers, thus increasing my cost. I'm also trying the solution that @antoniog gave of changing the GPU util value.
Concept (OP) • 11mo ago
Error log from a fresh fork of the repo: No module named 'numpy'. Here's the Dockerfile:
# syntax = docker/dockerfile:1.3
ARG WORKER_CUDA_VERSION=11.8
FROM runpod/base:0.4.2-cuda${WORKER_CUDA_VERSION}.0 as builder

ARG WORKER_CUDA_VERSION=11.8 # Required duplicate to keep in scope

# Set Environment Variables
ENV WORKER_CUDA_VERSION=${WORKER_CUDA_VERSION} \
HF_DATASETS_CACHE="/runpod-volume/huggingface-cache/datasets" \
HUGGINGFACE_HUB_CACHE="/runpod-volume/huggingface-cache/hub" \
TRANSFORMERS_CACHE="/runpod-volume/huggingface-cache/hub" \
HF_TRANSFER=1


# Install Python dependencies
COPY builder/requirements.txt /requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
python3.11 -m pip install --upgrade pip && \
python3.11 -m pip install --upgrade -r /requirements.txt && \
rm /requirements.txt

# Install torch and vllm based on CUDA version
RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
else \
python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
fi && \
rm -rf /root/.cache/pip

# Add source files
COPY src .

# Setup for Option 2: Building the Image with the Model included
ARG MODEL_NAME="TheBloke/mixtral-8x7b-v0.1-AWQ"
ARG MODEL_BASE_PATH="/models"
ARG QUANTIZATION="awq"

ENV MODEL_BASE_PATH=$MODEL_BASE_PATH \
MODEL_NAME=$MODEL_NAME \
QUANTIZATION=$QUANTIZATION

RUN --mount=type=secret,id=HF_TOKEN,required=false \
if [ -f /run/secrets/HF_TOKEN ]; then \
export HF_TOKEN=$(cat /run/secrets/HF_TOKEN); \
fi && \
if [ -n "$MODEL_NAME" ]; then \
python3.11 /download_model.py --model $MODEL_NAME; \
fi

# Start the handler
CMD ["python3.11", "/handler.py"]
Concept (OP) • 11mo ago
I'm trying to use the prebuilt now with 0.1.0. Will report back if it works
Concept (OP) • 11mo ago
This is with the base image of 0.1.0 using OpenChat with no other env variables. It also spun up 3 other workers to accomplish this job.
(screenshot attached)
Concept (OP) • 11mo ago
Ran it again and it worked much faster, hmm.
(screenshot attached)
Concept (OP) • 11mo ago
Different worker
Alpay Ariyak • 11mo ago
Yes, it will always download the model during the first request. So the requirement for building an image yourself is Linux/Ubuntu and an NVIDIA GPU, which seems to be the issue here.
Concept (OP) • 11mo ago
How long does that stay loaded in the worker?
Alpay Ariyak • 11mo ago
It will depend on whether you have FlashBoot on; with it, load times have been under 2s for me over relatively long periods of time.
Concept (OP) • 11mo ago
I do have FlashBoot on. I'm just worried about those first-request load times, which is why I looked into baking the model into the image.
Concept (OP) • 11mo ago
(screenshot attached)
Concept (OP) • 11mo ago
This is with FlashBoot on. Also, I don't want to rely on network storage since it reduces the number of available GPUs I can use.
Alpay Ariyak • 11mo ago
I think for Llama 7B our load times were around 22 seconds on the machine directly. We're working on improving the speed of model loading in vLLM.
Superintendent • 11mo ago
What context are you running at? Not only do you need space for the fp16 weights, you also need space for the context, which should be about 2 to 3GB for 4k to 8k of context.
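(Rough numbers for that context memory, as a per-sequence sketch using Mistral-7B's config of 32 layers, 8 KV heads, head dim 128, and an fp16 cache; vLLM pre-allocates a much larger cache pool on top of this for batching:)

layers, kv_heads, head_dim, fp16 = 32, 8, 128, 2
bytes_per_token = 2 * layers * kv_heads * head_dim * fp16   # K and V: 131,072 B = 128 KiB per token
for ctx in (4096, 8192):
    print(f"{ctx} tokens -> {ctx * bytes_per_token / 2**30:.2f} GiB of KV cache per sequence")
# ~0.5-1 GiB here; models without grouped-query attention (32 KV heads) need ~4x more,
# which is where estimates in the 2-3 GB range come from.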
Concept (OP) • 11mo ago
So baking in the model doesn't help speed up model loading? Option 2.