Error building worker-vllm docker image for mixtral 8x7b
I'm running the following command to build and tag a docker worker image based off of worker-vllm:
docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" --build-arg MODEL_BASE_PATH="/models" .
I'm getting the following error:
------
Dockerfile:23
--------------------
22 | # Install torch and vllm based on CUDA version
23 | >>> RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
24 | >>> python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
25 | >>> python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
26 | >>> else \
27 | >>> python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
28 | >>> fi && \
29 | >>> rm -rf /root/.cache/pip
30 |
--------------------
ERROR: failed to solve: process "/bin/bash -o pipefail -c if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; else python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; fi && rm -rf /root/.cache/pip" did not complete successfully: exit code: 1
Are you building this on a system that has a GPU?
cc: @Alpay Ariyak
Yes - building it on a Windows PC with a 4090. I'm running the command on WSL though (Windows Subsystem for Linux), if that helps
Just tried it on regular command prompt, and can confirm that I'm getting the same error
Hi Justin, can you help me with a few questions? I need to develop and deploy a RAG system based on an open-source LLM. I have tried several times on RunPod Serverless with A6000/A100: it starts the worker and container, then it can download 50 GB or 150 GB of the weights, but never all 270 GB, and it just stops and restarts the download again and again, burning money with no real outcome. I just can't deploy LLaMA-70B; RunPod doesn't give me a chance. What should I do? Is the Cloud GPU option more suitable and stable for production than Serverless?
If you are downloading weights before the handler starts, then your worker is timing out and being removed. Ideally you will want to either have the weights stored in network storage or have them baked into the worker image.
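For example, a rough sketch of baking the weights into the image at build time (this is not from the worker-vllm Dockerfile; the model ID is only an illustration, and it assumes python3/pip are available in your base image):

# Download the model snapshot into the image at build time so a cold-started
# worker never has to pull weights before the handler can run.
RUN python3 -m pip install "huggingface_hub>=0.20" && \
    python3 -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-Instruct-v0.2', local_dir='/models')"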
I have only one file, rp_handler.py, where all the code is located. It runs after the last command in the Docker container; once the handler function is triggered, my Hugging Face weights start downloading on line 35:
AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
Sometimes it works on the first try, but specifically with the big LLaMA, whose weights take up to an hour to download in my case, the container stops without throwing an error while there is still 1 job in the queue. The small Mistral-7B usually works great, but when I take a bigger model it just doesn't work
So network storage for my workers will definitely help and will be a good deployment practice in terms of using RunPod?
Yes, network storage sounds like what you are missing
@Justin , any idea on the error I'm getting?
Great, thanks a lot. Btw, I'm located in Eastern Europe, so how do I choose the best region to open my network storage in? By distance, EU-RO-1 and EU-CZ-1 should be the closest, but maybe some regions have more GPUs in general to choose from and work with?
Will check on this in a few hours
Some more detail - it looks like it fails when trying to run setup.py develop for vllm. Ninja is trying to compile and fails.
Also confirming that I'm getting the same error when trying to build Llama2-13b. I was able to build and deploy Llama2-13b a month ago, so something must have changed since then
I also noticed that GitHub is showing the builds failing on CD
@Alpay Ariyak @Justin If you guys have any update on this, I would be interested in knowing the outcome as I am facing this issue as well
For what it's worth, I can't even get it to work using the pre-built Docker image with environment variables. When I use this method to spin up an endpoint, I'm getting CUDA out of memory errors, even though I selected a 48GB GPU
Just to be clear - it doesn't let you get past the setup.py script for vllm, correct? This is where it breaks for me with the pre-built Dockerfile as well
I built my own where I just added vllm to the requirements.txt file for download and that worked better
or you can do RUN pip install vllm
Yup that's where it breaks for me too
Did you have to specify a specific version for vllm?
Don't think you can build it on GitHub because I believe it now requires the machine you're building on to have a GPU.
@ashleyk if I'm not mistaken, @wizardjoe mentioned his machine has a GPU and mine also does. The docker image is not working correctly for either of us though and it's breaking at the same point
Working on this
@Herai_Studios @ashleyk Yes, I have a 4090
To confirm, you’re also not able to just sudo docker build . ?
@Alpay Ariyak you mean "sudo docker build ." without tagging and any of the other args?
Yeah, just without args
Seems to me that in general the issue is not being on something like Ubuntu, since I’ve never had issues building on it from different Ubuntu machines
We’re working on a way to allow building on any OS; vLLM’s recent updates added the changes that resulted in Linux-only installation
Trying "sudo docker build ." now
Were you able to repro the problem on a Windows box?
Just finished running - it fails as well
If it's the case that this won't work on Windows, do you know if it would work on an Ubuntu VM on a windows host running with Hyper-V?
Any updates on this? I also tried this on a Debian box I spun up in Google Cloud, but it also fails during the "Running setup.py develop for vllm" step. This time, though, it just freezes completely before showing the word "Killed". The machine has an NVIDIA L4 GPU with 24 GB VRAM and 64 GB RAM
FYI... For whoever reads this and is having the same issue, I finally got it working by doing the workaround @Herai_Studios suggested and making a few more changes: 1) remove or comment out lines 25-27 in the Dockerfile so that Docker doesn't try to compile vllm from source; 2) after line 32, add a new line "RUN pip install vllm", which installs the PyPI version of vllm since we aren't compiling it anymore; and 3) when running the docker build command, specify WORKER_CUDA_VERSION=12.1, since there is another issue where the latest version of vllm won't work with CUDA 11.8.
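Roughly, the Dockerfile change looks like this (a sketch based on the Dockerfile quoted in the error above, not an exact patch; line numbers may have shifted since):

# Source build of the vLLM fork commented out so Docker no longer compiles it:
# RUN if [[ "${WORKER_CUDA_VERSION}" == 11.8* ]]; then \
#         python3.11 -m pip install -U --force-reinstall torch==2.1.2 xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu118; \
#         python3.11 -m pip install -e git+https://github.com/runpod/[email protected]#egg=vllm; \
#     else \
#         python3.11 -m pip install -e git+https://github.com/runpod/vllm-fork-for-sls-worker.git#egg=vllm; \
#     fi && \
#     rm -rf /root/.cache/pip

# Added further down: install the prebuilt PyPI wheel instead
RUN pip install vllm

and the build command from the top of the thread becomes something like:

docker build -t lesterhnh/mixtral-8x7b-instruct-v0.1-runpod-serverless:1.0 \
  --build-arg MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" \
  --build-arg MODEL_BASE_PATH="/models" \
  --build-arg WORKER_CUDA_VERSION="12.1" .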
Nice! I'll add that if you have CUDA 11.8, the reason vllm won't work is that you also need to make sure you have the right PyTorch build for CUDA 11.8
so how does it look when it works for you?
@Herai_Studios it spends some time downloading the model safetensors, and then after that, it exports the layers and then writes the image. I haven't tested the endpoint yet, will let you know more tomorrow
Any updates on getting this working? Also struggling to use Mixtral 8x7B AWQ with the RunPod vLLM worker. I have 32 GB of RAM and my machine is crashing at the part where it's running setup.py develop for vllm - my RAM just skyrockets.
Unfortunately, Linux is an official requirement for vLLM, and WSL wouldn't work either (https://github.com/vllm-project/vllm/issues/1685)
However, we believe we have a workaround that we're still actively testing
I’m using Linux and have an Nvidia GPU
In that case, try adding this before the vLLM installation and adjust max_jobs and nvcc_threads as needed
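The snippet itself didn't make it into this log, but the idea is to cap the build parallelism through environment variables the vLLM source build picks up. A minimal sketch, assuming you place it just above the RUN that installs vllm:

# Limit compile parallelism so the vLLM source build doesn't exhaust RAM/swap.
# Lower values use less memory but build more slowly; tune to your machine.
ENV MAX_JOBS=2
ENV NVCC_THREADS=2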
Yes, just like that, did it work?
It's building now. Waiting
Sounds good
It might take a while with max_jobs=2, so I'd maybe try starting with something like 75% of the default (which is the number of CPUs you have) and go down if you experience crashes
Gotcha. It was the swap memory being full that caused my machine to crash, not the RAM per se.
num of cores right?
Yep it def got past the erroring part.
downloading tensors now
Nice! Keep me posted
Got it built and pushed. Loading it into an endpoint and seeing what happens.
CUDA OOM with a 24GB GPU
Trying with a 48GB to see if it fixes
2024-01-19T18:58:40.556447675Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacty of 23.68 GiB of which 39.62 MiB is free. Process 3426835 has 23.63 GiB memory in use. Of the allocated memory 23.22 GiB is allocated by PyTorch, and 33.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@Alpay Ariyak
For some reason it doesn't accept the chat template for conversation history.
The mixtral you're using is a base model, so it doesn't have a chat template
Mixtral Instruct would have one
Thank you.
2024-01-19T20:28:00.200082421Z INFO 01-19 20:28:00 llm_engine.py:70] Initializing an LLM engine with config: model='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer='TheBloke/mixtral-8x7b-v0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir='/models', load_format=auto, tensor_parallel_size=1, quantization=awq, enforce_eager=False, seed=0)
This log line is taking the most time. I'm stuck here for about 2-3 minutes
So the reason why I'm trying to use Mixtral is the use of experts and also its context window.
I'm open to using OpenChat. Would it be possible to increase the context size beyond 8k, or is that fixed?
@Justin
Mixtral is fine but you need the instruct version of it if you want to have a chat template. Otherwise, you can put your text input as prompt
The mixtral you’re using is a completion model, not an instruction or chat model, so it doesn’t have a template
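For illustration only (the endpoint ID and API key are placeholders, and the exact input schema depends on the worker-vllm version you built), a plain-prompt request to the endpoint might look roughly like:

curl -X POST https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync \
  -H "Authorization: Bearer $RUNPOD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"input": {"prompt": "The three main advantages of mixture-of-experts models are"}}'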
Will look into it thank you.
That’s the model loading stage