Trying to load a huge model into serverless
https://huggingface.co/cognitivecomputations/dolphin-2.9.2-qwen2-72b
Anyone have any idea how to do this in vLLM?
I've deployed using two 80GB GPUs and have had no luck
2024-07-07T10:13:37.060080427Z INFO 07-07 10:13:37 ray_utils.py:96] Total CPUs: 252
2024-07-07T10:13:37.060112418Z INFO 07-07 10:13:37 ray_utils.py:97] Using 252 CPUs
2024-07-07T10:13:39.223150657Z 2024-07-07 10:13:39,222 INFO worker.py:1753 -- Started a local Ray instance.
2024-07-07T10:13:42.909013372Z INFO 07-07 10:13:42 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='cognitivecomputations/dolphin-2.9.2-qwen2-72b', speculative_config=None, tokenizer='cognitivecomputations/dolphin-2.9.2-qwen2-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir='/runpod-volume/huggingface-cache/hub', load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cognitivecomputations/dolphin-2.9.2-qwen2-72b)
2024-07-07T10:13:43.234774592Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-07T10:13:48.090819086Z INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634162208Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634349607Z INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971622090Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971661235Z INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888246699Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888281517Z INFO 07-07 10:13:51 utils.py:118] generating GPU P2P access cache for in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889113795Z INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889199350Z WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655130972Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:52.655172182Z (RayWorkerWrapper pid=14238) WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655176579Z INFO 07-07 10:13:52 weight_utils.py:200] Using model weights format ['*.safetensors']
There is no error; that last log line means it's still busy loading the model.
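If you want to sanity-check that it really is still loading rather than hung, you can watch the weight shards accumulate in the cache on the network volume. A rough sketch, assuming the `download_dir` shown in the engine config log above:

```python
# Rough progress check: sum the sizes of the safetensors shards already
# present in the HF cache on the network volume (path taken from the log).
from pathlib import Path

cache = Path("/runpod-volume/huggingface-cache/hub")
total_bytes = sum(f.stat().st_size for f in cache.rglob("*.safetensors"))
print(f"Weights on disk so far: {total_bytes / 1e9:.1f} GB")
# A 72B bf16 checkpoint is roughly 145 GB of weights, so expect this number
# to keep growing for a while on the first cold start.
```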
I wasn't able to load it using one 80GB GPU. Isn't 2 x 80GB excessive for the model size?
I assume you're loading it from network storage?
The weights are considerably more than 80GB, so it definitely won't fit into a single 80GB GPU
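Back-of-the-envelope: in bf16 each parameter is 2 bytes, so the weights alone for a 72B model are about 144 GB before any KV cache. A quick sketch of the arithmetic, using the values from the logged config:

```python
# Rough weight-memory estimate for a 72B-parameter model in bf16.
params = 72e9
bytes_per_param = 2  # bf16

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")                      # ~144 GB total
print(f"Per GPU at tensor_parallel_size=2: ~{weights_gb / 2:.0f} GB")  # ~72 GB of each 80 GB GPU
# That leaves very little headroom on 2x80GB once the KV cache reserved for
# the 131072-token max_seq_len and activation buffers are added on top.
```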
Max is 2x80GB with serverless?
You can also use multiple 48GB and 24GB GPUs
With 2x80GB I wasn't able to run that model, it gave me an out-of-memory error. I switched to 8x48GB and that works. 😂
And btw, you have to select 1, 2, 4, 8, 16, or 32 GPUs, can't pick 10
Also 4x48GB = 192GB doesn't work either, out of memory 😂
Correct
Ahh, so it needs even more than that, damn, so heavy on memory
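Much of the memory beyond the ~144 GB of weights is KV cache reserved for the 131072-token context in the logged config. If you want to stay on fewer GPUs, a hedged sketch of the knobs to try (parameter names are the standard vLLM 0.4.x `LLM` arguments, not a verified-working serverless config; translate them to your worker's equivalent env vars as needed):

```python
from vllm import LLM

# A sketch, not a tested config: cap the context length so vLLM reserves far
# less KV cache than the model's default 131072-token max_seq_len.
llm = LLM(
    model="cognitivecomputations/dolphin-2.9.2-qwen2-72b",
    tensor_parallel_size=2,           # two 80GB GPUs
    dtype="bfloat16",
    max_model_len=8192,               # much smaller KV-cache reservation
    gpu_memory_utilization=0.95,      # let vLLM use more of each GPU
    download_dir="/runpod-volume/huggingface-cache/hub",
    # quantization="awq" would cut weight memory too, but only if a
    # pre-quantized checkpoint of this model is available.
)
```

Whether 2x80GB is then enough still depends on the remaining headroom after weights; if it still OOMs, more or larger GPUs (or a quantized checkpoint) are the remaining options.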
Memory problem? never heard of it
If the post author has read this far, maybe it's because your model is still loading...