Trying to load a huge model into serverless
https://huggingface.co/cognitivecomputations/dolphin-2.9.2-qwen2-72b
Anyone have any idea how to do this in vLLM?
I've deployed using two 80GB GPUs and have had no luck
2024-07-07T10:13:37.060080427Z INFO 07-07 10:13:37 ray_utils.py:96] Total CPUs: 252
2024-07-07T10:13:37.060112418Z INFO 07-07 10:13:37 ray_utils.py:97] Using 252 CPUs
2024-07-07T10:13:39.223150657Z 2024-07-07 10:13:39,222 INFO worker.py:1753 -- Started a local Ray instance.
2024-07-07T10:13:42.909013372Z INFO 07-07 10:13:42 llm_engine.py:100] Initializing an LLM engine (v0.4.2) with config: model='cognitivecomputations/dolphin-2.9.2-qwen2-72b', speculative_config=None, tokenizer='cognitivecomputations/dolphin-2.9.2-qwen2-72b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir='/runpod-volume/huggingface-cache/hub', load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=cognitivecomputations/dolphin-2.9.2-qwen2-72b)
2024-07-07T10:13:43.234774592Z Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-07T10:13:48.090819086Z INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634162208Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:48 utils.py:628] Found nccl from environment variable VLLM_NCCL_SO_PATH=/usr/lib/x86_64-linux-gnu/libnccl.so.2
2024-07-07T10:13:49.634349607Z INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971622090Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:49 selector.py:27] Using FlashAttention-2 backend.
2024-07-07T10:13:50.971661235Z INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888246699Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:50 pynccl_utils.py:43] vLLM is using nccl==2.17.1
2024-07-07T10:13:51.888281517Z INFO 07-07 10:13:51 utils.py:118] generating GPU P2P access cache for in /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889113795Z INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:51.889199350Z WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655130972Z (RayWorkerWrapper pid=14238) INFO 07-07 10:13:51 utils.py:132] reading GPU P2P access cache from /root/.config/vllm/gpu_p2p_access_cache_for_0,1.json
2024-07-07T10:13:52.655172182Z (RayWorkerWrapper pid=14238) WARNING 07-07 10:13:51 custom_all_reduce.py:74] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
2024-07-07T10:13:52.655176579Z INFO 07-07 10:13:52 weight_utils.py:200] Using model weights format ['*.safetensors']
There is no error; that last log line means it's still busy loading the model.
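If you want to sanity-check that it really is still loading rather than hung, you can watch the weight shards accumulate in the cache on the network volume. A rough sketch, assuming the `download_dir` shown in the engine config log above:

```python
# Rough progress check: sum the sizes of the safetensors shards already
# present in the HF cache on the network volume (path taken from the log).
from pathlib import Path

cache = Path("/runpod-volume/huggingface-cache/hub")
total_bytes = sum(f.stat().st_size for f in cache.rglob("*.safetensors"))
print(f"Weights on disk so far: {total_bytes / 1e9:.1f} GB")
# A 72B bf16 checkpoint is roughly 145 GB of weights, so expect this number
# to keep growing for a while on the first cold start.
```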
I wasn't able to load it using one 80GB GPU. Isn't 2 x 80GB excessive for the model size?
I assume you're loading it from network storage?
The weights are considerably more than 80GB, so it definitely won't fit into a single 80GB GPU
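Back-of-the-envelope: in bf16 each parameter is 2 bytes, so the weights alone for a 72B model are about 144 GB before any KV cache. A quick sketch of the arithmetic, using the values from the logged config:

```python
# Rough weight-memory estimate for a 72B-parameter model in bf16.
params = 72e9
bytes_per_param = 2  # bf16

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")                      # ~144 GB total
print(f"Per GPU at tensor_parallel_size=2: ~{weights_gb / 2:.0f} GB")  # ~72 GB of each 80 GB GPU
# That leaves very little headroom on 2x80GB once the KV cache reserved for
# the 131072-token max_seq_len and activation buffers are added on top.
```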
Max is 2x80GB with serverless?
You can also use multiple 48GB and 24GB GPUs
With 2x80GB I wasn't able to run that model, it gave me an out-of-memory error. I switched to 8x48GB and that works. 😂
And btw, you have to select 1, 2, 4, 8, 16, or 32 GPUs, can't pick 10
Also 4x48GB = 192GB doesn't work either, out of memory 😂
Correct
Ahh, so it needs even more than that, damn, so heavy on memory
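Much of the memory beyond the ~144 GB of weights is KV cache reserved for the 131072-token context in the logged config. If you want to stay on fewer GPUs, a hedged sketch of the knobs to try (parameter names are the standard vLLM 0.4.x `LLM` arguments, not a verified-working serverless config; translate them to your worker's equivalent env vars as needed):

```python
from vllm import LLM

# A sketch, not a tested config: cap the context length so vLLM reserves far
# less KV cache than the model's default 131072-token max_seq_len.
llm = LLM(
    model="cognitivecomputations/dolphin-2.9.2-qwen2-72b",
    tensor_parallel_size=2,           # two 80GB GPUs
    dtype="bfloat16",
    max_model_len=8192,               # much smaller KV-cache reservation
    gpu_memory_utilization=0.95,      # let vLLM use more of each GPU
    download_dir="/runpod-volume/huggingface-cache/hub",
    # quantization="awq" would cut weight memory too, but only if a
    # pre-quantized checkpoint of this model is available.
)
```

Whether 2x80GB is then enough still depends on the remaining headroom after weights; if it still OOMs, more or larger GPUs (or a quantized checkpoint) are the remaining options.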
Memory problem? never heard of it
If the post author has read this far, maybe it's because your model is still loading...