vllm + Ray issue: Stuck on "Started a local Ray instance."
Trying to run TheBloke/goliath-120b-AWQ on vLLM + RunPod with 2x 48GB GPUs. It gets stuck on "Started a local Ray instance." I've tried both with and without RunPod's FlashBoot. Has anyone encountered this issue before?
requirements.txt:
build script:
initialization code:
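(For context, a typical minimal initialization for this kind of setup looks roughly like the sketch below. This is assumed, since the attached code isn't reproduced here, and the engine arguments other than the model name are illustrative.)

```python
# Rough sketch of a vLLM async engine init for a 2-GPU AWQ model.
# Not the asker's actual code; arguments besides the model are assumptions.
from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="TheBloke/goliath-120b-AWQ",
    quantization="awq",             # AWQ-quantized checkpoint
    tensor_parallel_size=2,         # shard across the 2x 48GB GPUs
    gpu_memory_utilization=0.95,    # illustrative value
)

# With tensor_parallel_size > 1, vLLM brings up Ray for its workers;
# "Started a local Ray instance." is logged during this step.
engine = AsyncLLMEngine.from_engine_args(engine_args)
```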
9 Replies
Are you using a pod, or a serverless endpoint with worker vllm?
Serverless endpoint with vllm (custom minimal image)
This is because Ray doesn't get initialized with the right CPU count.
You can try this out: https://github.com/runpod-workers/worker-vllm. Play with lowering the environment variable VLLM_CPU_FRACTION, which is 1 by default.
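(Presumably the fraction is turned into the CPU count that Ray is initialized with, along the lines of the sketch below. This is purely illustrative and is not the actual worker-vllm or fork code.)

```python
# Illustrative only: how a CPU-fraction env var could cap Ray's CPU
# reservation instead of letting Ray autodetect it.
import os
import ray

cpu_fraction = float(os.environ.get("VLLM_CPU_FRACTION", "1"))
num_cpus = max(1, int((os.cpu_count() or 1) * cpu_fraction))

# Reserving fewer CPUs avoids Ray asking for more than the serverless
# container actually has, which is the mismatch described above.
ray.init(num_cpus=num_cpus)
```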
From the worker-vllm README:
- Tensor Parallelism: Note that the more GPUs you split a model's weights across, the slower it will be due to inter-GPU communication overhead. If you can fit the model on a single GPU, it is recommended to do so.
- TENSOR_PARALLEL_SIZE: Number of GPUs to shard the model across (default: 1).
- If you are having issues loading your model with Tensor Parallelism, try decreasing VLLM_CPU_FRACTION (default: 1).

I can't really find any references to this specific env var anywhere but the README (I tried looking in the vllm docs and the worker-vllm code)... are there any docs specifying the exact value that is required? Perhaps $(nproc), or since this is a "fraction"... automagically populate it with 1 / multiprocessing.cpu_count()?
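(For what it's worth, the "automagic" default suggested above would amount to something like the following. This is just a sketch of that suggestion; nothing in worker-vllm is confirmed to work this way.)

```python
# Sketch of the proposed default: one CPU's worth of the machine,
# expressed as a fraction, set before the worker starts.
import multiprocessing
import os

os.environ.setdefault(
    "VLLM_CPU_FRACTION",
    str(1 / multiprocessing.cpu_count()),
)
```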
It's because worker-vllm uses a fork of vllm: runpod/vllm-fork-for-sls-worker (see vllm/engine/ray_utils.py).
Oh, that makes sense... I might try rebuilding an image using that fork instead. Thanks!
Ofc! Lmk how it goes