vllm + Ray issue: Stuck on "Started a local Ray instance."
Trying to run TheBloke/goliath-120b-AWQ on vLLM + RunPod with 2x 48GB GPUs. It gets stuck on "Started a local Ray instance." I've tried both with and without RunPod's FlashBoot. Has anyone encountered this issue before?
requirements.txt:
build script:
initialization code:
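(For context, a typical minimal initialization for this kind of setup looks roughly like the sketch below. This is assumed, since the attached code isn't reproduced here, and the engine arguments other than the model name are illustrative.)

```python
# Rough sketch of a vLLM async engine init for a 2-GPU AWQ model.
# Not the asker's actual code; arguments besides the model are assumptions.
from vllm import AsyncLLMEngine
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="TheBloke/goliath-120b-AWQ",
    quantization="awq",             # AWQ-quantized checkpoint
    tensor_parallel_size=2,         # shard across the 2x 48GB GPUs
    gpu_memory_utilization=0.95,    # illustrative value
)

# With tensor_parallel_size > 1, vLLM brings up Ray for its workers;
# "Started a local Ray instance." is logged during this step.
engine = AsyncLLMEngine.from_engine_args(engine_args)
```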
9 Replies
Are you using a pod, or a serverless endpoint with worker vllm?
Serverless endpoint with vllm (custom minimal image)
This is because Ray doesn't get initialized with the right CPU count.
You can try this out: https://github.com/runpod-workers/worker-vllm. Play with lowering the environment variable VLLM_CPU_FRACTION, which is 1 by default.
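(Presumably the fraction is turned into the CPU count that Ray is initialized with, along the lines of the sketch below. This is purely illustrative and is not the actual worker-vllm or fork code.)

```python
# Illustrative only: how a CPU-fraction env var could cap Ray's CPU
# reservation instead of letting Ray autodetect it.
import os
import ray

cpu_fraction = float(os.environ.get("VLLM_CPU_FRACTION", "1"))
num_cpus = max(1, int((os.cpu_count() or 1) * cpu_fraction))

# Reserving fewer CPUs avoids Ray asking for more than the serverless
# container actually has, which is the mismatch described above.
ray.init(num_cpus=num_cpus)
```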
From the worker-vllm README:
- Tensor Parallelism: Note that the more GPUs you split a model's weights across, the slower it will be due to inter-GPU communication overhead. If you can fit the model on a single GPU, it is recommended to do so.
- TENSOR_PARALLEL_SIZE: Number of GPUs to shard the model across (default: 1).
- If you are having issues loading your model with Tensor Parallelism, try decreasing VLLM_CPU_FRACTION (default: 1).

I can't really find any references to this specific env var anywhere but the README (I tried looking in the vllm docs and the worker-vllm code)... are there any docs specifying the exact value that is required? Perhaps $(nproc), or since this is a "fraction"... automagically populate it with 1 / multiprocessing.cpu_count()?
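(For what it's worth, the "automagic" default suggested above would amount to something like the following. This is just a sketch of that suggestion; nothing in worker-vllm is confirmed to work this way.)

```python
# Sketch of the proposed default: one CPU's worth of the machine,
# expressed as a fraction, set before the worker starts.
import multiprocessing
import os

os.environ.setdefault(
    "VLLM_CPU_FRACTION",
    str(1 / multiprocessing.cpu_count()),
)
```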
It's because worker-vllm uses a fork of vllm: runpod/vllm-fork-for-sls-worker (see vllm/engine/ray_utils.py).
Oh, that makes sense... I might try rebuilding an image using that fork instead. Thanks!
Ofc! Lmk how it goes