Too many failed requests
Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80 GB GPUs, sending 10 requests per second using this script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py.
However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure whether the problem is with vLLM or with RunPod's internal processing.
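To give a sense of the load pattern, here is a stripped-down sketch of what roughly 10 requests per second looks like against a vLLM OpenAI-compatible endpoint. The URL, model name, prompt, and request count are placeholders; the actual test used benchmark_serving.py.

```python
# Sketch of the load pattern only: ~10 requests/s against a vLLM
# OpenAI-compatible server. Values below are illustrative placeholders.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"   # assumed local vLLM server
MODEL = "casperhansen/mixtral-instruct-awq"
REQUEST_RATE = 10     # requests per second, as in the benchmark run
NUM_REQUESTS = 100    # placeholder request count

async def send_request(session: aiohttp.ClientSession, i: int) -> None:
    payload = {
        "model": MODEL,
        "prompt": f"Request {i}: write one sentence about GPUs.",
        "max_tokens": 64,
    }
    try:
        async with session.post(URL, json=payload) as resp:
            await resp.json()
            print(f"request {i}: HTTP {resp.status}")
    except aiohttp.ClientError as exc:
        print(f"request {i}: failed ({exc})")

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(NUM_REQUESTS):
            tasks.append(asyncio.create_task(send_request(session, i)))
            await asyncio.sleep(1 / REQUEST_RATE)   # pace requests at ~10/s
        await asyncio.gather(*tasks)

asyncio.run(main())
```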
Could anyone provide guidance on what the cause might be?
Solution
Why are you using GPU Cloud for this? If you want to handle many concurrent requests, you need to use Serverless, not GPU Cloud.
https://github.com/runpod-workers/worker-vllm
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
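Once an endpoint is deployed from that worker, requests go through RunPod's Serverless API rather than directly to vLLM. A rough sketch of a call (the endpoint ID, API key, and input schema below are placeholders; check the worker-vllm README for the exact request format):

```python
# Rough sketch of calling a Serverless endpoint built from worker-vllm.
# ENDPOINT_ID, API_KEY, and the "prompt" input field are assumptions here.
import requests

ENDPOINT_ID = "your-endpoint-id"     # placeholder
API_KEY = "your-runpod-api-key"      # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Explain AWQ quantization in one sentence."}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```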
Thank you for your reply. I wanted to test how many requests it can handle. I'm still learning about LLMs and how to host them, and I wasn't aware that GPU Cloud isn't suitable for handling many concurrent requests. Could you explain a bit more about why Serverless is preferable to GPU Cloud in this context, or point me to any documents with more detailed information?
Overview | RunPod Documentation
An overview of Serverless GPU computing for AI inference and training.
Thank you. I appreciate it 🙂