Too many failed requests
Hello. I've tried to run casperhansen/mixtral-instruct-awq (https://huggingface.co/casperhansen/mixtral-instruct-awq) on A100 80 GB and A100 SXM 80 GB GPUs, sending 10 requests per second using this script: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py.
However, most of the requests failed with an "Aborted request" log from vLLM. This issue didn't occur on another platform with the same GPU and the same code, so I'm not sure whether the problem is with vLLM or with RunPod's internal processing.
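To give a sense of the load pattern, here is a stripped-down sketch of what roughly 10 requests per second looks like against a vLLM OpenAI-compatible endpoint. The URL, model name, prompt, and request count are placeholders; the actual test used benchmark_serving.py.

```python
# Sketch of the load pattern only: ~10 requests/s against a vLLM
# OpenAI-compatible server. Values below are illustrative placeholders.
import asyncio
import aiohttp

URL = "http://localhost:8000/v1/completions"   # assumed local vLLM server
MODEL = "casperhansen/mixtral-instruct-awq"
REQUEST_RATE = 10     # requests per second, as in the benchmark run
NUM_REQUESTS = 100    # placeholder request count

async def send_request(session: aiohttp.ClientSession, i: int) -> None:
    payload = {
        "model": MODEL,
        "prompt": f"Request {i}: write one sentence about GPUs.",
        "max_tokens": 64,
    }
    try:
        async with session.post(URL, json=payload) as resp:
            await resp.json()
            print(f"request {i}: HTTP {resp.status}")
    except aiohttp.ClientError as exc:
        print(f"request {i}: failed ({exc})")

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = []
        for i in range(NUM_REQUESTS):
            tasks.append(asyncio.create_task(send_request(session, i)))
            await asyncio.sleep(1 / REQUEST_RATE)   # pace requests at ~10/s
        await asyncio.gather(*tasks)

asyncio.run(main())
```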
Could anyone provide guidance on what the cause might be?
Solution
Why are you using GPU Cloud for this? If you want to handle many concurrent requests, you need to use Serverless, not GPU Cloud.
https://github.com/runpod-workers/worker-vllm
GitHub: runpod-workers/worker-vllm - The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
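Once an endpoint is deployed from that worker, requests go through RunPod's Serverless API rather than directly to vLLM. A rough sketch of a call (the endpoint ID, API key, and input schema below are placeholders; check the worker-vllm README for the exact request format):

```python
# Rough sketch of calling a Serverless endpoint built from worker-vllm.
# ENDPOINT_ID, API_KEY, and the "prompt" input field are assumptions here.
import requests

ENDPOINT_ID = "your-endpoint-id"     # placeholder
API_KEY = "your-runpod-api-key"      # placeholder

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Explain AWQ quantization in one sentence."}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```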
Thank you for your reply. I wanted to test how many requests it can handle. I'm still learning about LLMs and how to host them, and I wasn't aware that GPU Cloud isn't suitable for handling many concurrent requests. Could you explain a bit more about why Serverless is preferable to GPU Cloud in this context, or point me to any documents with more detailed information?
Overview | RunPod Documentation
An overview of Serverless GPU computing for AI inference and training.
Thank you. I appreciate it 🙂