Using Same GPU for multiple requests?
Hello @here,
I am using ComfyUI plus my own custom scripts to generate images. I have set it up on RunPod Serverless (A100 GPUs) in the following way:
1. The request contains an image URL.
2. The image is downloaded and processed, and the output image is uploaded to S3.

Each task takes around 30 seconds.
However, a single request uses at most about 10% of the GPU memory. I want multiple requests to share the same GPU so that overall throughput is higher.
Is there a way to do this? Is there some existing template to handle this scenario?
3 Replies
@flash-singh @Justin
Yes, look at our vLLM worker and how it uses the SDK to handle multiple requests in parallel via concurrency. @Justin can provide further details. FYI, it's a holiday weekend, so support will be limited.
GitHub
worker-vllm/src/handler.py at main · runpod-workers/worker-vllm
The RunPod worker template for serving our large language model endpoints. Powered by VLLM. - runpod-workers/worker-vllm
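For reference, the pattern the vLLM worker uses is an async handler combined with a concurrency setting, so one worker can keep several requests in flight on the same GPU. The sketch below shows the core idea using only the standard library (no RunPod SDK), with `handle_request` as a hypothetical stand-in for the actual download → ComfyUI → S3 pipeline and a semaphore capping how many jobs run at once; the names and the concurrency limit are assumptions, not the worker's actual code.

```python
import asyncio

# Hypothetical stand-in for the real GPU pipeline
# (download image -> run ComfyUI -> upload result to S3).
async def handle_request(job_id: str) -> dict:
    await asyncio.sleep(0.1)  # placeholder for the ~30s task
    return {"job_id": job_id, "status": "done"}

async def serve(jobs: list[str], max_concurrency: int = 4) -> list[dict]:
    # The semaphore caps in-flight jobs on this worker, playing the
    # role of the concurrency setting a serverless SDK would expose.
    sem = asyncio.Semaphore(max_concurrency)

    async def run(job: str) -> dict:
        async with sem:
            return await handle_request(job)

    # All jobs are scheduled at once; at most max_concurrency
    # execute concurrently on the same GPU worker.
    return await asyncio.gather(*(run(j) for j in jobs))

results = asyncio.run(serve([f"job-{i}" for i in range(8)]))
```

With the handler written as a coroutine like this, eight 0.1 s jobs finish in roughly 0.2 s instead of 0.8 s; the same idea lets one A100 worker process several 30-second image jobs concurrently instead of serially.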