Error getting a response from a serverless deployment
I tried to create multiple serverless vLLM deployments and even picked the top-end GPU. However, the requests would always go to an in-progress status and never return a response. I'm building a chat app, and such a slow response isn't acceptable. Is there something else I should do? I selected all the default options for the google/gemma-2b model while creating the deployment. I know the requests from my app hit RunPod, as I could see the requests and their status, but it would never respond. I was trying to use the OpenAI-compatible endpoints. Would appreciate any help on this.
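For reference, the call from my app looks roughly like this (a minimal sketch, not my exact code; the endpoint ID and API key are placeholders):

```python
# Rough sketch of the request my app sends to the OpenAI-compatible endpoint.
# <ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="google/gemma-2b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```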
10 Replies
Are the requests coming in constantly?
You're using vLLM, right?
It was 1 request, and yes, I'm using vLLM.
I made sure I purchased enough credits and tried to run my POC, but if this is the kind of response time I can expect, then I'm worried it might not work for us. I wonder how others are able to get fast inference.
There is time spent downloading and loading the model.
I'm sure that accounts for most of the time spent on the first request.
Try subsequent requests
Are you using a network volume? Doing so can add a 30-60+ second delay to each request. Baking your models into your image will reduce startup delay.
Have you enabled FlashBoot? This will speed up subsequent requests (after the initial request).
How many active workers have you set? If you have at least 1, it will speed up requests.
How many max workers have you set? I suggest you set this to 30, since 30 is the max (without requesting more). You don't pay for these workers until they respond to requests. Having enough workers to handle your requests will really speed things up.
Actually, even the subsequent requests were taking a long time. I tried with multiple models. Perhaps the documentation can be updated to spell out exactly what someone needs to do to get a fast inference service when hosting on RunPod.
It depends on how the code in the image loads the model. If it loads the model for each request, that will take a long time per request. Fastest is to load the model into your Docker image at build time and ensure your code uses that model (doesn't download it again). If you cannot do that and are loading your model (not baked into the Docker image), you should do it outside the RunPod handler, e.g. in a startup script. Anything inside the handler runs on every request, so you don't want to load models there.
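Something like this is the pattern I mean. This is only a minimal sketch (not the code from any official template), assuming a vLLM-based worker and the RunPod Python SDK; the model name and input schema are assumptions:

```python
# Minimal sketch of the "load once, outside the handler" pattern.
# Illustrative only; model name and input schema are assumptions.
import runpod
from vllm import LLM, SamplingParams

# Loaded once at worker startup, NOT inside the handler, so each
# request only pays for inference, not for loading the weights.
llm = LLM(model="google/gemma-2b")
sampling = SamplingParams(temperature=0.7, max_tokens=256)

def handler(job):
    # Everything in here runs on every request, so keep it to inference only.
    prompt = job["input"]["prompt"]
    outputs = llm.generate([prompt], sampling)
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```

The key point is that the model load happens once per worker at startup, and the handler body stays tiny.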
I agree the docs could/should be updated. But if you have used RunPod very much, I am sure you will have experienced their support, or lack thereof, so don't hold your breath waiting for new docs. This server is your best chance at getting support, but that is just from friendly users, not RunPod the company.
I would be happy to look at your code to see if I can offer any suggestions.
Thanks. Actually, I didn't even write any code. I was trying their serverless vLLM quick deploy template and making a call to the OpenAI-compatible endpoint that shows up post-deployment. It was very basic.
I don't trust those; there is no way to know what they are doing under the hood. Plus, it seems that by the time they are added to this list, they are outdated. 😦
Hey, I created a vLLM endpoint with google/gemma-2b and the speed seems OK, around one second, and I just use runsync. However, the output doesn't seem very good. Btw, I set active workers to 1 so I won't experience any cold starts.
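By runsync I just mean the synchronous endpoint, roughly like this (the endpoint ID and API key are placeholders, and the exact input schema depends on the worker):

```python
# Rough sketch of a synchronous (runsync) request to the endpoint.
# <ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={"input": {"prompt": "Hello!"}},
    timeout=120,
)
print(resp.json())
```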
And with 80 GB of memory, it's filling up to almost 90%. You might consider adding more GPUs to enhance performance. Additionally, there are numerous adjustments you can make. I'm not an expert in those areas, so I can't provide detailed advice, but I encourage you to explore the options:
https://docs.runpod.io/serverless/workers/vllm/environment-variables
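For example, a few of the endpoint settings look something like this (names as I recall them from that page and values purely illustrative, so double-check the linked docs):

```
MODEL_NAME=google/gemma-2b
MAX_MODEL_LEN=4096
GPU_MEMORY_UTILIZATION=0.90
TENSOR_PARALLEL_SIZE=1
```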