Serverless doesn't scale
Endpoint id: cilhdgrs7rbzya
I have some requests which require workers with 4x RTX 4090s. The endpoint's "Max Workers" is 150 and the Scale Type is "Request Count" with a value of 1.
When I sent 78 requests concurrently, only ~20% of them started within 10 s; the P80 delay was ~600 s.
Is this because there aren't enough GPUs? When the stock status shows "Availability: High", how many workers can I expect it to scale up to in the meantime?
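In case anyone wants to reproduce this kind of measurement, here is a rough sketch using the runpod Python SDK: it submits 78 jobs concurrently and times how long each sits in the queue before a worker picks it up. The payload, API-key handling, and exact status strings are assumptions to check against the SDK docs, not the OP's actual client.
```python
# Sketch: measure serverless queue delay for a burst of concurrent jobs.
# Payload and status handling are illustrative assumptions, not the OP's workload.
import time
from concurrent.futures import ThreadPoolExecutor

import runpod

runpod.api_key = "YOUR_API_KEY"
endpoint = runpod.Endpoint("cilhdgrs7rbzya")

def queue_delay(_):
    """Submit one job and return the seconds it spent waiting in the queue."""
    submitted = time.time()
    job = endpoint.run({"input": {"prompt": "ping"}})  # placeholder input
    while job.status() == "IN_QUEUE":                  # IN_QUEUE -> IN_PROGRESS -> COMPLETED
        time.sleep(1)
    return time.time() - submitted

with ThreadPoolExecutor(max_workers=78) as pool:
    delays = sorted(pool.map(queue_delay, range(78)))

within_10s = sum(d <= 10 for d in delays) / len(delays)
p80 = delays[int(0.8 * len(delays)) - 1]               # approximate P80
print(f"started within 10s: {within_10s:.0%}, P80 queue delay: {p80:.0f}s")
```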
What's your worker status?
Are they throttled?
Try increasing your max workers if your workers are full.
And what do you run inside the worker? What kind of model?
I think using request count is great for handling a steady or predictable increase in request volume. Setting the count to 1 will immediately increase the workers, which I agree should work. However, for burst traffic, queue delay might work better. You can define the maximum wait time in the queue, ensuring that jobs don’t wait longer than that before they get processed.
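If it helps, switching the scaler is a one-field change. A rough sketch against the RunPod GraphQL API follows; the saveEndpoint input fields (scalerType, scalerValue, workersMax) are written from memory and the mutation may require additional fields, so treat this as an outline and verify against the current API docs.
```python
# Sketch: switch an endpoint from Request Count (value 1) to Queue Delay scaling
# via the RunPod GraphQL API. Field names and required inputs are assumptions.
import os
import requests

API_URL = "https://api.runpod.io/graphql"

MUTATION = """
mutation {
  saveEndpoint(input: {
    id: "cilhdgrs7rbzya",
    scalerType: "QUEUE_DELAY",  # was REQUEST_COUNT
    scalerValue: 4,             # aim for jobs waiting at most ~4 s in the queue
    workersMax: 150
  }) {
    id
    scalerType
    scalerValue
  }
}
"""

resp = requests.post(
    API_URL,
    params={"api_key": os.environ["RUNPOD_API_KEY"]},
    json={"query": MUTATION},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```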
Are you asking for 4x 4090s in one worker?
I think he's asking about the scaling: when availability is high, how many workers can it scale up to?
And why the loading time/cold starts are high.
Not cold start time. The delay time is high; it can even reach ~600 s.
ok
@pxmwxd can I ask why you need 4x 4090s in one worker? That will impact scale. Even if we have plenty of 4090s, requesting 4x will impact scaling, since most hosts come in 2x and 4x configurations and 8x ones are rare. What's likely happening during scaling is that you're getting throttled.
PM me the endpoint ID and I can check to make sure this is the case.
2x A6000 will give you easier scaling; the higher the GPU count per worker, the greater the chance of a high delay time. I can also see if we can optimize this for you.
I've resolved the issue. For future reference, anyone else scaling this big will hit the $40/hr spending limit, even on serverless; the only way to increase it is to reach out to us so you can scale beyond that. This also means we need to do a better job of surfacing this, possibly in the logs.
Is there any doc link about the $40/hr limitation? I'm researching modal.com as a replacement for RunPod, and the top-priority issue is the GPU concurrency limit (which on modal.com is 30 for Pro users).
Just reach out to them via the contact link.
The link is in the website dashboard.
We can increase that if needed; reach out to support.