RunPod2mo ago
pxmwxd

Serverless doesn't scale

Endpoint ID: cilhdgrs7rbzya. I have some requests which require workers with 4x RTX 4090s. "Max workers" for the endpoint is 150 and "Request Count" in Scale Type is 1. When I sent 78 requests concurrently, only ~20% of them could start within 10s; the P80 wait was ~600s. Is this because there aren't enough GPUs? When the stock status shows "availability: high", how many workers can I expect to scale up in the meantime?
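For reference, here is a minimal sketch of how a test like this can be reproduced against the standard RunPod serverless HTTP API (`POST /run` to queue a job, `GET /status/{job_id}` to poll). The `delayTime` field (queue wait in milliseconds) and the `PAYLOAD` contents are assumptions based on the usual job-status response, not confirmed from this thread:

```python
# Sketch: fire N concurrent requests at a RunPod serverless endpoint and
# record how long each job waits in the queue before a worker picks it up.
# Assumes the standard /run and /status endpoints; "delayTime" (ms in queue)
# is assumed to be present in the job status response.
import os
import time
import concurrent.futures
import requests

ENDPOINT_ID = "cilhdgrs7rbzya"            # endpoint from the question
API_KEY = os.environ["RUNPOD_API_KEY"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
PAYLOAD = {"input": {"prompt": "hello"}}  # hypothetical worker input

def run_one() -> float:
    """Queue one job, poll until it leaves the queue, return delay in seconds."""
    job = requests.post(f"{BASE}/run", json=PAYLOAD, headers=HEADERS).json()
    job_id = job["id"]
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json()
        if status["status"] != "IN_QUEUE":
            # delayTime is reported once a worker has picked the job up
            return status.get("delayTime", 0) / 1000.0
        time.sleep(2)

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=78) as pool:
        delays = sorted(pool.map(lambda _: run_one(), range(78)))
    p80 = delays[int(len(delays) * 0.8)]
    within_10s = sum(d <= 10 for d in delays) / len(delays)
    print(f"started within 10s: {within_10s:.0%}, P80 delay: {p80:.0f}s")
```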
10 Replies
nerdylive
nerdylive2mo ago
What's your worker status? Are they throttled? Try increasing your max workers if your workers are full. Also, what do you run inside the worker? What kind of model?
yhlong00000
yhlong000002mo ago
I think using request count is great for handling a steady or predictable increase in request volume. Setting the count to 1 will immediately increase the workers, which I agree should work. However, for burst traffic, queue delay might work better. You can define the maximum wait time in the queue, ensuring that jobs don’t wait longer than that before they get processed.
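For anyone wanting to try the queue-delay scaler programmatically, here is a rough sketch against RunPod's GraphQL API. The mutation name and the `scalerType`/`scalerValue` field names are assumptions; verify against the current API reference (or just change the setting in the console) before relying on this:

```python
# Sketch: switch an endpoint between request-count and queue-delay scaling.
# The saveEndpoint mutation and field names (scalerType, scalerValue) are
# assumptions; check RunPod's current API docs for the exact schema.
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.io/graphql?api_key={API_KEY}"

MUTATION = """
mutation SaveEndpoint($input: EndpointInput!) {
  saveEndpoint(input: $input) { id scalerType scalerValue }
}
"""

def set_scaler(endpoint_id: str, scaler_type: str, scaler_value: int) -> dict:
    # scaler_type: "REQUEST_COUNT" (new worker per N queued requests) or
    # "QUEUE_DELAY" (new worker once jobs wait longer than N seconds)
    variables = {"input": {"id": endpoint_id,
                           "scalerType": scaler_type,
                           "scalerValue": scaler_value}}
    resp = requests.post(URL, json={"query": MUTATION, "variables": variables})
    resp.raise_for_status()
    return resp.json()

# e.g. for bursty traffic: add workers once jobs have waited ~4 seconds
# set_scaler("cilhdgrs7rbzya", "QUEUE_DELAY", 4)
```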
flash-singh
flash-singh2mo ago
are you asking for 4x 4090s in 1 worker?
nerdylive
nerdylive2mo ago
I think he's asking about the scaling: when availability is high, how many workers can it scale up to, and why the loading time / cold starts are so high.
pxmwxd
pxmwxd2mo ago
Not cold-start time. The delay time is high; it can even reach ~600s.
nerdylive
nerdylive2mo ago
ok
flash-singh
flash-singh2mo ago
@pxmwxd can I ask why you need 4x 4090s in one worker? That will impact scale. Even if we have plenty of 4090s, wanting 4x will hurt scaling, since most machines are 2x or 4x and 8x ones are rare. What's likely happening during scale-up is that you're getting throttled; PM me the endpoint ID and I can check to make sure this is the case. 2x A6000 will give you easier scale: the higher you increase the GPU count per worker, the greater the chance of high delay times. I can also see if we can optimize this for you.
I've resolved the issue. For future reference to anyone else scaling this big: you will hit a $40/hr spending limit even for serverless, and the only way to increase that is reaching out to us so you can scale beyond it. This also means we need to do a better job of surfacing that, possibly in the logs.
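To make the spending-limit point concrete, a back-of-the-envelope sketch; the per-GPU hourly price below is a placeholder, not RunPod's actual rate:

```python
# Back-of-the-envelope: how many 4x-GPU workers fit under an hourly spend cap.
# price_per_gpu_hr is a placeholder value, NOT an actual RunPod price.
def max_concurrent_workers(spend_limit_hr: float,
                           gpus_per_worker: int,
                           price_per_gpu_hr: float) -> int:
    return int(spend_limit_hr // (gpus_per_worker * price_per_gpu_hr))

# With a $40/hr default limit, 4 GPUs per worker, and an assumed $0.50/GPU-hr,
# only ~20 workers can run at once -- far below a 150 max-worker setting.
print(max_concurrent_workers(40, 4, 0.50))  # -> 20
```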
marcchen955
marcchen9552mo ago
Is there any doc link about the $40/hr limitation? I'm researching modal.com as a replacement for RunPod. The first-priority item is the GPU concurrency limit (which on modal.com is 30 for pro users).
nerdylive
nerdylive2mo ago
Just reach out to them via the contact link in the website dashboard.
flash-singh
flash-singh2mo ago
We can increase that if needed; reach out to support.