Serverless doesn't scale
Endpoint id: cilhdgrs7rbzya
I have some requests which require workers with 4x RTX 4090s. The endpoint's "Max Workers" is 150 and the Scale Type is "Request Count" with a value of 1.
When I sent 78 requests concurrently, only ~20% of them started within 10 s; the P80 delay was ~600 s.
Is this because there aren't enough GPUs? When the stock status shows "Availability: High", how many workers can I expect it to scale up to in the meantime?
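In case anyone wants to reproduce this kind of measurement, here is a rough sketch using the runpod Python SDK: it submits 78 jobs concurrently and times how long each sits in the queue before a worker picks it up. The payload, API-key handling, and exact status strings are assumptions to check against the SDK docs, not the OP's actual client.
```python
# Sketch: measure serverless queue delay for a burst of concurrent jobs.
# Payload and status handling are illustrative assumptions, not the OP's workload.
import time
from concurrent.futures import ThreadPoolExecutor

import runpod

runpod.api_key = "YOUR_API_KEY"
endpoint = runpod.Endpoint("cilhdgrs7rbzya")

def queue_delay(_):
    """Submit one job and return the seconds it spent waiting in the queue."""
    submitted = time.time()
    job = endpoint.run({"input": {"prompt": "ping"}})  # placeholder input
    while job.status() == "IN_QUEUE":                  # IN_QUEUE -> IN_PROGRESS -> COMPLETED
        time.sleep(1)
    return time.time() - submitted

with ThreadPoolExecutor(max_workers=78) as pool:
    delays = sorted(pool.map(queue_delay, range(78)))

within_10s = sum(d <= 10 for d in delays) / len(delays)
p80 = delays[int(0.8 * len(delays)) - 1]               # approximate P80
print(f"started within 10s: {within_10s:.0%}, P80 queue delay: {p80:.0f}s")
```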
What's your worker status?
Are they throttled?
Try increasing your max workers if your workers are full.
And what do you run inside the worker? What kind of model?
I think using request count is great for handling a steady or predictable increase in request volume. Setting the count to 1 will immediately increase the workers, which I agree should work. However, for burst traffic, queue delay might work better. You can define the maximum wait time in the queue, ensuring that jobs don’t wait longer than that before they get processed.
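If it helps, switching the scaler is a one-field change. A rough sketch against the RunPod GraphQL API follows; the saveEndpoint input fields (scalerType, scalerValue, workersMax) are written from memory and the mutation may require additional fields, so treat this as an outline and verify against the current API docs.
```python
# Sketch: switch an endpoint from Request Count (value 1) to Queue Delay scaling
# via the RunPod GraphQL API. Field names and required inputs are assumptions.
import os
import requests

API_URL = "https://api.runpod.io/graphql"

MUTATION = """
mutation {
  saveEndpoint(input: {
    id: "cilhdgrs7rbzya",
    scalerType: "QUEUE_DELAY",  # was REQUEST_COUNT
    scalerValue: 4,             # aim for jobs waiting at most ~4 s in the queue
    workersMax: 150
  }) {
    id
    scalerType
    scalerValue
  }
}
"""

resp = requests.post(
    API_URL,
    params={"api_key": os.environ["RUNPOD_API_KEY"]},
    json={"query": MUTATION},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```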
Are you asking for 4x 4090s in one worker?
I think he's asking about the scaling: when availability is high, how many workers can it scale up to?
And why the loading time/cold starts are high.
Not cold start time. The delay time is high; it can even reach ~600 s.
ok
@pxmwxd can I ask why you need 4x 4090s in one worker? That will impact scale. Even if we have plenty of 4090s, requesting 4x will impact scaling, since most hosts come in 2x and 4x configurations and 8x ones are rare. What's likely happening during scaling is that you're getting throttled.
PM me the endpoint ID and I can check to make sure this is the case.
2x A6000 will give you easier scaling; the higher the GPU count per worker, the greater the chance of a high delay time. I can also see if we can optimize this for you.
I've resolved the issue. For future reference, anyone else scaling this big will hit the $40/hr spending limit, even on serverless; the only way to increase it is to reach out to us so you can scale beyond that. This also means we need to do a better job of surfacing this, possibly in the logs.
Is there any doc link about the $40/hr limitation? I'm researching modal.com as a replacement for RunPod, and the top-priority issue is the GPU concurrency limit (which on modal.com is 30 for Pro users).
Just reach out to them via the contact link.
The link is in the website dashboard.
We can increase that if needed; reach out to support.