Serverless: calculating capacity & ideal request count vs. queue delay values
How do you calculate whether a serverless worker is reaching its capacity, and what value should I set for request count? One of my serverless workers in production, which runs regular Oobabooga (not vLLM, so no concurrency), handled 110k requests yesterday without starting a new worker.
From what I have observed, my requests usually have about 1000 input tokens and 10-70 output tokens, and each one usually takes 2-5 seconds. Even if we assume an execution time of 1 second per request, a single worker should only be able to handle 86,400 requests per day.
How is it able to handle more than that without increasing the worker count, especially when each request takes 2-5 seconds?
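To make the arithmetic above concrete, here is a rough sketch of the capacity estimate. It assumes one request at a time per worker (no concurrency) and a constant execution time; the function names are made up for illustration and are not part of any platform API.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def max_requests_per_day(seconds_per_request: float, workers: int = 1) -> float:
    """Upper bound on daily throughput when each worker handles one request at a time."""
    return workers * SECONDS_PER_DAY / seconds_per_request

def min_workers_needed(requests_per_day: float, seconds_per_request: float) -> float:
    """Average number of busy workers required to sustain the given daily volume."""
    return requests_per_day * seconds_per_request / SECONDS_PER_DAY

if __name__ == "__main__":
    observed = 110_000  # requests per day reported in the question
    for secs in (1, 2, 5):
        cap = max_requests_per_day(secs)
        need = min_workers_needed(observed, secs)
        print(f"{secs}s/request: one worker caps at {cap:,.0f}/day, "
              f"{observed:,} requests need about {need:.2f} workers on average")
```

Under these assumptions, 110k requests in a day cannot fit on a single sequential worker even at 1 second per request, so either more than one worker was active at some point or the average execution time was shorter than observed.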
2 Replies
@flash-singh any idea?
If your max worker count is low, a good metric to look out for is delay time, which shows how long a request waits in the queue before a worker picks it up.
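As a rough illustration of that metric, the sketch below computes queue delay from your own request logs. It assumes you have recorded, for each request, the time it was enqueued and the time a worker started it; the `(enqueued_at, started_at)` tuple layout and the function names are hypothetical, not a platform API.

```python
from statistics import mean

def queue_delays(requests):
    """requests: iterable of (enqueued_at, started_at) timestamps in seconds."""
    return [started - enqueued for enqueued, started in requests]

def summarize(delays):
    """Return mean, max, and an approximate 95th-percentile delay in seconds."""
    ordered = sorted(delays)
    # Nearest-rank style index for the 95th percentile, clamped to the list bounds.
    p95_index = min(len(ordered) - 1, max(0, round(0.95 * len(ordered)) - 1))
    return {"mean": mean(ordered), "max": ordered[-1], "p95": ordered[p95_index]}

if __name__ == "__main__":
    # Made-up sample data purely for illustration.
    sample = [(0.0, 0.2), (1.0, 1.1), (2.0, 6.5), (3.0, 9.0)]
    print(summarize(queue_delays(sample)))
```

If the delay stays near zero, the current workers are keeping up; a delay that keeps growing suggests the queue is filling faster than the workers can drain it.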