Long latencies
I have a 7B model that is supposed to be very fast (it checks if a claim is supported by a context, and gives a yes/no answer).
If I rent an H100, I can process my prompt and get a response in about 100 ms (for a prompt of roughly 1,400 words). But on serverless, even a very short prompt (about 200 words) takes about 1.3 to 1.5 seconds.
I tried enabling "active workers", but that didn't help. Any tips on how to reduce the latency?
3 Replies
If active workers do not speed up your requests, then you likely have something misconfigured. How are you loading your model? Is it baked into the image, loaded from a network volume, or downloaded at runtime? Ideally it's loaded once when the worker starts, so every request reuses it (see the sketch below).
How is the response time when you request ~ 1400 words from your serverless worker?
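For reference, here is a minimal sketch of loading once per worker with the RunPod Python SDK and Hugging Face transformers. The model id, prompt format, and handler input shape are placeholders, and it assumes the weights are already baked into the image or cached on a volume rather than downloaded per request:

```python
import runpod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- swap in your own 7B claim-checking checkpoint.
MODEL_ID = "your-org/your-7b-claim-checker"

# Load once at import time, so every request handled by this worker
# reuses the already-loaded model instead of paying the load cost again.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")
model.eval()


def handler(job):
    # job["input"] is whatever you send in the request body's "input" field.
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=2)
    # Decode only the newly generated tokens (the yes/no answer).
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"answer": answer.strip()}


runpod.serverless.start({"handler": handler})
```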
Thanks. I am loading from Hugging Face.
Goes up only slightly, to about 1.6 seconds.
So there is some overhead.
There is going to be some overhead from proxying your request; I'm not sure whether 1.1 to 1.3 seconds of delay is just the cost of going through the API. You can look at the JSON returned and compare delayTime to executionTime. That should show you where the latency you are experiencing is occurring.
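If it helps, a rough sketch of pulling those two fields out of a synchronous request, assuming a RunPod-style /runsync endpoint (the endpoint ID, API key env var, and input payload here are placeholders):

```python
import os
import requests

# Placeholders -- fill in your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Context: ... Claim: ... Supported?"}},
    timeout=60,
)
data = resp.json()

# delayTime   = time spent queued / waiting for a worker (ms)
# executionTime = time the handler actually spent on the request (ms)
print("delayTime (ms):    ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
print("output:", data.get("output"))
```

If delayTime dominates, the slowdown is queueing/startup rather than your model; if executionTime dominates, look at how the worker loads and runs the model.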