Long latencies
I have a 7B model that is supposed to be very fast (it checks if a claim is supported by a context, and gives a yes/no answer).
If I rent an H100, I can process my prompt and get a response in about 100 ms (for a prompt of roughly 1,400 words). But on serverless, even a very short prompt (about 200 words) takes about 1.3 to 1.5 seconds.
I tried enabling "active workers", but that didn't help. Any tips on how to reduce the latency?
3 Replies
If active workers do not speed up your requests, then you likely have something misconfigured. How are you loading your model? Is it baked into the image, loaded from a network volume, or downloaded at runtime? Ideally it's loaded once when the worker starts, so every request reuses it (see the sketch below).
How is the response time when you request ~ 1400 words from your serverless worker?
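For reference, here is a minimal sketch of loading once per worker with the RunPod Python SDK and Hugging Face transformers. The model id, prompt format, and handler input shape are placeholders, and it assumes the weights are already baked into the image or cached on a volume rather than downloaded per request:

```python
import runpod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id -- swap in your own 7B claim-checking checkpoint.
MODEL_ID = "your-org/your-7b-claim-checker"

# Load once at import time, so every request handled by this worker
# reuses the already-loaded model instead of paying the load cost again.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")
model.eval()


def handler(job):
    # job["input"] is whatever you send in the request body's "input" field.
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=2)
    # Decode only the newly generated tokens (the yes/no answer).
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return {"answer": answer.strip()}


runpod.serverless.start({"handler": handler})
```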
Thanks. I am loading from Hugging Face.
Goes up only slightly, to about 1.6 seconds.
So there is some overhead.
There is going to be some overhead from proxying your request; I'm not sure whether 1.1 to 1.3 seconds of delay is just the cost of going through the API. You can look at the JSON returned and compare delayTime to executionTime. That should show you where the latency you are experiencing is occurring.
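If it helps, a rough sketch of pulling those two fields out of a synchronous request, assuming a RunPod-style /runsync endpoint (the endpoint ID, API key env var, and input payload here are placeholders):

```python
import os
import requests

# Placeholders -- fill in your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Context: ... Claim: ... Supported?"}},
    timeout=60,
)
data = resp.json()

# delayTime   = time spent queued / waiting for a worker (ms)
# executionTime = time the handler actually spent on the request (ms)
print("delayTime (ms):    ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
print("output:", data.get("output"))
```

If delayTime dominates, the slowdown is queueing/startup rather than your model; if executionTime dominates, look at how the worker loads and runs the model.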