RunPod · 4mo ago
madiator

Long latencies

I have a 7B model that should be very fast (it checks whether a claim is supported by a context and gives a yes/no answer). If I rent an H100 directly, I can process my prompt and get a response in about 100ms (for a prompt of roughly 1,400 words). But a much shorter prompt (about 200 words) takes about 1.3 to 1.5 seconds when going through serverless. I tried enabling "active workers" but that didn't help. Any tips on how to reduce the latency?
3 Replies
Encyrption · 4mo ago
If active workers do not speed up your requests, then you likely have something misconfigured. How are you loading your model? Is it baked into the image, loaded from a network volume, or downloaded at runtime? And how does the response time compare when you send the ~1,400-word prompt to your serverless worker?
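For context, the usual way to make active/warm workers pay off is to load the model once at module import rather than inside the handler. A minimal sketch of that pattern, assuming the standard RunPod Python SDK (`runpod.serverless.start`) and a Hugging Face transformers model; the model name and generation settings are placeholders:

```python
# handler.py -- minimal sketch; model name and generation logic are placeholders.
import runpod
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load once at module import, not inside the handler, so warm/active
# workers reuse the weights instead of reloading them on every request.
MODEL_NAME = "your-org/your-7b-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to("cuda")

def handler(job):
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Only a few new tokens are needed for a yes/no verdict.
    output = model.generate(**inputs, max_new_tokens=4)
    return {"answer": tokenizer.decode(output[0], skip_special_tokens=True)}

runpod.serverless.start({"handler": handler})
```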
madiator (OP) · 4mo ago
Thanks. I am loading from Hugging Face. It only goes up slightly, to about 1.6 seconds, so there is some overhead.
Encyrption · 4mo ago
There is going to be some overhead from proxying your request; 1.1 to 1.3 seconds of delay might just be the cost of going through the API. You can look at the JSON returned and compare delayTime with executionTime. That should show you where the latency you are experiencing is occurring.
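For reference, a minimal sketch of pulling those two fields out of a `/runsync` response with Python `requests`; the endpoint ID, API key, and input payload are placeholders:

```python
# Sketch: compare queue/proxy delay vs. handler execution time for one request.
# ENDPOINT_ID, API_KEY, and the input payload are placeholders.
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Claim: ... Context: ..."}},
    timeout=60,
)
data = resp.json()

# delayTime  = time the job spent queued/routed before a worker picked it up (ms)
# executionTime = time your handler actually spent processing the job (ms)
print("delayTime (ms):    ", data.get("delayTime"))
print("executionTime (ms):", data.get("executionTime"))
```

If delayTime dominates, the latency is in queueing/routing rather than in your model, which is where active workers and configuration matter; if executionTime dominates, the handler itself is the bottleneck.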