Created by madiator on 8/21/2024 in #⚡|serverless
Long latencies
I have a 7B model that is supposed to be very fast (it checks whether a claim is supported by a context and gives a yes/no answer). If I rent an H100, I can process my prompt and get a response in about 100 ms (for a prompt of roughly 1,400 words). But a much shorter prompt (about 200 words) takes 1.3 to 1.5 seconds on serverless. I tried adding "active workers", but that didn't help. Any tips on how to reduce the latency?
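For reference, the usual RunPod serverless pattern is to load the model once at module import time, outside the handler, so a warm or active worker keeps it in GPU memory instead of reloading it per request. A minimal sketch of that pattern is below; the model name and generation call are placeholders, not the actual deployment:

```python
import runpod
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder for the actual 7B claim-checking model.
MODEL_ID = "my-org/claim-checker-7b"

# Load once at import time so a warm (active) worker keeps the model
# in GPU memory; loading inside the handler would add seconds per call.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)

def handler(job):
    prompt = job["input"]["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # Only a short yes/no answer is needed, so cap new tokens.
        out = model.generate(**inputs, max_new_tokens=2)
    answer = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(answer, skip_special_tokens=True)

runpod.serverless.start({"handler": handler})
```

With this layout, an active worker's remaining latency should be queueing plus inference, so if requests still take over a second, the time is presumably going to request routing/queueing rather than model loading.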