RunPod • 2mo ago
bo

What TTFT times should we be able to reach?

Of course this depends on input token count, hardware selection, etc. But for the life of me, I cannot get a TTFT (time to first token) under 2000 ms on serverless. I'm using Llama 3.1 8B / Gemma / Mistral on 48 GB GPU workers. For performance evaluation I use GuideLLM, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see a TTFT of 2000-2500 ms. I should add that I'm running GuideLLM from a local Python script against the serverless endpoint, so the numbers include the network round trip from my machine. Has anyone observed quicker times?
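For reference, here is a minimal sketch of how TTFT can be timed by hand on any streamed response, independent of GuideLLM. It assumes only that the client exposes the response as an iterable of text chunks; `measure_ttft` and `fake_stream` are hypothetical names for illustration, and `fake_stream` stands in for a real endpoint. Measured from a local machine, the figure includes network latency and any queueing or worker start-up time, not just model prefill.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], float, int]:
    """Return (ttft_seconds, total_seconds, chunk_count) for one streamed response.

    The timer starts before the first chunk is requested, so for a lazy
    stream the time spent sending the request is included in TTFT.
    """
    start = time.perf_counter()
    ttft: Optional[float] = None
    chunks = 0
    for chunk in stream:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttft is None:
            ttft = time.perf_counter() - start  # first non-empty chunk
        chunks += 1
    total = time.perf_counter() - start
    return ttft, total, chunks


def fake_stream(first_delay: float, gap: float, n: int) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint (no network involved)."""
    time.sleep(first_delay)  # simulates queueing + prefill + network
    for i in range(n):
        if i:
            time.sleep(gap)  # simulates per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, chunks = measure_ttft(fake_stream(0.05, 0.01, 5))
    print(f"TTFT: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms, chunks: {chunks}")
```

Running the same timing once from a local machine and once from a machine close to the endpoint would help separate network and queueing overhead from the model's own time to first token.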
1 Reply
yhlong00000 • 2mo ago
Maybe try different GPU types? 48 GB Pro, 80 GB, 80 GB Pro.