What TTFT (time to first token) should we be able to reach?
Of course this depends on input token count, hardware selection, etc. But for the life of me, I cannot get a TTFT under 2000 ms on serverless.
I'm using Llama 3.1 7B / Gemma / Mistral on 48 GB GPU workers.
For performance evaluation I use GuideLLM, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see a TTFT of 2000-2500 ms.
I should add that I'm running GuideLLM from a local Python script against the serverless endpoint, so the numbers include the network round trip. Has anyone observed quicker times?
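For reference, here is roughly how TTFT can be checked by hand, independent of the benchmarking tool. This is only a minimal sketch: `measure_ttft` and `fake_stream` are made-up names for illustration, and the fake stream stands in for whatever streaming response iterator your client gives you. It times the gap between starting to read the stream and the first non-empty chunk, so when measured from a local machine it includes network and any queueing delay, not just model prefill.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Consume a stream of text chunks and return
    (ttft_ms, total_ms, chunk_count), timed from the start of iteration."""
    start = time.perf_counter()
    first_chunk_at = None
    count = 0
    for chunk in stream:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_chunk_at is None:
        raise ValueError("stream produced no non-empty chunks")
    return (first_chunk_at - start) * 1000.0, (end - start) * 1000.0, count


def fake_stream(first_delay_s: float, n_chunks: int, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming endpoint: waits first_delay_s
    before the first chunk, then emits n_chunks chunks gap_s apart."""
    time.sleep(first_delay_s)
    for i in range(n_chunks):
        yield f"tok{i} "
        time.sleep(gap_s)


if __name__ == "__main__":
    ttft_ms, total_ms, n = measure_ttft(fake_stream(0.05, 5, 0.001))
    print(f"TTFT: {ttft_ms:.0f} ms, total: {total_ms:.0f} ms, chunks: {n}")
```

Swapping `fake_stream(...)` for the real streaming iterator, and running the same script once locally and once from a machine close to the endpoint, should show how much of the 2000 ms is round trip or queueing rather than the model itself.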
1 Reply
Maybe try different GPU types? For example 48 GB Pro, 80 GB, or 80 GB Pro.