bo Posts - Answer Overflow

•Created by bo on 11/17/2024 in #⚡｜serverless

What TTFT (time to first token) should we be able to reach?

Of course this depends on input token count, hardware selection, and so on, but for the life of me I cannot get a TTFT under 2000 ms on serverless. I'm using Llama 3.1 7B / Gemma / Mistral on 48 GB GPU workers. For performance evaluation I use GuideLLM, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see a TTFT of 2000-2500 ms. I should add that I'm running GuideLLM from a local Python script against the serverless endpoint. Has anyone observed quicker times?
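For anyone who wants to cross-check numbers like these outside a benchmark tool, here is a minimal sketch of how client-side TTFT can be timed around any streamed response. It is not how GuideLLM measures it. `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` just sleeps to stand in for a real streaming endpoint. Since the clock starts on the client, the result includes network round-trip and any queueing or cold-start delay in front of the model, not only prefill time.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    start_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], float, int]:
    """Time one streamed request from the client side.

    Returns (ttft_ms, total_ms, chunk_count). ttft_ms is None if the
    stream produced no non-empty chunk.
    """
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    chunks = 0
    for chunk in start_stream():
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttft_ms is None:
            # First non-empty chunk: this is the time to first token.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        chunks += 1
    total_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, total_ms, chunks


def fake_stream(
    first_token_delay_s: float = 0.05,
    n_tokens: int = 5,
    inter_token_delay_s: float = 0.01,
):
    """Hypothetical stand-in for a streaming endpoint (assumption, not a real API)."""
    time.sleep(first_token_delay_s)  # simulated queueing + prefill + network
    for i in range(n_tokens):
        if i:
            time.sleep(inter_token_delay_s)  # simulated decode step
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream)
    print(f"TTFT: {ttft:.1f} ms, total: {total:.1f} ms, chunks: {n}")
```

To use it against a real endpoint, replace `fake_stream` with a function that opens the streaming request and yields text chunks as they arrive. Running the same measurement from a machine close to the endpoint's region would show how much of the 2000 ms is network distance rather than the worker itself.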

2 replies
