RunPod
Created by bo on 11/17/2024 in #⚡|serverless
What TTFT times should we be able to reach?
Of course this depends on input token count, hardware selection, etc. But for the life of me, I cannot get a time to first token (TTFT) under 2000 ms on serverless. I'm using Llama 3.1 7b / Gemma / Mistral on 48 GB GPU workers. For performance evaluation I use GuideLLM, which tests different throughput scenarios (continuous, small, large). Even with 50 input tokens and 100 output tokens I see a TTFT of 2000-2500 ms. I should add that I'm running GuideLLM from a local Python script against the serverless endpoint. Has anyone observed quicker times?
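For anyone comparing numbers: the figure above is measured on the client, so it presumably includes the network round trip and any queueing on the endpoint, not only model prefill. Below is a minimal sketch of what a client-side TTFT measurement amounts to. It is not how GuideLLM measures it internally; `measure_ttft` and `fake_stream` are hypothetical names, and the fake stream just stands in for a streaming response from the endpoint.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (ttft_ms, total_ms, chunk_count) for an iterable of streamed chunks.

    The clock starts when this function is called, so the stream should be
    lazy (the request is only sent once iteration begins).
    """
    start = time.perf_counter()
    ttft_ms = float("nan")  # stays NaN if no non-empty chunk ever arrives
    count = 0
    for chunk in stream:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if count == 0:
            # First non-empty chunk: this is the time to first token.
            ttft_ms = (time.perf_counter() - start) * 1000.0
        count += 1
    total_ms = (time.perf_counter() - start) * 1000.0
    return ttft_ms, total_ms, count


def fake_stream(first_delay_s: float, n_tokens: int, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming response (assumption, not a real API)."""
    time.sleep(first_delay_s)  # simulates network + queue + prefill delay
    for i in range(n_tokens):
        if i:
            time.sleep(gap_s)  # simulates per-token decode time
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, total, n = measure_ttft(fake_stream(0.05, 5, 0.01))
    print(f"TTFT {ttft:.0f} ms, total {total:.0f} ms, {n} chunks")
```

Running the same measurement from a machine close to the endpoint's region and comparing it with the local result would give a rough idea of how much of the 2000 ms is network and queueing rather than the model itself.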
2 replies