Delay on startup: How long for low usage?
I am trying to gauge the actual cold start time for a 7B LLM deployed with vLLM.
My ideal configuration is something like this: 0 active workers, 5 requests/hour, and 100 to 200 seconds of generation time.
How long would it take for RunPod to do a cold start, including delay time and everything else? Essentially, what are the min, avg, and max time to first token?
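In case it helps frame the answer, here is a minimal sketch of how the min/avg/max time to first token could be measured from the client side. It assumes the endpoint streams its output as an iterable of text chunks. `fake_stream` is a hypothetical stand-in that simulates a cold start with a sleep; it is not a RunPod or vLLM API, and the delays used are made-up placeholders, not real measurements.

```python
import time
from statistics import mean
from typing import Iterable, Iterator


def time_to_first_token(start: float, chunks: Iterable[str]) -> float:
    """Seconds from `start` (taken just before sending the request)
    until the first non-empty chunk arrives."""
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing a token")


def summarize(samples: list[float]) -> dict[str, float]:
    """Reduce a list of measured delays to min / avg / max."""
    return {"min": min(samples), "avg": mean(samples), "max": max(samples)}


def fake_stream(cold_start_s: float) -> Iterator[str]:
    # Hypothetical stand-in for a real streaming response:
    # sleeps to imitate cold start + queue delay, then yields tokens.
    time.sleep(cold_start_s)
    yield "Hello"
    yield " world"


if __name__ == "__main__":
    samples = []
    for delay in (0.05, 0.10, 0.20):  # placeholder delays, not real data
        start = time.perf_counter()
        samples.append(time_to_first_token(start, fake_stream(delay)))
    print(summarize(samples))
```

To use it against a real deployment, `fake_stream` would be swapped for whatever streaming client the endpoint supports, with requests spaced far enough apart that each one lands on a scaled-down worker.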