Anyone get vLLM working with reasonable response times?
Seems that no matter how I configure serverless with vLLM, the workers are slow to pick up tasks, and even with warm containers, tasks sit in the queue for minutes for no obvious reason. Has anyone actually been able to use serverless vLLM for a production use case?
3 Replies
@Anders
Escalated To Zendesk
The thread has been escalated to Zendesk!
In my experience: I've been using it for some testing, but not much yet. I ran into this before, but only after making some changes to the endpoint (a long time ago).
Try opening a ticket and let the support staff check it.
I've spent too much time optimizing vLLM for that. Even though I'm pushing tokens/s above the official benchmarks for the model and hardware combination, there is some overhead I can't do anything about: frequent worker shifts causing cold starts, varying speeds across data centres and across requests, and especially the delay time even when warm, which can be as long as the execution time itself. I think serverless is suitable for starting out or for smaller LLM projects. Go dedicated or self-host for big ones.
But even then, the delay times are a few seconds for me. They shouldn't be minutes as you describe. Which region are you using?
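To put numbers on the "delay can be as long as the execution itself" observation above, here is a minimal sketch for quantifying per-request overhead. It assumes a RunPod-style job status payload that reports `delayTime` (queue + cold start) and `executionTime` (inference) in milliseconds; if your platform names these fields differently, adjust accordingly.

```python
def overhead_ratio(status: dict) -> float:
    """Fraction of total wall time spent waiting (queue + cold start)
    rather than executing, from a serverless job status payload."""
    delay = status.get("delayTime", 0)          # ms queued / cold-starting
    execution = status.get("executionTime", 0)  # ms running inference
    total = delay + execution
    return delay / total if total else 0.0

# Hypothetical "warm" request where delay roughly matched execution time,
# as described in the reply above: about half the wall time is overhead.
sample = {"delayTime": 4200, "executionTime": 4100}
print(f"overhead: {overhead_ratio(sample):.0%}")
```

Logging this ratio per request across regions makes it easy to see whether you are hitting a few-seconds delay (expected) or the minutes-long queueing the original post describes.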