vLLM streaming ends prematurely
I'm having issues with my vLLM worker ending a generation early. When I send a prompt to my API without "stream": true, the full response comes back. When "stream": true is added to the request, the generation stops early, sometimes right after the {"role":"assistant"} delta is sent. It was working earlier this morning. I see this in the system logs around the time it stopped working:
2024-06-13T15:37:10Z create pod network
2024-06-13T15:37:10Z create container runpod/worker-vllm:stable-cuda12.1.0
2024-06-13T15:37:11Z start container
Was a newer version pushed? I see that two new updates were pushed to the worker-vllm GitHub repo in the last 24 hours.
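For illustration, here is a minimal sketch of the kind of call I'm making (the OpenAI-compatible base URL, API key, endpoint ID, and model name below are placeholders/assumptions, not my exact setup):

# Minimal repro sketch: same prompt with and without streaming.
# RUNPOD_API_KEY, ENDPOINT_ID, and the base URL are assumed placeholders.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url=f"https://api.runpod.ai/v2/{os.environ['ENDPOINT_ID']}/openai/v1",
)
messages = [{"role": "user", "content": "Hi!"}]

# Non-streaming: the full completion comes back in one response.
resp = client.chat.completions.create(model="my_model", messages=messages, stream=False)
print(resp.choices[0].message.content)

# Streaming: tokens arrive as delta chunks; this is the path that cuts off early.
stream = client.chat.completions.create(model="my_model", messages=messages, stream=True)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()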
cc: @Alpay Ariyak
Could you share the full output?
Were you streaming with OpenAI compatibility or not?
I'm using the default environment variables, so OpenAI compatibility should be enabled.
So here's my request
{
"model": "my_model",
"messages": [
{
"role": "user",
"content": "Hi!"
}
],
"stream": true/false
}
When stream:false
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Hi! How can I help you today?",
"role": "assistant"
},
"stop_reason": null
}
],
"created": 1718310772,
"id": "cmpl-edf2da6230e14a84b6b25861f29591d9",
"model": "S",
"object": "chat.completion",
"usage": {
"completion_tokens": 10,
"prompt_tokens": 13,
"total_tokens": 23
}
}
When stream:true
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}]}
Let me know what else I can supply to help
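If it helps, this is roughly how I'm reading those data: chunks on the client side (a sketch with placeholder URL and API key, not my exact client):

# Sketch of consuming the SSE stream above; the URL and API key are placeholders.
import json
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1/chat/completions"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}
payload = {
    "model": "my_model",
    "messages": [{"role": "user", "content": "Hi!"}],
    "stream": True,
}

with requests.post(url, json=payload, headers=headers, stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        # The first chunk carries {"role": "assistant"}; later chunks carry "content".
        if "content" in delta:
            print(delta["content"], end="", flush=True)
print()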
After you send the streaming request and it finishes, can you go to the console and check the status of that request? It should show the full output from the worker. I need to see if it's also cut off there.
The request is too long to paste here.
So in the console/requests log, it looks like the full generation completed.
It says "Hello! How can I assist you today?", which completes the partial output that Postman received.
Okay, that's great to know, so the issue is outside of the worker.
We're still looking into this.
Can you share your entire endpoint configuration?
And your endpoint ID, please?
Is there an easy way for me to export the configuration?
I have these two:
vllm-nutty_teal_junglefowl
vllm-kejv5lkoiilruc
I'm not sure which is the endpoint ID
Also, thank you so much for the help
The second one. I agree it's confusing to tell which one is the ID, haha.
Of course!
Can you see the endpoint configuration from the ID?
Or should I try to copy all of the settings across?
Please do for now; I don't have access to the settings at the moment.
I'm not sure which settings are important, but:
24 GB GPU
3 workers, 1 GPU per worker
5-second idle timeout
FlashBoot enabled
CA-MTL datacenters
CUDA versions 12.1, 12.2, 12.3, 12.4 allowed
4-second queue delay
L4, A5000, 3090 GPU types
For the endpoint template:
30 GB container disk
MODEL_NAME: my_model
BASE_PATH: /runpod-volume
HF_TOKEN: my_token
Those are all the environment variables that are set.
Is CA-MTL-1 a requirement for you?
This seems isolated to that datacenter and US-OR.
All other datacenters are good.
Oh
No, CA-MTL-1 is not a requirement
Ideally, I would like to be in Canada.
This was fixed!
Sorry for the delay
WOOHOO
Thanks!
I wonder what happened
me 2
me 2
It's better to say what the problem was, what was done to fix it, etc. Just saying it's fixed is honestly not good enough.
It caused me a lot of grief. I'm very glad it's fixed, but it would be great to get more details and to hear what the mitigation plan will be going forward.