vLLM streaming ends prematurely

I'm having issues with my vLLM worker ending a generation early. When I send a prompt to my API without "stream": true, the full response comes back. When I add "stream": true to the same request, the response stops early, sometimes right after the {"role":"assistant"} chunk is sent.

It was working earlier this morning. I see this in the system logs around the time it stopped working:

2024-06-13T15:37:10Z create pod network
2024-06-13T15:37:10Z create container runpod/worker-vllm:stable-cuda12.1.0
2024-06-13T15:37:11Z start container

Was a newer version pushed? I see that two new updates were pushed to the vllm_worker GitHub repo in the last 24 hours.
20 Replies
haris (4w ago)
cc: @Alpay Ariyak
Alpay Ariyak (4w ago)
Could you share the full output? Were you streaming with OpenAI compatibility or not?
shensmobile (4w ago)
I'm using the default environment variables, so OpenAI compatibility should be enabled. Here's my request ("stream" is set to either true or false):

{
  "model": "my_model",
  "messages": [ { "role": "user", "content": "Hi!" } ],
  "stream": true/false
}

When stream:false

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": { "content": "Hi! How can I help you today?", "role": "assistant" },
      "stop_reason": null
    }
  ],
  "created": 1718310772,
  "id": "cmpl-edf2da6230e14a84b6b25861f29591d9",
  "model": "S",
  "object": "chat.completion",
  "usage": { "completion_tokens": 10, "prompt_tokens": 13, "total_tokens": 23 }
}

When stream:true

data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}]}
data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}]}

Let me know what else I can supply to help.
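For anyone trying to confirm the same symptom on the client side, here is a minimal Python sketch that reassembles the streamed text and checks whether the stream ended cleanly. It assumes the usual OpenAI-style streaming convention, where the last content chunk carries a non-null finish_reason and the stream closes with a "data: [DONE]" line. The function name summarize_sse and the shortened sample chunks are made up for illustration; they are not part of the worker or of any library.

```python
import json


def summarize_sse(lines):
    """Reassemble streamed text and report whether the stream ended cleanly.

    Hypothetical helper: assumes OpenAI-style chunks on "data:" lines and a
    terminating "data: [DONE]" sentinel.
    """
    text_parts = []
    finish_reason = None
    saw_done = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            text_parts.append(delta.get("content") or "")
            finish_reason = choice.get("finish_reason") or finish_reason
    return {
        "text": "".join(text_parts),
        "finish_reason": finish_reason,
        "saw_done": saw_done,
        "complete": finish_reason is not None and saw_done,
    }


# Sample chunks modeled on the truncated stream above (fields trimmed for brevity).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" How"},"finish_reason":null}]}',
]

result = summarize_sse(sample)
print(result)
```

If "complete" comes back False while the console shows the full generation, that points at something between the worker and the client rather than at the model itself.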
Alpay Ariyak (4w ago)
After you send the streaming request and it finishes, can you go to the console and check the status of that request? It should show the full output from the worker. We need to see if it's also cut off there.
shensmobile (4w ago)
The request is too long to paste here.
shensmobile (4w ago)
So in the console/requests log, it looks like the full generation completed. It says "Hello! How can I assist you today?", which completes what Postman received.
Alpay Ariyak (4w ago)
Okay, that's great to know, so the issue is outside of the worker. We're still looking into this.
Can you share your entire endpoint configuration and your endpoint ID, please?
shensmobile (4w ago)
Is there an easy way for me to export the configuration? I have these two:

vllm-nutty_teal_junglefowl
vllm-kejv5lkoiilruc

I'm not sure which is the endpoint ID. Also, thank you so much for the help.
Alpay Ariyak (4w ago)
The second one. I agree it's confusing to tell which is the ID, haha.
Of course!
shensmobile (4w ago)
Can you see the endpoint configuration from the ID? Or should I try to copy all of the settings across?
Alpay Ariyak (4w ago)
Please do for now, I don't have access to the settings atm.
shensmobile (4w ago)
I'm not sure which settings are important, but:

24 GB GPU
3 workers, 1 GPU/worker
5 second idle timeout
Flashboot enabled
CA-MTL datacenters
CUDA versions allowed: 12.1, 12.2, 12.3, 12.4
4 seconds queue delay
GPU types: L4, A5000, 3090

For the endpoint template:

30 GB container disk
MODEL_NAME: my_model
BASE_PATH: /runpod-volume
HF_TOKEN: my_token

Those are all the environment variables that are set.
Alpay Ariyak (4w ago)
Is CA-MTL-1 a requirement for you? This seems isolated to that and US-OR. All others are good.
shensmobile (4w ago)
Oh no, CA-MTL-1 is not a requirement. Ideally I would like to be in Canada.
Alpay Ariyak (4w ago)
This was fixed! Sorry for the delay.
shensmobile (4w ago)
WOOHOO Thanks! I wonder what happened.
nerdylive (4w ago)
me 2
digigoblin (4w ago)
me 2
It's better to say what the problem was, what was done to fix it, etc. Just saying it's fixed is honestly not good enough.
shensmobile (4w ago)
It caused me a lot of grief. I'm very glad it's fixed, but it would be great to get more details and what the mitigation plan will be in the future.