RunPod•11mo ago

vLLM streaming ends prematurely

I'm having issues with my vLLM worker ending a generation early. When I send a prompt to my API without "stream": true, the full response comes back. When I add "stream": true, the response stops early, sometimes right after {"role":"assistant"} is sent. It was working earlier this morning. I see this in the system logs around the time it stopped working:

2024-06-13T15:37:10Z create pod network
2024-06-13T15:37:10Z create container runpod/worker-vllm:stable-cuda12.1.0
2024-06-13T15:37:11Z start container

Was a newer version pushed? I see that two new updates were pushed to the vllm_worker GitHub repo in the last 24 hours.


haris•11mo ago

cc: @Alpay Ariyak

Alpay Ariyak•11mo ago

Could you share the full output? Were you streaming with OpenAI compatibility or not?

shensmobileOP•11mo ago

I'm using the default environment variables, so OpenAI compatibility should be enabled. Here's my request:

{
  "model": "my_model",
  "messages": [
    { "role": "user", "content": "Hi!" }
  ],
  "stream": true/false
}

When stream: false:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "Hi! How can I help you today?",
        "role": "assistant"
      },
      "stop_reason": null
    }
  ],
  "created": 1718310772,
  "id": "cmpl-edf2da6230e14a84b6b25861f29591d9",
  "model": "S",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 10,
    "prompt_tokens": 13,
    "total_tokens": 23
  }
}

When stream: true:

data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"role":"assistant"},"logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"Hello"},"logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":"!"},"logprobs":null,"finish_reason":null}]}

data: {"id":"cmpl-a2dcf314291e45bcbb49e999c2218211","object":"chat.completion.chunk","created":1718310788,"model":"S","choices":[{"index":0,"delta":{"content":" How"},"logprobs":null,"finish_reason":null}]}

Let me know what else I can provide to help.
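For anyone reproducing this: each streamed chunk is a server-sent event line of the form data: {...}, and an OpenAI-style stream normally ends with a chunk whose finish_reason is set (e.g. "stop"), followed by data: [DONE]. The sketch below uses only the Python standard library; the function name reassemble_stream is hypothetical and not part of vLLM or RunPod. It rebuilds the text from such lines and reports whether either end-of-stream signal arrived. The sample lines are the chunks from this post with the id/object/created/model fields dropped for brevity.

```python
import json

def reassemble_stream(lines):
    """Rebuild assistant text from OpenAI-style SSE lines (hypothetical helper).

    Returns (text, finish_reason, saw_done). A stream that ends with
    finish_reason None and no [DONE] sentinel was most likely cut off.
    """
    parts = []
    finish_reason = None
    saw_done = False
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank event separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True  # explicit end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # The first chunk carries only {"role": "assistant"}, so content may be absent.
            parts.append(delta.get("content") or "")
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason, saw_done

# Chunks as received in the stream: true case above (trimmed fields).
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" How"},"finish_reason":null}]}',
]
text, reason, done = reassemble_stream(sample)
print(repr(text), reason, done)  # 'Hello! How' None False
```

The cut-off stream yields 'Hello! How' with no finish_reason and no [DONE], which is how a truncated stream looks from the client side.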

Alpay Ariyak•11mo ago

After you send the streaming request and it finishes, can you go to the console and check the status of that request? It should show the full output from the worker. I need to see if it's also cut off there.

shensmobileOP•11mo ago

The request is too long to paste here.

shensmobileOP•11mo ago

stream_true_request_...

shensmobileOP•11mo ago

So in the console/requests log, it looks like the full generation completed. It says "Hello! How can I assist you today?", which is the complete version of what Postman received.
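The check done here, comparing what the client received with the worker's own output for the same request, can be written as a small helper. This is a hypothetical sketch, not a RunPod tool: if the worker's text is complete and the streamed text is only a prefix of it, the text was lost somewhere between the worker and the client, which matches the conclusion in the next reply.

```python
def locate_truncation(streamed: str, worker_output: str) -> str:
    """Classify where text went missing (hypothetical diagnostic helper)."""
    if streamed == worker_output:
        return "complete"
    if worker_output.startswith(streamed):
        # The worker produced everything; the tail was lost on the way to the client.
        return "lost downstream of worker"
    # The texts diverge, so the problem is not a simple cut-off in transit.
    return "worker output differs"

print(locate_truncation("Hello! How", "Hello! How can I assist you today?"))
# lost downstream of worker
```

An empty streamed string (the case where only the {"role":"assistant"} chunk arrives) is also a prefix of the worker output, so it falls into the same "lost downstream" bucket.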

Alpay Ariyak•11mo ago

Okay, that's great to know. So the issue is outside of the worker; we're still looking into this. Can you share your entire endpoint configuration, and your endpoint ID please?

shensmobileOP•11mo ago

Is there an easy way for me to export the configuration? I have these two:

vllm-nutty_teal_junglefowl
vllm-kejv5lkoiilruc

I'm not sure which is the endpoint ID. Also, thank you so much for the help!

Alpay Ariyak•11mo ago

The second one. I agree, it's confusing to tell which is the ID, haha. Of course!
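The endpoint ID, not the display name, is what goes into the request URL. A minimal standard-library sketch of building the streaming request, assuming RunPod's OpenAI-compatible route for worker-vllm has the form https://api.runpod.ai/v2/<endpoint_id>/openai/v1; the endpoint ID, API key, and the build_chat_request name are placeholders. It only constructs the request object; sending it (for example with urllib.request.urlopen) and reading the stream is left out.

```python
import json
import urllib.request

# Assumption: OpenAI-compatible base URL for a RunPod worker-vllm endpoint.
BASE_URL = "https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

def build_chat_request(endpoint_id, api_key, model, content, stream=True):
    """Build (but do not send) a chat completion request (hypothetical helper)."""
    url = BASE_URL.format(endpoint_id=endpoint_id) + "/chat/completions"
    body = {
        "model": model,  # matches MODEL_NAME in the endpoint template
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", "my_model", "Hi!")
print(req.full_url)
# https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1/chat/completions
```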

shensmobileOP•11mo ago

Can you see the endpoint configuration from the ID? Or should I try to copy all of the settings across?

Alpay Ariyak•11mo ago

Please do for now; I don’t have access to the settings at the moment.

shensmobileOP•11mo ago

I'm not sure which settings are important, but here they are:

Endpoint:
24 GB GPU
3 workers, 1 GPU/worker
5-second idle timeout
FlashBoot enabled
CA-MTL data centers
CUDA versions allowed: 12.1, 12.2, 12.3, 12.4
4-second queue delay
GPU types: L4, A5000, 3090

Endpoint template:
30 GB container disk
MODEL_NAME: my_model
BASE_PATH: /runpod-volume
HF_TOKEN: my_token

Those are all the environment variables that are set.

Alpay Ariyak•11mo ago

Is CA-MTL-1 a requirement for you? This seems isolated to that data center and US-OR. All others are good.

shensmobileOP•11mo ago

Oh no, CA-MTL-1 is not a requirement, though optimally I would like to be in Canada.

Alpay Ariyak•11mo ago

This was fixed! Sorry for the delay

shensmobileOP•11mo ago

WOOHOO, thanks! I wonder what happened.