Which version of vLLM is installed on Serverless?
There is currently a bug in vLLM that causes Llama 3 not to use stop tokens correctly.
This has been fixed in v0.4.1.
https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550
I was wondering which version of vLLM is running on Serverless. Thanks
I can clearly replicate this issue on vLLM Serverless.
e.g. the stop tokens don't work and keep getting generated in the output: <|eot_id|><|start_header_id|>assistant<|end_header_id|>
This is a showstopper for us, as we need to host Llama 3 in a functioning state. As it stands, unless RunPod upgrades its vLLM to v0.4.1, this bug will recur every time.
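For anyone else trying to reproduce this, a minimal sketch along the following lines can confirm whether stop tokens are honoured. The endpoint ID is a placeholder, and the `runsync` route plus the `sampling_params` pass-through are assumptions based on the worker-vllm input schema, so adjust to your setup:
```python
# Hedged repro sketch: send a Llama 3 prompt with an explicit stop token to a
# RunPod serverless vLLM endpoint and check whether generation actually stops.
import os
import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # hypothetical placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

payload = {
    "input": {
        "prompt": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
                  "Say hello.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "sampling_params": {
            "max_tokens": 128,
            # Llama 3 should stop at <|eot_id|>; on pre-0.4.1 vLLM builds the
            # token kept appearing in the output instead of terminating it.
            "stop": ["<|eot_id|>"],
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(resp.json())
```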
Responded on Zendesk
Having the same issue..
I'm going to share the Zendesk answer here so more people can benefit from it:
The worker-vllm might be 0.3.3, but does it contain the latest vLLM (0.4.1)? I suppose we need to ping the maintainer of worker-vllm for that.
Can someone from RunPod confirm that at least the worker is always pulling the latest or is it fixed on 0.3.3? Thanks
I think it's fixed to that version yeah
I'll try to make a pull request later
Thank you, that would be amazing!!!
Should I also create a feedback request for this here?
Hmm, I see there's a PR already for updating the version, last time I checked
I don't know if it's merged yet
Let's just wait
Sure, thanks.
Guys, this is a pretty serious issue... One of the main models out there can't be served by RunPod.
You can serve it, just not with the vllm worker.
It's better to log issues on GitHub for this kind of thing than Discord.
I see there is already an issue for it - https://github.com/runpod-workers/worker-vllm/issues/66
@digigoblin there should be an update out today or tomorrow
Thank you!
Pretty please
had a few blockers, releasing later tonight or tomorrow afternoon
@Alpay Ariyak hey, i've been using worker-vllm for months now and the streaming broke all of a sudden
no changes on my end
not sure if runpod is down
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"
It's not from RunPod, it's a problem with your software.
decoding an invalid start byte
RunPod actually confirmed there was an issue due to enabling compression and it has now been resolved.
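The fix here was on RunPod's side, but as a purely client-side mitigation one could decode the stream incrementally instead of assuming each chunk is valid UTF-8 on its own. A hedged sketch (the streaming URL is a placeholder):
```python
# Hedged client-side sketch: use an incremental decoder so multi-byte
# characters split across chunks, or unexpected non-UTF-8 bytes, don't raise
# UnicodeDecodeError mid-stream.
import codecs
import requests

decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

with requests.get("https://example.com/stream", stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        text = decoder.decode(chunk)
        if text:
            print(text, end="", flush=True)
# Flush any bytes still buffered in the decoder at end of stream.
print(decoder.decode(b"", final=True))
```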
oh nice
Did you enable compression or what was that?
Is anyone interested in testing out whether the new image(haven’t merged into main yet) works with their codebase/usecase?
vLLM v0.4.2 based
Yeah sure. I can test.
Happy to test but need instructions haha
To see the error, you need to test Chat Completion instead of Text Completion.
You can use Silly Tavern or even Oobabooga
(Best to test Text and Chat)
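For anyone following along, here is a rough sketch of testing both routes through worker-vllm's OpenAI-compatible API. The base_url pattern and model name are assumptions, so substitute your own endpoint ID and the model your endpoint actually serves:
```python
# Hedged sketch: exercise both Text Completion and Chat Completion against a
# worker-vllm deployment via its OpenAI-compatible route.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",  # placeholder
)

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # whatever your endpoint serves

# Text completion: you provide the raw prompt string yourself.
text = client.completions.create(
    model=MODEL,
    prompt="The capital of France is",
    max_tokens=32,
)
print(text.choices[0].text)

# Chat completion: the server builds the prompt from messages.
chat = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(chat.choices[0].message.content)
```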
And considering text completion is set to be deprecated....
Can you elaborate on that?
Not now, but eventually the protocol will lose the text completion endpoint: https://platform.openai.com/docs/api-reference/introduction
Are you sure
https://api.openai.com/v1/completions
gets deprecated? I don't see that being said there.
Not at the moment
Ahh interesting. Thanks. So it's recommended to use
https://api.openai.com/v1/chat/completions
?
Exactly! So we really need that to be compatible
Did you log a Github issue for this?
Both are compatible
The Completions endpoint is for completion (base) models, e.g. Llama 3 8B
Chat Completions is for chat/instruction models, e.g. Llama 3 8B Instruct
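To illustrate the difference: the Chat Completions route builds the prompt from your messages using the model's chat template (which is where Llama 3's <|eot_id|> terminator comes from), while with plain Completions you format the prompt string yourself. A small sketch, assuming transformers is installed and you have access to the gated Llama 3 repo:
```python
# Show the prompt string that the chat template produces from a messages list.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": "Say hello."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# Roughly:
# <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nSay hello.<|eot_id|>
# <|start_header_id|>assistant<|end_header_id|>\n\n
```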
I'd be happy to test the vLLM upgrade as well, if you still need people
To test, change image to
alpayariyakrunpod/worker-vllm:1.0.0-cuda12.1.0
Everything else about the endpoint can stay the same
Would you like us to test by creating an image with a baked-in model as well?
No baked in is fine for now
Fixed!
@here, I have just merged the vLLM 0.4.2 update into main. You can use it by changing the Docker image in your endpoint from runpod/worker-vllm:stable-cudaX.X.X to runpod/worker-vllm:dev-cudaX.X.X, the key change being dev instead of stable. From my testing thus far, everything seems in order, but if you notice any issues, please let me know. After an initial test period, I'll release the update officially to replace the default stable images. Thanks all!
I can confirm your patched image alpayariyakrunpod/worker-vllm:1.0.0-cuda12.1.0 is now working with chat completions.
I didn't test the dev-cuda image.
Thanks
Thanks for the feedback! stable should've been working as well, is it not?
Is stable now released with the latest version (vLLM 0.4.2)? If that's the case, I can test it again this afternoon. Let me know.