R
RunPod9mo ago
houmie

Which version of vLLM is installed on Serverless?

There is currently a bug in vLLM that causes Llama 3 to not use the stop tokens correctly. This has been fixed in v0.4.1: https://github.com/vllm-project/vllm/issues/4180#issuecomment-2074017550 I was wondering which version of vLLM is installed on Serverless. Thanks
43 Replies
houmie
houmieOP9mo ago
I can clearly replicate this issue on vLLM Serverless: the stop tokens don't work and keep getting regenerated, e.g. <|eot_id|><|start_header_id|>assistant<|end_header_id|>. This is a showstopper for us, as we need to host Llama 3 in a functioning state. As it stands, unless RunPod upgrades its vLLM to v0.4.1, this bug will recur every time.
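A minimal client-side sketch of a stopgap for the symptom described above, assuming the leaked special tokens arrive as plain text in the response. The helper name `truncate_at_stop` and the sample string are hypothetical. The model still generates the extra tokens server-side, so this does not replace the vLLM upgrade.

```python
# Llama 3 special tokens seen leaking into the output in the report above.
STOP_MARKERS = ("<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>")


def truncate_at_stop(text: str, markers=STOP_MARKERS) -> str:
    """Return text cut at the earliest stop marker, or unchanged if none occurs."""
    positions = [text.find(m) for m in markers if m in text]
    return text[: min(positions)] if positions else text


# Hypothetical response that ran past the end of the assistant turn.
leaked = "Paris.<|eot_id|><|start_header_id|>assistant<|end_header_id|>Paris."
print(truncate_at_stop(leaked))
```

This only hides the repeated turns from the end user; the wasted generation time is still incurred on the endpoint.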
Madiator2011
Madiator20119mo ago
Responded on Zendesk.
arthrod.
arthrod.9mo ago
Having the same issue..
houmie
houmieOP9mo ago
I'm going to share the Zendesk answer here so more people can benefit from it:
From what I see in the repo, it's running version 0.3.3
https://github.com/runpod-workers/worker-vllm
The worker-vllm might be 0.3.3, but does it contain the latest vLLM (0.4.1)? I suppose we need to ping the maintainer of worker-vllm for that. Can someone from RunPod confirm that at least the worker is always pulling the latest or is it fixed on 0.3.3? Thanks
nerdylive
nerdylive9mo ago
I think it's fixed to that version, yeah. I'll try to make a pull request later.
houmie
houmieOP9mo ago
Thank you, that would be amazing!!! Should I also create a feedback for this task here?
nerdylive
nerdylive9mo ago
Hmm, I see there's already a PR for updating the version. When I last checked, I couldn't tell whether it had been merged yet, so let's just wait.
houmie
houmieOP9mo ago
Sure, thanks.
arthrod.
arthrod.9mo ago
Guys, this is a pretty serious issue... One of the main models out there can't be served by RunPod..
digigoblin
digigoblin9mo ago
You can serve it, just not with the vLLM worker. It's better to log issues on GitHub for this kind of thing than on Discord.
digigoblin
digigoblin9mo ago
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
digigoblin
digigoblin9mo ago
I see there is already an issue for it - https://github.com/runpod-workers/worker-vllm/issues/66
GitHub
How can i update to vLLM v0.4.1 for llama3 support ? · Issue #66 · ...
Hello everyone, I would like to update the vLLM version to v0.4.1 in order to get access to LLAMA3 but i don't know how modify the fork runpod/vllm-fork-for-sls-worker. Could you please guide m...
Madiator2011
Madiator20119mo ago
@digigoblin there should be an update out today or tomorrow
arthrod.
arthrod.9mo ago
Thank you! Pretty please
Alpay Ariyak
Alpay Ariyak9mo ago
had a few blockers, releasing later tonight or tomorrow afternoon
Builderman
Builderman9mo ago
@Alpay Ariyak hey, I've been using worker-vllm for months now and streaming broke all of a sudden, with no changes on my end. Not sure if RunPod is down.
base_events.py :1771 2024-05-08 01:23:56,733 Task exception was never retrieved
2024-05-08T01:23:56.735080453Z future: <Task finished name='Task-8514' coro=<_process_job() done, defined at /usr/local/lib/python3.11/dist-packages/runpod/serverless/worker.py:41> exception=UnicodeDecodeError('utf-8', b'\xff\xff\xff\xff\xff\x00\x06', 0, 1, 'invalid start byte')>
2024-05-08T01:23:56.735088543Z Traceback (most recent call last):
2024-05-08T01:23:56.735095700Z File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/worker.py", line 55, in _process_job
2024-05-08T01:23:56.735105243Z await stream_result(session, stream_output, job)
2024-05-08T01:23:56.735118162Z File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_http.py", line 76, in stream_result
2024-05-08T01:23:56.735128930Z await _handle_result(session, job_data, job, JOB_STREAM_URL, "Intermediate results sent.")
2024-05-08T01:23:56.735138994Z File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_http.py", line 50, in _handle_result
2024-05-08T01:23:56.735158189Z await _transmit(session, url, serialized_job_data)
2024-05-08T01:23:56.735169194Z File "/usr/local/lib/python3.11/dist-packages/runpod/serverless/modules/rp_http.py", line 37, in _transmit
2024-05-08T01:23:56.735178882Z await client_response.text()
2024-05-08T01:23:56.735187102Z File "/usr/local/lib/python3.11/dist-packages/aiohttp/client_reqrep.py", line 1147, in text
2024-05-08T01:23:56.735196332Z return self._body.decode( # type: ignore[no-any-return,union-attr]
2024-05-08T01:23:56.735204926Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-05-08T01:23:56.735211906Z UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
nerdylive
nerdylive9mo ago
"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" It's not from RunPod, it's a problem in your software: decoding an invalid start byte.
digigoblin
digigoblin9mo ago
RunPod actually confirmed there was an issue due to enabling compression and it has now been resolved.
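A small sketch that reproduces the error from the traceback, using the exact bytes shown in the exception. It illustrates why a response body that is not plain UTF-8 text (for example, one that was compressed or otherwise binary) fails when decoded as text. The helper `safe_text` is hypothetical and is not part of the runpod SDK or aiohttp.

```python
# Bytes copied from the UnicodeDecodeError in the traceback above.
raw = b"\xff\xff\xff\xff\xff\x00\x06"


def safe_text(body: bytes, encoding: str = "utf-8") -> str:
    """Decode a body as text, substituting U+FFFD for undecodable bytes."""
    try:
        return body.decode(encoding)
    except UnicodeDecodeError:
        return body.decode(encoding, errors="replace")


try:
    raw.decode("utf-8")
    message = ""
except UnicodeDecodeError as exc:
    # 0xff can never start a valid UTF-8 sequence, so strict decoding fails.
    message = str(exc)

print(message)
print(repr(safe_text(raw)))
```

Lenient decoding like this only keeps a log line readable; it does not recover the original payload, which is why the fix had to happen on the server side.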
nerdylive
nerdylive9mo ago
oh nice
houmie
houmieOP9mo ago
Did you enable compression or what was that?
Alpay Ariyak
Alpay Ariyak9mo ago
Is anyone interested in testing whether the new image (not merged into main yet) works with their codebase/use case? It's based on vLLM v0.4.2.
houmie
houmieOP9mo ago
Yeah sure. I can test.
arthrod.
arthrod.9mo ago
Happy to test but need instructions haha
houmie
houmieOP9mo ago
To see the error, you need to test Chat Completions instead of Text Completions. You can use SillyTavern or even Oobabooga (best to test both Text and Chat).
arthrod.
arthrod.9mo ago
And considering text completion is set to be deprecated....
houmie
houmieOP9mo ago
Can you elaborate on that?
arthrod.
arthrod.9mo ago
Not now, but eventually the protocol will lose the text completion endpoint: https://platform.openai.com/docs/api-reference/introduction
houmie
houmieOP9mo ago
Are you sure https://api.openai.com/v1/completions gets deprecated? I don't see that being said there.
arthrod.
arthrod.9mo ago
Not at the moment
houmie
houmieOP9mo ago
Ahh interesting. Thanks. So it's recommended to use https://api.openai.com/v1/chat/completions ?
arthrod.
arthrod.9mo ago
Exactly! So we really need that compatible
digigoblin
digigoblin9mo ago
Did you log a Github issue for this?
Alpay Ariyak
Alpay Ariyak9mo ago
Both are compatible. The Completions endpoint is for completion (base) models, e.g. Llama 8B; Chat Completions is for chat/instruction models, e.g. Llama 8B Instruct.
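A rough sketch of how the request bodies differ between the two OpenAI-style endpoints: Completions takes a raw `prompt` string, while Chat Completions takes a list of role-tagged `messages`. The helper names, model IDs, and `max_tokens` value are illustrative assumptions, and the code only builds the JSON payloads without sending them.

```python
import json


def completion_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Body for POST /v1/completions (base models): raw text in `prompt`."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def chat_payload(model: str, user_message: str, max_tokens: int = 64) -> dict:
    """Body for POST /v1/chat/completions (instruct models): role-tagged messages."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


# Model IDs below are examples only; use whatever your endpoint actually serves.
base = completion_payload("meta-llama/Meta-Llama-3-8B", "The capital of France is")
chat = chat_payload("meta-llama/Meta-Llama-3-8B-Instruct", "What is the capital of France?")

print(json.dumps(base, indent=2))
print(json.dumps(chat, indent=2))
```

With the chat endpoint the server applies the model's chat template, which is where the stop-token bug discussed in this thread shows up.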
Alpay Ariyak
Alpay Ariyak9mo ago
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
nuck
nuck9mo ago
I'd be happy to test the vLLM upgrade if you still need people
Alpay Ariyak
Alpay Ariyak9mo ago
To test, change the image to alpayariyakrunpod/worker-vllm:1.0.0-cuda12.1.0. Everything else about the endpoint can stay the same.
nuck
nuck9mo ago
Would you like us to test by creating an image with a baked in model as well?
Alpay Ariyak
Alpay Ariyak9mo ago
No baked in is fine for now
arthrod.
arthrod.9mo ago
Fixed!
Alpay Ariyak
Alpay Ariyak9mo ago
@here, I have just merged the vLLM 0.4.2 update into main, you can use it by changing your Docker image in your endpoint from runpod/worker-vllm:stable-cudaX.X.X to runpod/worker-vllm:dev-cudaX.X.X, key change being dev instead of stable. From my testing thus far, everything seems in order, but if you notice any issues, please let me know. After an initial test period, I'll release the update officially to replace the default stable images. Thanks all!
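A tiny sketch of the tag change described above, for anyone scripting endpoint updates. The function name `switch_channel` is hypothetical, and the CUDA version in the example is assumed only for illustration; substitute the version your endpoint already uses.

```python
def switch_channel(image: str, to: str = "dev") -> str:
    """Swap the stable/dev prefix of a worker-vllm image tag, keeping the CUDA suffix."""
    repo, colon, tag = image.partition(":")
    channel, dash, rest = tag.partition("-")
    if not colon or not dash or channel not in ("stable", "dev"):
        raise ValueError(f"unexpected image tag format: {image!r}")
    return f"{repo}:{to}-{rest}"


# The CUDA version here is an assumed example, not a confirmed published tag.
print(switch_channel("runpod/worker-vllm:stable-cuda12.1.0"))
```

Switching back after the official release would be the same call with `to="stable"`.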
houmie
houmieOP9mo ago
I can confirm your patched image alpayariyakrunpod/worker-vllm:1.0.0-cuda12.1.0 is now working with chat/completions. I didn't test the dev-cuda image. Thanks
Alpay Ariyak
Alpay Ariyak9mo ago
Thanks for the feedback, stable should've been working as well, it's not?
houmie
houmieOP9mo ago
Is stable now released with the latest version (vLLM 0.4.2)? if that's the case I can test it again this afternoon. Let me know.
