Running a Specific Model Revision on the Serverless vLLM Worker
How do I specify the model revision on serverless? I was looking through the readme in https://github.com/runpod-workers/worker-vllm and I see I can build a docker image with the revision I want, but is that the only way to go about this?
Specifically, I want to set up this Hugging Face model: https://huggingface.co/anthracite-org/magnum-v2-123b-exl2
edit: fixed the model link
@nalak when you create the endpoint, you can configure environment variables. One of them is called MODEL_NAME, and it accepts any supported model you want from HF. So what you can do is set:
MODEL_NAME = anthracite-org/magnum-v2-123b-gguf
wait my bad I posted the wrong link
You can also use "Quick Deploy" when you go into "Serverless". There we have a wizard to set up the endpoint called "Serverless vLLM". The result is the same thing in the end.
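To make the idea concrete: conceptually, the worker just reads that variable and hands it to vLLM. A minimal sketch (not the actual worker-vllm code; the model name is just an example):

```python
# Minimal sketch: an endpoint env variable like MODEL_NAME becomes the model
# argument for vLLM. The real worker-vllm handler does much more than this.
import os

from vllm import LLM

model_name = os.environ["MODEL_NAME"]  # e.g. "anthracite-org/magnum-v2-123b-gguf"
llm = LLM(model=model_name)
```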
it's empty without a revision (the weights only live on the revision branches)
so it just loads nothing
AHH I see, you mean you want to change to a specific branch?
yeah
I thought they were called revisions on HF; are they just branches like in Git?
As HF is also just a Git provider, I would just call this a branch. I think what the model owners mean is that you can get a specific revision of their model, but they use Git branches to distribute those. (At least this is how I understand it.)
that sounds correct to me yeah
is there a configuration option somewhere for the branch/revision?
I found this, but then I'd have to build the 40 GB image and put it somewhere
According to the vLLM docs:
Revision: The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
uh, I can't find that on the page
This was from the official docs: https://docs.vllm.ai/en/v0.3.3/models/engine_args.html#cmdoption-revision
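Illustrated with a direct vLLM call (just a local sketch, not the serverless worker; the values are placeholders):

```python
# Sketch: pinning a model to a branch, tag, or commit via vLLM's `revision`
# engine argument. The model name and revision below are placeholder values.
from vllm import LLM

llm = LLM(
    model="some-org/some-model",
    revision="some-branch-or-commit-id",  # branch name, tag name, or commit id
)
```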
ahhhhhhh, got it, nice
ok so looking at the code of worker-vllm, I think we just forgot to add this to the README, but it seems that setting this via an env variable also works:
MODEL_REVISION
worker-vllm/src/download_model.py at 2111c9e7a509ae90a285f99fabbebd...
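Based on that file, I'd expect the variable to flow into the download roughly like this (a simplified sketch, assuming the worker downloads via huggingface_hub; not the exact worker code):

```python
# Simplified sketch of the idea behind src/download_model.py: pin the model
# download to the branch/commit given in MODEL_REVISION.
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],
    revision=os.getenv("MODEL_REVISION", "main"),  # branch, tag, or commit id
)
```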
so if you have the time, can you please try this?
I tried that and it didn't seem to do anything actually
ok thank you, then this is a bug. It should work as far as I understand it
I may have misconfigured something, but I was getting this error message, so I presume the MODEL_REVISION var was ignored
would you mind showing me the env variables that you have configured?
just like this, right?
yes, this should be totally fine
could you also please share the exact docker image that you used?
then I'm opening a bug in our repo to get this fixed
I'm just using the vanilla worker-vllm image
ok perfect, thank you
then I'm afraid the only solution for RIGHT NOW is to either build the image yourself OR create a copy of the repo on HF under your account and put the model revision you want on main
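For the second option, here's a rough sketch of how you could copy a specific revision into your own HF repo so that its main branch is the revision you want (repo names and the revision are placeholders):

```python
# Sketch: download the revision you want, then push it to a new repo under
# your account so that its main branch *is* that revision.
from huggingface_hub import HfApi, snapshot_download

local_dir = snapshot_download(
    repo_id="original-org/original-model",
    revision="the-branch-you-want",
)

api = HfApi()
api.create_repo("your-username/original-model-pinned", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=local_dir,
    repo_id="your-username/original-model-pinned",
    repo_type="model",
)
```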
oof
😦
but I will create the bug report now and push this internally
I'll just wait for the fix, not in that big of a hurry
thanks for the support
While creating the issue on GitHub, I also tried to find out what we have to do, and it looks like both of these env variables must be set:
* MODEL_REVISION
* TOKENIZER_REVISION
MODEL_REVISION & TOKENIZER_REVISION: Both are needed to configure t...
After I configured both, it was able to load the model at the desired revision
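So for reference, the pair of env variables on the endpoint looks like this (placeholder values, not the exact ones I tested):

```python
# The endpoint env variables that pinned the revision in my test.
# Values are placeholders; use the branch/tag/commit you actually need.
endpoint_env = {
    "MODEL_NAME": "some-org/some-model",
    "MODEL_REVISION": "your-branch-or-commit",
    "TOKENIZER_REVISION": "your-branch-or-commit",  # required in addition to MODEL_REVISION
}
```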
BUT the model you want uses the "exl2" quantization method, which is not supported by vLLM yet: https://github.com/vllm-project/vllm/issues/3203
ExLlamaV2: exl2 support · Issue #3203 · vllm-project/vllm
okay, I see
so I'd basically need to set up the container on my own with the proper deps to run the model
fuck
thanks
If you want to run this model with this quantization method, then you can't use vLLM right now
I'm not sure if there is any other inference server that supports this, but if you come across one, please let us know so that we can also add it to our stack