Running a Specific Model Revision on the Serverless vLLM Worker
How do I specify the model revision on serverless? I was looking through the readme in https://github.com/runpod-workers/worker-vllm and I see I can build a docker image with the revision I want, but is that the only way to go about this?
Specifically, I want to set up this Hugging Face model: https://huggingface.co/anthracite-org/magnum-v2-123b-exl2
edit: fixed the model link
@nalak when you create the endpoint, you can configure environment variables. One of them is called MODEL_NAME, and it accepts any supported model you want from HF. So what you can do is set:
MODEL_NAME = anthracite-org/magnum-v2-123b-gguf
wait my bad I posted the wrong link
You can also use "Quick Deploy" when you go into "Serverless". There we have a wizard to set up the endpoint called "Serverless vLLM". The result is the same thing in the end.
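To make the idea concrete: conceptually, the worker just reads that variable and hands it to vLLM. A minimal sketch (not the actual worker-vllm code; the model name is just an example):

```python
# Minimal sketch: an endpoint env variable like MODEL_NAME becomes the model
# argument for vLLM. The real worker-vllm handler does much more than this.
import os

from vllm import LLM

model_name = os.environ["MODEL_NAME"]  # e.g. "anthracite-org/magnum-v2-123b-gguf"
llm = LLM(model=model_name)
```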
it's empty without a revision (the weights only live on the revision branches)
so it just loads nothing
AHH I see, you mean you want to change to a specific branch?
yeah
I thought they were called revisions on HF; are they just branches like in Git?
As HF is also just a Git provider, I would just call this a branch. I think what the model owners mean is that you can get a specific revision of their model, but they use Git branches to distribute those. (At least this is how I understand it.)
that sounds correct to me yeah
is there a configuration option somewhere for the branch/revision?
I found this, but then I'd have to build the 40 GB image and put it somewhere
According to the vLLM docs:
Revision: The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
uh, I can't find that on the page
This was from the official docs: https://docs.vllm.ai/en/v0.3.3/models/engine_args.html#cmdoption-revision
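Illustrated with a direct vLLM call (just a local sketch, not the serverless worker; the values are placeholders):

```python
# Sketch: pinning a model to a branch, tag, or commit via vLLM's `revision`
# engine argument. The model name and revision below are placeholder values.
from vllm import LLM

llm = LLM(
    model="some-org/some-model",
    revision="some-branch-or-commit-id",  # branch name, tag name, or commit id
)
```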
ahhhhhhh, got it, nice
ok so looking at the code of worker-vllm, I think we just forgot to add this to the README, but it seems that setting this via an env variable also works:
MODEL_REVISION
worker-vllm/src/download_model.py at 2111c9e7a509ae90a285f99fabbebd...
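Based on that file, I'd expect the variable to flow into the download roughly like this (a simplified sketch, assuming the worker downloads via huggingface_hub; not the exact worker code):

```python
# Simplified sketch of the idea behind src/download_model.py: pin the model
# download to the branch/commit given in MODEL_REVISION.
import os

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id=os.environ["MODEL_NAME"],
    revision=os.getenv("MODEL_REVISION", "main"),  # branch, tag, or commit id
)
```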
so if you have the time, can you please try this?
I tried that and it didn't seem to do anything actually
ok thank you, then this is a bug. It should work as far as I understand it
I may have misconfigured something, but I was getting this error message, so I presume the MODEL_REVISION var was ignored
would you mind showing me the env variables that you have configured?
just like this, right?
yes, this should be totally fine
could you also please share the exact docker image that you used?
then I'm opening a bug in our repo to get this fixed
I'm just using the vanilla worker-vllm image
ok perfect, thank you
then I'm afraid the only solution for RIGHT NOW is to either build the image yourself OR create a copy of the repo on HF under your account and put the model revision you want on main
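For the second option, here's a rough sketch of how you could copy a specific revision into your own HF repo so that its main branch is the revision you want (repo names and the revision are placeholders):

```python
# Sketch: download the revision you want, then push it to a new repo under
# your account so that its main branch *is* that revision.
from huggingface_hub import HfApi, snapshot_download

local_dir = snapshot_download(
    repo_id="original-org/original-model",
    revision="the-branch-you-want",
)

api = HfApi()
api.create_repo("your-username/original-model-pinned", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=local_dir,
    repo_id="your-username/original-model-pinned",
    repo_type="model",
)
```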
oof
😦
but I will create the bug report now and push this internally
I'll just wait for the fix, not in that big of a hurry
thanks for the support
While creating the issue on GitHub, I also tried to find out what we have to do, and it looks like both of these env variables must be set:
* MODEL_REVISION
* TOKENIZER_REVISION
MODEL_REVISION & TOKENIZER_REVISION: Both are needed to configure t...
After I configured both, it was able to load the model at the desired revision
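So for reference, the pair of env variables on the endpoint looks like this (placeholder values, not the exact ones I tested):

```python
# The endpoint env variables that pinned the revision in my test.
# Values are placeholders; use the branch/tag/commit you actually need.
endpoint_env = {
    "MODEL_NAME": "some-org/some-model",
    "MODEL_REVISION": "your-branch-or-commit",
    "TOKENIZER_REVISION": "your-branch-or-commit",  # required in addition to MODEL_REVISION
}
```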
BUT the model you want uses the "exl2" quantization method, which is not supported by vLLM yet: https://github.com/vllm-project/vllm/issues/3203
ExLlamaV2: exl2 support · Issue #3203 · vllm-project/vllm
okay, I see
so I'd basically need to set up the container on my own with the proper deps to run the model
fuck
thanks
If you want to run this model with this quantization method, then you can't use vLLM right now
I'm not sure if there is any other inference server that supports this, but if you come across one, please let us know so that we can also add it to our stack