RunPod•3w ago
houmie

RUNPOD_API_KEY and MAX_CONTEXT_LEN_TO_CAPTURE

We are also starting a vLLM project and I have two questions: 1) In the environment variables, do I have to define RUNPOD_API_KEY with my own secret key to access the final vLLM OpenAI endpoint? 2) Isn't MAX_CONTEXT_LEN_TO_CAPTURE now deprecated? Do we still need to provide it if MAX_MODEL_LEN is already set? Thank you
14 Replies
houmie
houmie•3w ago
After some trial and error, I figured out the answer to 1): RUNPOD_API_KEY has no effect. We need to use the actual API key that can be generated under Account -> Settings to access the OpenAI URL.
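For reference, a minimal sketch of calling the worker-vllm OpenAI-compatible URL with a console-generated RunPod API key; the endpoint ID, model name, and placeholder key below are assumptions, not values from this thread:

```python
# Minimal sketch: calling a RunPod serverless vLLM worker through its
# OpenAI-compatible URL. The API key is the one generated in the RunPod
# console (Account -> Settings), not a custom RUNPOD_API_KEY env value.
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_GENERATED_API_KEY",  # key from the RunPod console (placeholder)
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # your endpoint ID
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # whatever MODEL_NAME your worker serves
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```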
I'm still not quite certain how to set the model length. I'm getting this error right now:
ValueError: User-specified max_model_len (16384) is greater than the derived max_model_len (max_position_embeddings=8192 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. Make sure the value
Llama-3 supports 8192 tokens, but I was expecting it to use RoPE scaling to automatically increase that. Is this not how it's done? RoPE scaling is supported in vLLM: https://github.com/vllm-project/vllm/pull/555
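For illustration, a hedged sketch of how a longer context might be requested through vLLM's rope_scaling engine argument in the offline API; the exact dictionary keys ("type", "factor") vary between vLLM versions and are an assumption here, and whether worker-vllm exposes this as an environment variable is a separate question:

```python
# Sketch only: ask vLLM for 2x the native Llama-3 context by applying
# NTK-style dynamic RoPE scaling. Key names in rope_scaling are assumed
# and may differ depending on the installed vLLM version.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=16384,                               # 2x the native 8192
    rope_scaling={"type": "dynamic", "factor": 2.0},   # assumed format
)
```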
nerdylive
nerdylive•3w ago
Yes, per that ValueError about the user-specified max_model_len (16384): set MAX_MODEL_LEN in your env to 8192. Oh, not sure how the RoPE scaling part works though.
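Roughly what the check behind that ValueError does, as a simplified, hypothetical sketch (the real vLLM logic also accounts for rope_scaling and other config fields): the requested length is compared against max_position_embeddings from the model's config.json.

```python
# Simplified, hypothetical version of vLLM's max_model_len derivation.
import json

def derive_max_model_len(config_path: str, requested: int | None) -> int:
    with open(config_path) as f:
        cfg = json.load(f)
    # The usable length is capped by the model's positional embedding limit.
    derived = cfg.get("max_position_embeddings", 2048)
    if requested is None:
        return derived
    if requested > derived:
        raise ValueError(
            f"User-specified max_model_len ({requested}) is greater than "
            f"the derived max_model_len ({derived})."
        )
    return requested

# With Llama-3's config.json (max_position_embeddings=8192):
# derive_max_model_len("config.json", 16384) -> ValueError
# derive_max_model_len("config.json", 8192)  -> 8192
```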
houmie
houmie•3w ago
Yeah, that is easily done with Aphrodite-engine: it can increase the model length (by using more memory). vLLM is quite limited here. But based on that PR it must be possible, just not so easy, I guess.
digigoblin
digigoblin•3w ago
You are right @nerdylive, but it's called MAX_MODEL_LEN. I don't see how it's possible to set max_model_len to a value that's higher than what the model supports; that doesn't make sense to me. @houmie @Alpay Ariyak is the best person to advise on this.
nerdylive
nerdylive•3w ago
I'll try to add support for RoPE
houmie
houmie•3w ago
In Aphrodite-engine I can set CONTEXT_LENGTH to 16384 and it automatically uses RoPE scaling; in return it requires more memory. See bullet point 3 (https://github.com/PygmalionAI/aphrodite-engine?tab=readme-ov-file#notes). I'm using that in production right now, so it really is possible 🙂 Guys, I really hope you can help me with question 1 about API keys. Is there a way I could define the API key for vLLM myself instead of having RunPod create it for me? This last one is quite urgent due to a migration.
nerdylive
nerdylive•3w ago
I'll try to apply that to the vLLM worker too. Will you try the image to test if it works?
houmie
houmie•3w ago
Of course, happy to help.
nerdylive
nerdylive•3w ago
Alright wait
houmie
houmie•3w ago
Thank you. And sorry, do you happen to know anything about the API-key issue? I hope there is a way.
digigoblin
digigoblin•3w ago
What is the API key issue? You have to generate an API key in the RunPod web console and use it to make requests. You can't use a custom API key; you have to use a RunPod one for RunPod serverless to function correctly.
digigoblin
digigoblin•3w ago
This is also pretty clear in the docs: https://github.com/runpod-workers/worker-vllm
houmie
houmie•3w ago
I see. Ok, so there is no way to set a custom key. Thanks
digigoblin
digigoblin•3w ago
Nope, not possible. Create your own backend as a proxy to serverless if you want to use custom API keys.
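As a rough illustration of that proxy idea, a minimal sketch assuming a FastAPI service; the endpoint URL, route, and key names are assumptions for illustration, not part of worker-vllm:

```python
# Sketch: a proxy that accepts your own custom API keys and forwards the
# request to the RunPod serverless endpoint with the real RunPod key.
import os

import httpx
from fastapi import FastAPI, Header, HTTPException, Request

RUNPOD_KEY = os.environ["RUNPOD_API_KEY"]               # real key from the RunPod console
ALLOWED_KEYS = {"my-custom-key-1", "my-custom-key-2"}   # your own custom keys (example)
RUNPOD_URL = "https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/chat/completions"

app = FastAPI()

@app.post("/v1/chat/completions")
async def proxy(request: Request, authorization: str = Header(default="")):
    # Validate the caller's custom key...
    if authorization.removeprefix("Bearer ") not in ALLOWED_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    body = await request.json()
    # ...then forward the request upstream with the RunPod key attached.
    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(
            RUNPOD_URL,
            json=body,
            headers={"Authorization": f"Bearer {RUNPOD_KEY}"},
        )
    return upstream.json()
```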