How to set max output tokens
Hi, I deployed a finetuned Llama 3 via vLLM serverless on RunPod. However, I'm getting a limited number of output tokens every time. Does anyone know if we can set the max output tokens in the input prompt JSON?
7 Replies
vllm doesn't support llama 3.1 yet
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
they're working to update vllm-worker
I'm not using llama 3.1, it's the old llama 3
hmm okay max output tokens?
i think the max output tokens can be changed if you use the openai pip package to send requests to runpod serverless
try using that
you can modify max output tokens
Are you asking how to set the Max Model Length parameter inside the vLLM worker? It is under LLM Settings.
No, that's more related to the context length, right? I'm talking about output tokens.
This should do the job, let me try this
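For reference, the raw worker input JSON can also carry vLLM sampling parameters directly; a sketch assuming the standard worker-vllm request shape (prompt and values are placeholders):

```json
{
  "input": {
    "prompt": "Tell me about Llama 3.",
    "sampling_params": {
      "max_tokens": 1024,
      "temperature": 0.7
    }
  }
}
```

With this, `max_tokens` inside `sampling_params` caps the output length without going through the OpenAI-compatible route.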