How to set a max output token limit

Hi, I deployed a fine-tuned Llama 3 via vLLM serverless on RunPod. However, I'm getting a limited number of output tokens every time. Does anyone know if we can alter the max output tokens in the input prompt JSON?
7 Replies
Madiator2011 · 5mo ago
vLLM does not yet support Llama 3.1
Madiator2011 · 5mo ago
GitHub — runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
nerdylive · 5mo ago
they're working on updating worker-vllm
Heartthrob10 (OP) · 5mo ago
I'm not using Llama 3.1, it's the old Llama 3
nerdylive · 5mo ago
Hmm, okay, max output tokens? I think the output-token limit can be changed if you use the openai pip package to send requests to RunPod serverless. Try that; it lets you set the max output tokens per request.
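A minimal sketch of what that request would look like. The base URL follows RunPod's OpenAI-compatible serverless route; `<ENDPOINT_ID>`, the API key, and the model name are placeholders you'd fill in from your own deployment. Rather than making a live call, this just builds the request body to show where the output-token cap goes:

```python
# Sketch: the arguments you'd pass to client.chat.completions.create()
# when calling a RunPod serverless vLLM endpoint via its
# OpenAI-compatible route. With the openai pip package it looks like:
#
#   from openai import OpenAI
#   client = OpenAI(
#       api_key="<RUNPOD_API_KEY>",
#       base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
#   )
#   client.chat.completions.create(**request_body)
#
# Here we only build the body, so the example runs without a network call.
import json

request_body = {
    "model": "<your-finetuned-llama-3>",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 512,  # hard cap on generated (output) tokens
}
print(json.dumps(request_body, indent=2))
```

`max_tokens` here limits only the generated completion, not the prompt; the prompt plus completion still has to fit inside the model's context window.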
PatrickR · 5mo ago
Are you asking how to set the Max Model Length parameter inside the vLLM worker? It is under LLM Settings.
Heartthrob10 (OP) · 5mo ago
No, that's more relevant to the context length, right? I'm talking about output tokens. The openai package approach should do the job, let me try it.
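For completeness, the cap can also be set in the raw input JSON sent to the worker's native `/run` endpoint. The field names below follow the runpod-workers/worker-vllm README (`sampling_params` is forwarded to vLLM's sampling parameters), but treat the exact schema as an assumption and check the repo for your worker version:

```python
# Sketch (assumed worker-vllm input schema): the request body for the
# native RunPod serverless /run or /runsync endpoint, with the
# output-token cap inside sampling_params.
import json

payload = {
    "input": {
        "prompt": "Explain max output tokens in one sentence.",
        "sampling_params": {
            "max_tokens": 1024,   # per-request cap on generated tokens
            "temperature": 0.7,
        },
    }
}
print(json.dumps(payload, indent=2))
```

This is separate from the Max Model Length setting PatrickR mentioned, which bounds the total context (prompt + output) the worker will accept.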