My output is restricted to a limited number of tokens

I have deployed Llama 3.1 8B on serverless vLLM. When I hit the request endpoint, the response is always limited to a small number of tokens. Help me with this.
2 Replies
Madiator2011 (Work)
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
lostdev · 3mo ago
The documentation is bad and doesn't explain why this happens. Here is why: https://discord.com/channels/912829806415085598/1279829584749138109 Even the placement of "max_tokens" in the examples there is wrong.
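A minimal sketch of the usual fix, assuming the standard worker-vllm request schema: vLLM's SamplingParams defaults max_tokens to a small value (16), so unless you set it yourself inside "sampling_params" (not at the top level of "input"), the output gets cut off. The endpoint ID, API key, and parameter values below are placeholders; check the worker-vllm README for the exact schema your worker expects.

```python
# Hedged sketch: requesting a RunPod serverless vLLM endpoint with an explicit
# max_tokens so the completion is not truncated at the default length.
import requests

RUNPOD_API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder
ENDPOINT_ID = "YOUR_ENDPOINT_ID"         # placeholder

url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {
        "prompt": "Explain the difference between a process and a thread.",
        # max_tokens must live inside sampling_params; if it is omitted,
        # vLLM falls back to its small default and the reply looks cut off.
        "sampling_params": {
            "max_tokens": 1024,
            "temperature": 0.7,
        },
    }
}

response = requests.post(
    url,
    json=payload,
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    timeout=300,
)
print(response.json())
```

If you use the worker's OpenAI-compatible route instead, the equivalent is passing max_tokens in the chat/completions request body.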