My output is restricted to a limited number of tokens
I have deployed Llama 3.1 8B on serverless vLLM. Whenever I hit the request, the response is always cut off after a limited number of tokens. Help me with this.
2 Replies
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
The documentation is bad and doesn't explain why this happens. Here is why: https://discord.com/channels/912829806415085598/1279829584749138109
Even the placement of "max_tokens" in the examples there is wrong.