My output is restricted to a limited number of tokens
I have deployed Llama 3.1 8B on serverless vLLM. Whenever I hit the request, the response is always cut off after a limited number of tokens. Help me with this.
2 Replies
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
The documentation is bad and doesn't explain why this happens. Here is why: https://discord.com/channels/912829806415085598/1279829584749138109
Even the placement of "max_tokens" in the examples there is wrong.