Length of output of serverless meta-llama/Llama-3.1-8B-Instruct
When I submit a request, the response is always truncated to 100 tokens. Setting "max_tokens" or "max_new_tokens" has no effect.
How do I control the number of output tokens?
input:
{
  "input": {
    "prompt": "Give a pancake recipe"
  },
  "max_tokens": 5000,
  "temperature": 1
}
output:
{
  "delayTime": 1048,
  "executionTime": 2593,
  "id": "c444e5bb-aeca-4489-baf3-22bbe848b48c-e1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            " that is made with apples and cinnamon, and also includes a detailed outline of instructions that can be make it.\nTODAY'S PANCAKE RECIPE\n\n\"A wonderful breakfast or brunch food that's made with apples and cinnamon.\"\n\nINGREDIENTS\n4 large flour\n2 teaspoons baking powder\n1/4 teaspoon cinnamon\n1/2 teaspoon salt\n1/4 cup granulated sugar\n1 cup milk\n2 large eggs\n1 tablespoon unsalted butter, melted\n1 large apple,"
          ]
        }
      ],
      "usage": {
        "input": 6,
        "output": 100
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "l0efghtlo64wf5"
}
3 Replies
You can use the OpenAI SDK:
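A minimal sketch, assuming your endpoint exposes the OpenAI-compatible route described in the worker-vllm README; the endpoint ID is a placeholder and the API key is read from an environment variable:

import os
from openai import OpenAI

# Assumed OpenAI-compatible route for a worker-vllm serverless endpoint;
# replace <ENDPOINT_ID> with your own endpoint ID.
client = OpenAI(
    api_key=os.environ["RUNPOD_API_KEY"],
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give a pancake recipe"}],
    max_tokens=5000,   # respected here, unlike a top-level field in the native request
    temperature=1,
)
print(response.choices[0].message.content)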
If you don't want to use the SDK, change your native request so the sampling options are nested inside "input":
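Per the sampling-parameters section of the README linked in the solution below, the worker only reads options nested under input.sampling_params; top-level fields like the max_tokens in the original request are ignored, so generation falls back to the worker's default length (the 100 tokens seen above). A corrected request with the same prompt and values would look like this:

{
  "input": {
    "prompt": "Give a pancake recipe",
    "sampling_params": {
      "max_tokens": 5000,
      "temperature": 1
    }
  }
}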
Solution
https://github.com/runpod-workers/worker-vllm/?tab=readme-ov-file#sampling-parameters
You need to expand the two collapsed sections there to see the full parameter reference.