Created by ErezL on 3/30/2025 in #⚡|serverless
Length of output of serverless meta-llama/Llama-3.1-8B-Instruct
When I submit a request, the response is always capped at 100 output tokens; setting "max_tokens" or "max_new_tokens" has no effect. How do I control the number of output tokens?

input:
{
  "input": {
    "prompt": "Give a pancake recipe"
  },
  "max_tokens": 5000,
  "temperature": 1
}

output:
{
  "delayTime": 1048,
  "executionTime": 2593,
  "id": "c444e5bb-aeca-4489-baf3-22bbe848b48c-e1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            " that is made with apples and cinnamon, and also includes a detailed outline of instructions that can be make it.\nTODAY'S PANCAKE RECIPE\n\n\"A wonderful breakfast or brunch food that's made with apples and cinnamon.\"\n\nINGREDIENTS\n4 large flour\n2 teaspoons baking powder\n1/4 teaspoon cinnamon\n1/2 teaspoon salt\n1/4 cup granulated sugar\n1 cup milk\n2 large eggs\n1 tablespoon unsalted butter, melted\n1 large apple,"
          ]
        }
      ],
      "usage": {
        "input": 6,
        "output": 100
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "l0efghtlo64wf5"
}
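For comparison, here is a minimal sketch of the same request with the sampling parameters nested under "input" as a "sampling_params" object rather than at the top level of the payload; the "sampling_params" field name, endpoint URL, and placeholder credentials are assumptions about the RunPod vLLM worker's request schema, not taken from the thread:

```python
import requests

# Placeholder values; replace with your own endpoint ID and API key.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = "your-runpod-api-key"

payload = {
    "input": {
        "prompt": "Give a pancake recipe",
        # Sampling parameters nested inside "input"; "sampling_params" is an
        # assumed field name for how the vLLM worker reads these settings.
        "sampling_params": {
            "max_tokens": 5000,
            "temperature": 1,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120,
)
print(resp.json())
```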
Created by ErezL on 3/30/2025 in #⚡|serverless
I am trying to deploy a "meta-llama/Llama-3.1-8B-Instruct" model on Serverless vLLM
I do this with the maximum possible memory. After setup, I try to run the "hello world" sample, but the request gets stuck in the queue and I get "[error] worker exited with exit code 1" with no other error or message in the log. Is it even possible to run this model? What is the problem, and can it be resolved? (For the record, I did manage to run a much smaller model using the same procedure as above.)
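As an aside (not something stated in the thread), a silent exit code 1 with this model is often a download or memory problem: meta-llama/Llama-3.1-8B-Instruct is a gated repository, so the worker needs a Hugging Face token with access to it, and the 8B weights plus KV cache have to fit in the GPU's VRAM at the configured context length. A minimal sketch of a local access check, assuming the huggingface_hub client is installed; the token value is a placeholder:

```python
from huggingface_hub import model_info

# Hypothetical check: Llama 3.1 is a gated repo, so this call raises an
# authorization error if the token lacks access -- the same access the
# serverless worker needs in order to download the weights.
info = model_info("meta-llama/Llama-3.1-8B-Instruct", token="hf_your_token_here")
print(info.id)  # prints the repo id if the token can see the model
```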