ErezL · 4d ago

Length of output of serverless meta-llama/Llama-3.1-8B-Instruct

When I submit a request, the response is always 100 tokens. "max_tokens" and "max_new_tokens" have no effect. How do I control the number of output tokens?

Request:

{
  "input": {
    "prompt": "Give a pancake recipe"
  },
  "max_tokens": 5000,
  "temperature": 1
}

Response:

{
  "delayTime": 1048,
  "executionTime": 2593,
  "id": "c444e5bb-aeca-4489-baf3-22bbe848b48c-e1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            " that is made with apples and cinnamon, and also includes a detailed outline of instructions that can be make it.\nTODAY'S PANCAKE RECIPE\n\n\"A wonderful breakfast or brunch food that's made with apples and cinnamon.\"\n\nINGREDIENTS\n4 large flour\n2 teaspoons baking powder\n1/4 teaspoon cinnamon\n1/2 teaspoon salt\n1/4 cup granulated sugar\n1 cup milk\n2 large eggs\n1 tablespoon unsalted butter, melted\n1 large apple,"
          ]
        }
      ],
      "usage": {
        "input": 6,
        "output": 100
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "l0efghtlo64wf5"
}
3 Replies
Jason · 4d ago
You can use the OpenAI SDK:
import os
from openai import OpenAI

RUNPOD_ENDPOINT_ID = "<your-endpoint-id>"  # your serverless endpoint ID
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url=f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1",
)
response = client.completions.create(
    model=MODEL_NAME,
    prompt="Runpod is the best platform because",
    temperature=0.5,
    max_tokens=100,  # controls the maximum output length
)
print(response.choices[0].text)
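If you prefer the chat-style interface, here is a minimal sketch along the same lines, assuming the endpoint also exposes the OpenAI-compatible chat completions route (the endpoint ID and model name are placeholders):
import os
from openai import OpenAI

RUNPOD_ENDPOINT_ID = "<your-endpoint-id>"  # placeholder
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"

client = OpenAI(
    api_key=os.environ.get("RUNPOD_API_KEY"),
    base_url=f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/openai/v1",
)
response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "Give a pancake recipe"},
    ],
    max_tokens=3000,  # output length limit is respected here as well
    temperature=0.7,
)
print(response.choices[0].message.content)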
If you don't want to use the OpenAI SDK, change your request to this instead (note that max_tokens must go inside sampling_params under input):
Solution
Jason · 4d ago
{
  "input": {
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant."
      },
      {
        "role": "user",
        "content": "Explain llm models"
      }
    ],
    "sampling_params": {
      "max_tokens": 3000,
      "temperature": 0.7,
      "top_p": 0.95,
      "n": 1,
      "stream": false,
      "stop": [],
      "presence_penalty": 0,
      "frequency_penalty": 0,
      "logit_bias": {},
      "best_of": 1
    }
  }
}
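For reference, a minimal sketch of submitting that payload directly from Python, assuming the standard serverless /runsync route (the endpoint ID is a placeholder and only a subset of sampling_params is shown):
import os
import requests

RUNPOD_ENDPOINT_ID = "<your-endpoint-id>"  # placeholder

payload = {
    "input": {
        "messages": [
            {"role": "system", "content": "You are an AI assistant."},
            {"role": "user", "content": "Explain llm models"},
        ],
        "sampling_params": {
            "max_tokens": 3000,  # this is the field that controls output length
            "temperature": 0.7,
        },
    }
}

resp = requests.post(
    f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
    json=payload,
    timeout=120,
)
print(resp.json())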
Jason · 4d ago
GitHub - runpod-workers/worker-vllm: The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
