Response is always 16 tokens.
Hello. I'm new to RunPod and tried following the docs for running Google's Gemma 7B model on Serverless. After the endpoint was successfully set up, I did a test request in the RunPod dashboard and noticed the response is always 16 tokens. I tested locally with Postman and used a variety of prompts, but I always get a truncated 16 tokens back.
I also tried Llama 3.1 8B Instruct using the vLLM template and made sure to set the max sequence length to something high (like 6k), but I still only get 16 tokens back.
I've also tried setting max_tokens directly in the request. I'm not sure what I'm doing wrong.
For the curious, it was the max_tokens parameter, which I suspected but didn't know how to remedy. It turns out the proper way to set max_tokens in the JSON body of the request is inside a sampling_params dictionary, not as a sibling of prompt. So instead of putting max_tokens at the top level of the input next to prompt, it needs to be nested inside sampling_params, which I couldn't figure out until I found the JobInput class.
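As a concrete sketch of the difference, here are the two request-body shapes as Python dicts (the prompt text and the 500-token limit are placeholder values, and this assumes the standard RunPod vLLM worker input schema):

```python
# Body that kept returning 16 tokens: max_tokens sits next to prompt,
# so the worker never picks it up and falls back to its default length.
wrong_body = {
    "input": {
        "prompt": "Why is the sky blue?",
        "max_tokens": 500,  # ignored at this level
    }
}

# Working body: max_tokens is nested inside sampling_params,
# which is where the JobInput class looks for it.
right_body = {
    "input": {
        "prompt": "Why is the sky blue?",
        "sampling_params": {
            "max_tokens": 500,
        },
    }
}
```

You would then POST right_body as JSON to the endpoint's runsync/run URL (endpoint ID and API key are your own); only the nested max_tokens changes the response length.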