Created by lostdev on 9/1/2024 in #⚡|serverless
Response is always 16 tokens.
Hello. I'm new to cloud and tried following the docs for running Google's Gemma 7B model on serverless. After the endpoint was successfully set up, I did a test request from the RunPod dashboard and noticed the response is always 16 tokens. I also tested locally with Postman, using a variety of prompts, but I always get a truncated 16 tokens back.
I also tried Llama 3.1 8B Instruct with the vLLM template, and made sure to set the max sequence length to something high (like 6k), but I still only get 16 tokens back.
I've also tried setting the max_tokens directly in the request. I'm not sure what I'm doing wrong.
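For reference, here is roughly the request body I'm sending. This is a sketch, assuming the vLLM worker expects sampling options nested under `sampling_params` (vLLM's own default for `max_tokens` is 16, which would explain the truncation if a top-level `max_tokens` is being ignored). The endpoint ID and prompt are placeholders.

```python
import json

# Sketch of the request body for the endpoint's /runsync route.
# Assumption: the vLLM worker reads sampling options from "sampling_params";
# a max_tokens set at the top level of "input" may be silently ignored,
# leaving vLLM's default of 16 in effect.
ENDPOINT_URL = "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync"  # placeholder ID

payload = {
    "input": {
        "prompt": "Explain what a serverless endpoint is in a few sentences.",
        "sampling_params": {
            "max_tokens": 512,   # raise vLLM's 16-token default
            "temperature": 0.7,
        },
    }
}

print(json.dumps(payload, indent=2))
```

If the truncation comes from the default `max_tokens`, moving the field under `sampling_params` as above should change the response length.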