RunPod2mo ago
Nelson

Serverless SGLang - 128 max token limit problem.

I'm trying to use the subject template. I always hit the same problem: the number of tokens in the answer is limited to 128, and I don't know how to change the configuration. I've tried Llama 3.2 3B and Mistral 7B, and the same problem happens with both. I've tried setting the following environment variables to values higher than 128, with no luck:

CONTEXT_LENGTH, MAX_TOTAL_TOKENS, MAX_PREFILL_TOKENS, CHUNKED_PREFILL_SIZE, STREAM_INTERVAL, BLOCK_SIZE, MAX_TOKENS, COMPLETION_MAX_TOKENS, MAX_OUTPUT_TOKENS, OUTPUT_TOKENS, LLAMA_MAX_OUTPUT_TOKENS, MAX_LENGTH, COMPLETION_TOKENS, COMPLETION_TOKENS_WO_JUMP_FORWARD, LENGTH

Request:

{
  "input": {
    "text": "Give me list of the US States names."
  }
}

Answer:

{
  "delayTime": 721,
  "executionTime": 3503,
  "id": "fa5e8637-5636-4e74-a1ad-de63f8b20301-u1",
  "output": [
    {
      "meta_info": {
        "cached_tokens": 1,
        "completion_tokens": 128,
        "completion_tokens_wo_jump_forward": 128,
        "finish_reason": {
          "length": 128,
          "type": "length"
        },
        "id": "c8ab53d4847a4e8687dfbe9abbefd90c",
        "prompt_tokens": 10
      },
      "text": "\n\nWhy would anyone want a random list of all 50 states names?\n\nYou may want a randomized list as an example of various techniques you can use:\n\n- a list of sets of randomly selected items that the “random list” has in common with the list from which the “random list” was generated (I have collected some of mine in the articles How Many Holes in Swiss Cheese and How Close You Are to Finding a Unicorn, but not exclusively used in the same way)\n- a list of sets of randomly selected items that the “random list” has in common with"
    }
  ],
  "status": "COMPLETED",
  "workerId": "0ifvvn3bcyfzii"
}

Any suggestion?
10 Replies
nerdylive
nerdylive2mo ago
Maybe you should use the OpenAI endpoint API, like the one in vLLM: same URL format, using the endpoint ID too.
Nelson
NelsonOP2mo ago
The same thing happens, but in this case the maximum is 16 every time 😦

REQUEST

{
  "input": {
    "prompt": "Write a poem about nature."
  }
}

OUTPUT

{
  "delayTime": 432,
  "executionTime": 299,
  "id": "4bac6d1c-6f5b-454a-a2a0-419cf6a6ecbf-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            " \nThe sun shines bright in the morning sky,\nA fiery hue, that catches"
          ]
        }
      ],
      "usage": {
        "input": 7,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "gzgn2o5m506yzj"
}
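The 16-token cap is consistent with vLLM's SamplingParams default of max_tokens = 16, which applies when a request sends no sampling parameters. A minimal sketch of overriding it per request, assuming the vLLM worker accepts a sampling_params object in the input (the endpoint ID and API key are placeholders):

# Sketch only: raise the vLLM worker's output cap by passing
# max_tokens inside sampling_params rather than as a top-level key.
# <ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {
            "prompt": "Write a poem about nature.",
            "sampling_params": {
                "max_tokens": 500,   # vLLM's default is 16
                "temperature": 0.7,
            },
        }
    },
    timeout=120,
)
print(resp.json())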
zilli
zilli2mo ago
I don't know how to change the configuration. I've tried setting the following environment variables to values higher than 128, with no luck ...
In the configuration, did you click the save button at the bottom after changing the value? If you did, did you also remove your serverless workers afterward, and allow them to initialize again with your new settings?
nerdylive
nerdylive2mo ago
Use the openai library, if it's possible. Check the RunPod docs on vLLM for how to use the openai library, but set the right endpoint ID there.
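A minimal sketch of that suggestion with the openai Python library, assuming the OpenAI-compatible base URL format from the RunPod docs (the endpoint ID, API key, and model name are placeholders):

# Sketch only: the RunPod API key doubles as the OpenAI api_key,
# and <ENDPOINT_ID> is a placeholder for your serverless endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="<RUNPOD_API_KEY>",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    max_tokens=500,  # honored here, unlike a top-level key on /run
)
print(response.choices[0].message.content)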
Nelson
NelsonOP2mo ago
Thanks for your answers. Unfortunately I tested all the recommendations, including the use of the openai library, with no luck ... here is an example of what I'm sending and the 16 tokens I get back from the vLLM endpoint:

REQUEST

{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "What is AI?"
      }
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }
}

ANSWER

{
  "delayTime": 406,
  "executionTime": 393,
  "id": "14de16db-8b3d-444e-9358-9d4a001c61b9-u1",
  "output": [
    {
      "choices": [
        {
          "tokens": [
            "Artificial Intelligence (AI) refers to the development of computer systems that can perform"
          ]
        }
      ],
      "usage": {
        "input": 40,
        "output": 16
      }
    }
  ],
  "status": "COMPLETED",
  "workerId": "hn4gcdunggpaoc"
}

As I said at the beginning, I still get 128 tokens from the sglang endpoint and 16 from the vllm one ....
nerdylive
nerdylive2mo ago
Oh okay, hmm, that's weird. What model are you using?
Btw, this request and response seem to be using RunPod's /run endpoint, not the OpenAI-compatible endpoint: https://discord.com/channels/912829806415085598/1279829584749138109/1280021380326494208
Ok nice, got the solution hahah
Nelson
NelsonOP2mo ago
I'm using this model https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct but I've also tried Mistral and got the same results. And you are right: I tried to use the OpenAI template, but I was still using this URL:

POST https://api.runpod.ai/v2/<MODEL_ID>/run

The documentation tells me to use this base URL instead:

base_url="https://api.runpod.ai/v2/<YOUR ENDPOINT ID>/openai/v1"

but that didn't work ...
nerdylive
nerdylive2mo ago
This
Nelson
NelsonOP2mo ago
Many thanks for your support!!! You are a genius, I finally got it working with the following:

REQUEST

{
  "input": {
    "text": "Tell me a story about three bananas who solve the case of the missing hamburger.",
    "sampling_params": {
      "max_new_tokens": 5000,
      "temperature": 0
    }
  }
}

I'm using sglang now. Thanks again.
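In code, the working request corresponds to a /runsync call along these lines (a sketch; <ENDPOINT_ID> and <RUNPOD_API_KEY> are placeholders). The key point is that SGLang's output cap is the per-request sampling_params.max_new_tokens, whose default of 128 explains the original truncation; the environment variables tried earlier configure the server, not the per-request output length:

# Sketch only: the SGLang worker's output cap is set per request via
# sampling_params.max_new_tokens (default 128), not via env vars.
import requests

resp = requests.post(
    "https://api.runpod.ai/v2/<ENDPOINT_ID>/runsync",
    headers={"Authorization": "Bearer <RUNPOD_API_KEY>"},
    json={
        "input": {
            "text": "Tell me a story about three bananas who solve "
                    "the case of the missing hamburger.",
            "sampling_params": {
                "max_new_tokens": 5000,  # raises the 128-token default
                "temperature": 0,
            },
        }
    },
    timeout=300,
)
print(resp.json()["output"])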
nerdylive
nerdylive2mo ago
Happy you got it finally working hahah. You're welcome.
