Bryan
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
But my understanding is that you can just fit 8x22B on 4x80GB with 8-bit quantization
126 replies
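A minimal sketch (not from the thread) of what that setup might look like with vLLM's offline API, assuming a recent vLLM build with FP8 weight quantization and tensor parallelism across the four GPUs. The model name is the public Hugging Face checkpoint; whether it actually fits also depends on context length and KV-cache size:

```python
# Sketch: Mixtral 8x22B sharded over 4 GPUs with 8-bit (FP8) weights in vLLM.
# "fp8" quantization is an assumption -- it needs a recent vLLM release and
# supported hardware; older builds would need a pre-quantized checkpoint instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=4,   # shard weights across 4 x 80GB GPUs
    quantization="fp8",       # ~1 byte per parameter for the weights
    max_model_len=8192,       # smaller context keeps the KV cache manageable
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```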
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
I could be mistaken around this, I'm not an expert on this for sure
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
At 8-bit (1 byte per parameter) it's still 176 GB
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
8x22B = 176B parameters. At 16-bit (2 bytes per parameter), that's 352 GB just for the model parameters
126 replies
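A quick worked version of the arithmetic in the two messages above, using decimal GB for simplicity. (Strictly speaking, Mixtral 8x22B has roughly 141B total parameters because the experts share the attention layers, but the naive 8 x 22B = 176B figure is kept here to match the thread.)

```python
# Weight memory only -- KV cache and activations come on top of this.
params = 176e9           # 8 x 22B, as quoted above
vram   = 4 * 80e9        # 4 x 80 GB GPUs

for name, bytes_per_param in [("fp16/bf16", 2), ("int8/fp8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    verdict = "fits" if weights_gb < vram / 1e9 else "does not fit"
    print(f"{name}: {weights_gb:.0f} GB for weights -> {verdict} in {vram / 1e9:.0f} GB")

# fp16/bf16: 352 GB for weights -> does not fit in 320 GB
# int8/fp8:  176 GB for weights -> fits in 320 GB (before KV cache)
```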
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
In which case ignore me 🙂
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
Unless you're doing a pod instead of serverless
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
No description
126 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Onto a new hurdle now 😄 Now I'm getting blank responses in my inference client when calling a Llama 3 70B model on one of my endpoints. It works when I send curl requests. Too tired to troubleshoot tonight; this is tomorrow's problem to solve.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Hey thanks, yeah I did end up figuring that one out a few hours ago -- stable actually has it too now, so either will work if you create a new endpoint manually, but the vLLM "quick start" option in the serverless endpoint setup seems to ship the older version of vLLM.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Right now I'm running into a different issue though -- vLLM does not stop generating on Llama 3 until it hits the max_tokens limit. I know this was a known issue with vLLM that got corrected in a recent release; not sure if RunPod serverless is still using an older version.
20 replies
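If the endpoint really is pinned to an older vLLM, one common workaround (an assumption here, not something confirmed in the thread) is to pass Llama 3's end-of-turn marker as an explicit stop string so generation ends before max_tokens. The endpoint ID, API key, and the OpenAI-compatible route below are placeholders:

```python
# Hypothetical request against a RunPod serverless vLLM endpoint via its
# OpenAI-compatible route, with Llama 3's end-of-turn token as a stop string.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder
    api_key="RUNPOD_API_KEY",                                   # placeholder
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=256,
    stop=["<|eot_id|>"],   # Llama 3 end-of-turn marker
)
print(resp.choices[0].message.content)
```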
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
@nerdylive @Papa Madiator Fixed my HttpClient issue and tried out guided_json with RunPod serverless -- GOOD NEWS: it works! I will still open an issue on the vllm_worker GitHub, if only to recommend that they update the documentation to note that it DOES support guided output constraints
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Yeah it's super useful honestly, though it can be a bit buggy with regex/schema incompatibilities. Some regex features are not fully supported and will crash vLLM, so you have to experiment and find patterns that work. But when they work, it works very well.
I did try it, but I'm running into a different issue where I'm getting a 404 when trying to request completions... I think it's my HttpClient though, not RunPod's fault. I'm not using the standard OpenAI library (because I added some other features) and that might be biting me in the butt at the moment 😅
20 replies
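For the regex-constrained variant mentioned above, a simple pattern is the safest bet: the guided-decoding backend does not support every regex feature, and unsupported patterns can error out rather than degrade gracefully. A minimal sketch, with endpoint URL, API key, and model as placeholders:

```python
# guided_regex constrains the completion to match the given pattern.
from openai import OpenAI

client = OpenAI(base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
                api_key="RUNPOD_API_KEY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Classify the severity of this bug report: ..."}],
    max_tokens=8,
    extra_body={"guided_regex": r"(low|medium|high)"},  # vLLM-specific extra parameter
)
print(resp.choices[0].message.content)  # one of: low / medium / high
```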
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
This is on a whole other level from that -- 7B models tend to screw up the output schema even if you provide them with a bunch of examples. Using this method makes it nearly impossible for even small LLMs to output invalid JSON.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Side note: if you're not using this feature, I highly encourage you to check it out. Very useful for getting better LLM output.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Ah yeah that's a good idea, I can do that as well. In the meantime, here's the vLLM documentation for this feature (it's in the "extra" parameters): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
This is another decoder which is also packaged by default within vLLM
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Hey thanks for responding. So, on a self-hosted vLLM instance you can add a "guided_json" or "guided_regex" parameter to the payload, which constrains the LLM's output so it only generates tokens that satisfy the constraint. This lets you get valid JSON in the exact format you request, 100% of the time, even from small models.
I did check the documentation and haven't seen that this parameter is supported on the RunPod version of vLLM (as far as I can tell; I may have missed it).
20 replies
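To make the described usage concrete, here is a minimal guided_json sketch against the OpenAI-compatible route of a RunPod serverless vLLM endpoint; the endpoint ID, API key, model name, and example schema are all placeholders, not values from the thread:

```python
# guided_json forces the completion to conform to the given JSON Schema,
# so the response should always parse into the requested structure.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
                api_key="RUNPOD_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
    },
    "required": ["name", "sentiment"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Extract the product name and sentiment from: ..."}],
    max_tokens=128,
    extra_body={"guided_json": schema},   # vLLM-specific extra parameter
)
print(json.loads(resp.choices[0].message.content))
```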