Bryan
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
But my understanding is that you can just fit 8x22B on 4x80GB with 8-bit quantization
126 replies
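A minimal sketch (not from the thread) of what that setup might look like with vLLM's offline API, assuming a recent vLLM build with FP8 weight quantization and tensor parallelism across the four GPUs. The model name is the public Hugging Face checkpoint; whether it actually fits also depends on context length and KV-cache size:

```python
# Sketch: Mixtral 8x22B sharded over 4 GPUs with 8-bit (FP8) weights in vLLM.
# "fp8" quantization is an assumption -- it needs a recent vLLM release and
# supported hardware; older builds would need a pre-quantized checkpoint instead.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mixtral-8x22B-Instruct-v0.1",
    tensor_parallel_size=4,   # shard weights across 4 x 80GB GPUs
    quantization="fp8",       # ~1 byte per parameter for the weights
    max_model_len=8192,       # smaller context keeps the KV cache manageable
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```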
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
I could be mistaken around this, I'm not an expert on this for sure
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
At 8-bit (1 byte per parameter) it's still 176 GB
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
8x22B = 176B parameters. At 16-bit (2 bytes per parameter), that's 352 GB just for the model parameters
126 replies
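A quick worked version of the arithmetic in the two messages above, using decimal GB for simplicity. (Strictly speaking, Mixtral 8x22B has roughly 141B total parameters because the experts share the attention layers, but the naive 8 x 22B = 176B figure is kept here to match the thread.)

```python
# Weight memory only -- KV cache and activations come on top of this.
params = 176e9           # 8 x 22B, as quoted above
vram   = 4 * 80e9        # 4 x 80 GB GPUs

for name, bytes_per_param in [("fp16/bf16", 2), ("int8/fp8", 1)]:
    weights_gb = params * bytes_per_param / 1e9
    verdict = "fits" if weights_gb < vram / 1e9 else "does not fit"
    print(f"{name}: {weights_gb:.0f} GB for weights -> {verdict} in {vram / 1e9:.0f} GB")

# fp16/bf16: 352 GB for weights -> does not fit in 320 GB
# int8/fp8:  176 GB for weights -> fits in 320 GB (before KV cache)
```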
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
In which case ignore me 🙂
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
Unless you're doing a pod instead of serverless
126 replies
RunPod
Created by md on 5/12/2024 in #⚡|serverless
Run Mixtral 8x22B Instruct on vLLM worker
No description
126 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Onto a new hurdle now 😄 Now I'm getting blank responses in my inference client when calling a Llama 3 70B model on one of my endpoints. It works when I send curl requests. Too tired to troubleshoot tonight; this is tomorrow's problem to solve.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Hey thanks, yeah I did end up figuring that one out a few hours ago -- stable actually has it too now, so either will work if you create a new endpoint manually, but the vLLM "quick start" option in the serverless endpoint setup seems to ship the older version of vLLM.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Right now I'm running into a different issue though -- vLLM does not stop generating on Llama 3 until it hits the max_tokens limit. I know this was a known issue with vLLM that got corrected in a recent release; not sure if RunPod serverless is still using an older version.
20 replies
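If the endpoint really is pinned to an older vLLM, one common workaround (an assumption here, not something confirmed in the thread) is to pass Llama 3's end-of-turn marker as an explicit stop string so generation ends before max_tokens. The endpoint ID, API key, and the OpenAI-compatible route below are placeholders:

```python
# Hypothetical request against a RunPod serverless vLLM endpoint via its
# OpenAI-compatible route, with Llama 3's end-of-turn token as a stop string.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",  # placeholder
    api_key="RUNPOD_API_KEY",                                   # placeholder
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Say hi in one sentence."}],
    max_tokens=256,
    stop=["<|eot_id|>"],   # Llama 3 end-of-turn marker
)
print(resp.choices[0].message.content)
```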
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
@nerdylive @Papa Madiator Fixed my HttpClient issue and tried out guided_json with RunPod serverless -- GOOD NEWS: it works! I will still open an issue on the vllm_worker GitHub, if only to recommend that they update the documentation to note that it DOES support guided output constraints
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Yeah it's super useful honestly, though it can be a bit buggy with regex/schema incompatibilities. Some regex features are not fully supported and will crash vLLM, so you have to experiment and find patterns that work. But when they work, it works very well.
I did try it, but I'm running into a different issue where I'm getting a 404 when trying to request completions... I think it's my HttpClient though, not RunPod's fault. I'm not using the standard OpenAI library (because I added some other features) and that might be biting me in the butt at the moment 😅
20 replies
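For the regex-constrained variant mentioned above, a simple pattern is the safest bet: the guided-decoding backend does not support every regex feature, and unsupported patterns can error out rather than degrade gracefully. A minimal sketch, with endpoint URL, API key, and model as placeholders:

```python
# guided_regex constrains the completion to match the given pattern.
from openai import OpenAI

client = OpenAI(base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
                api_key="RUNPOD_API_KEY")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Classify the severity of this bug report: ..."}],
    max_tokens=8,
    extra_body={"guided_regex": r"(low|medium|high)"},  # vLLM-specific extra parameter
)
print(resp.choices[0].message.content)  # one of: low / medium / high
```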
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
This is on a whole other level from that -- 7B models tend to screw up the output schema even if you provide them with a bunch of examples. Using this method makes it nearly impossible for even small LLMs to output invalid JSON.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Side note: if you're not using this feature, I highly encourage you to check it out. Very useful for getting better LLM output.
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Ah yeah that's a good idea, I can do that as well. In the meantime, here's the vLLM documentation for this feature (it's in the "extra" parameters): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
This is another decoder which is also packaged by default within vLLM
20 replies
RunPod
Created by Bryan on 5/11/2024 in #⚡|serverless
Output guidance with vLLM Host on RunPod
Hey thanks for responding. So, on a self-hosted vLLM instance you can add a "guided_json" or "guided_regex" parameter to the payload, which constrains the LLM's output so it only generates tokens that satisfy the constraint. This lets you get valid JSON in the exact format you request, 100% of the time, even from small models.
I did check the documentation and haven't seen that this parameter is supported on the RunPod version of vLLM (as far as I can tell; I may have missed it).
20 replies
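To make the described usage concrete, here is a minimal guided_json sketch against the OpenAI-compatible route of a RunPod serverless vLLM endpoint; the endpoint ID, API key, model name, and example schema are all placeholders, not values from the thread:

```python
# guided_json forces the completion to conform to the given JSON Schema,
# so the response should always parse into the requested structure.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
                api_key="RUNPOD_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
    },
    "required": ["name", "sentiment"],
}

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Extract the product name and sentiment from: ..."}],
    max_tokens=128,
    extra_body={"guided_json": schema},   # vLLM-specific extra parameter
)
print(json.loads(resp.choices[0].message.content))
```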