RunPod•2mo ago
Bryan

Output guidance with vLLM Host on RunPod

Greetings! I've been using vLLM on my homelab servers for a while and I'm looking to add the ability to scale my application using RunPod. On my locally hosted vLLM instances, I use output guidance via the "outlines" guided decoder to constrain LLM output to specified JSON schemas or regexes. One question I haven't been able to find an answer to: does RunPod support this functionality with serverless vLLM hosting through the OpenAI API? (I assume it supports it with pods if you set up your own instance of vLLM.) It's looking like the answer is no, but I'm hopeful the answer is "yes", as I'd really like to take advantage of the benefits of serverless hosting AND guided output. Appreciate any help or insight you can provide. Thanks in advance, cheers.
Madiator2011•2mo ago
Not sure if I got the question right, but the vLLM worker exposes the OpenAI API
Madiator2011•2mo ago
Sorry, I'm not using vLLM much myself, but have you tried looking at the worker docs on GitHub? https://github.com/runpod-workers/worker-vllm
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Bryan•2mo ago
Hey, thanks for responding. So, on a self-hosted vLLM instance you can add a "guided_json" or "guided_regex" parameter to the payload, which constrains the output of the LLM to only generate tokens that match the constraints. This allows you to get valid JSON in the exact format you request, consistently, 100% of the time, even from small models.
I did check the documentation and have not seen that this parameter is supported on the RunPod version of vLLM (as far as I can tell; I may have missed it).
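To illustrate, here's a minimal sketch of what that kind of request looks like against a self-hosted vLLM OpenAI-compatible server (the URL, model name, and schema are just placeholders):

```python
# Minimal sketch of a guided-decoding request against a self-hosted vLLM
# OpenAI-compatible server. URL, model name, and schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# JSON Schema the model output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Extract: Bryan is 34 years old."}],
    # vLLM-specific extra parameter; the standard OpenAI API does not define it.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)  # valid JSON matching the schema
```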
Bryan•2mo ago
GitHub
GitHub - noamgat/lm-format-enforcer: Enforce the output format (JSO...
Enforce the output format (JSON Schema, Regex etc) of a language model - noamgat/lm-format-enforcer
Bryan•2mo ago
This is another guided decoder, which is also packaged by default within vLLM
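If the vLLM version behind the server is new enough, the backend can (as I understand it) even be picked per request via the extra parameters. A rough sketch; the URL, model name, and regex are placeholders, and whether the parameter is accepted depends on the vLLM release:

```python
# Rough sketch: choosing the guided-decoding backend per request.
# "guided_decoding_backend" is a vLLM extra parameter; acceptance depends on
# the vLLM version behind the server. URL and model name are placeholders.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Give me a 5-digit US zip code."}],
    "guided_regex": r"\d{5}",
    "guided_decoding_backend": "lm-format-enforcer",  # or "outlines"
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```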
Madiator2011•2mo ago
Hmm, can't find any info about that, but you can always submit an issue on the worker's GitHub page
Bryan•2mo ago
Ah yeah, that's a good idea, I can do that as well. In the meantime, here's the documentation in the vLLM docs about this feature (it's in the "extra" parameters): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api Side note: if you're not using this feature, I highly encourage you to check it out. Very useful for getting better LLM output.
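As another quick illustration of those "extra" parameters (server URL and model name are placeholders here), guided_choice limits the output to one of a fixed set of strings:

```python
# Sketch of the "guided_choice" extra parameter from the linked vLLM docs:
# it constrains the output to one of a fixed list of strings.
# Server URL and model name are placeholders for a self-hosted instance.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Is this review positive or negative? 'Great product!'"}
    ],
    "guided_choice": ["positive", "negative"],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])  # "positive" or "negative"
```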
Madiator2011•2mo ago
you could try sending a request like in the examples
Bryan•2mo ago
This is on a whole other level from that: 7B models tend to screw up the output schema even if you provide them with a bunch of examples. Using this method makes it nearly impossible for even small LLMs to output invalid JSON.
nerdylive•2mo ago
Wow, didn't know that exists, it's cool. Have you tried deploying on the vLLM worker and using that JSON output thing?
Bryan•2mo ago
Yeah, it's super useful honestly, though it can be a bit buggy with regex/schema incompatibilities. Some regex features are not fully supported and will crash vLLM, so you have to experiment and find things that work. But when they work, it works very well.
I did try it, but I'm running into a different issue where I'm getting a 404 when trying to request completions... I think it's my HttpClient though, not RunPod's fault. I'm not using the standard OpenAI library (because I added some other features) and that might be biting me in the butt at the moment 😅
@nerdylive @Papa Madiator Fixed my HttpClient issue and tried out guided_json with RunPod serverless -- GOOD NEWS: it works! I will still open an issue on the vllm_worker GitHub, if only to recommend that they update the documentation to note that it DOES support guided output constraints.
Right now I'm running into a different issue though: vLLM does not stop generating on Llama 3 until it hits the max_token limit. I know this was a known issue with vLLM which got corrected in a recent release; not sure if RunPod serverless is still using an older version.
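For reference, a rough sketch of what the working serverless call looks like on my end, assuming the worker's OpenAI-compatible route (the /openai/v1 base URL format, endpoint ID, and model name are placeholders; check the worker-vllm README for the exact form):

```python
# Rough sketch of a guided_json call against a RunPod serverless vLLM endpoint,
# assuming the worker's OpenAI-compatible route. The base URL format, endpoint
# ID, API key, and model name below are placeholders/assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Answer in JSON: what color is the sky?"}],
    max_tokens=128,  # a cap still helps while the Llama 3 stop-token issue is around
    extra_body={"guided_json": schema},  # vLLM extra parameter, not standard OpenAI
)
print(response.choices[0].message.content)
```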
Alpay Ariyak•2mo ago
Instead of stable, try the dev image. It has the latest vLLM version
Bryan•2mo ago
Hey thanks, yeah I did end up figuring that one out a few hours ago -- stable actually has it too now, so either will work if you create a new endpoint manually, but the vLLM "quick start" option in the serverless endpoint setup seems to have the older version of vLLM.
Onto a new hurdle now 😄 Now I'm getting blank responses in my inference client when calling a Llama 3 70B model on one of my endpoints. It works when I send curl requests. Too tired to troubleshoot tonight; this is tomorrow's problem to solve.