RunPod•2mo ago
Bryan

Output guidance with vLLM Host on RunPod

Greetings! I've been using vLLM on my homelab servers for a while and I'm looking to add the ability to scale my application using RunPod. On my locally hosted vLLM instances, I use output guidance via the "outlines" guided decoder to constrain LLM output to specified JSON schemas or regexes. One question I haven't been able to find an answer to: does RunPod support this functionality with serverless vLLM hosting through the OpenAI API? (I assume it supports it with pods if you set up your own instance of vLLM.) It's looking like the answer is no, but I'm hopeful the answer is "yes", as I'd really like to take advantage of the benefits of serverless hosting AND guided output. Appreciate any help or insight you can provide. Thanks in advance, cheers.
Madiator2011•2mo ago
Not sure if I got the question right, but the vLLM worker exposes the OpenAI API
Madiator2011•2mo ago
Sorry, I'm not using vLLM much myself, but have you tried looking at the worker docs on GitHub? https://github.com/runpod-workers/worker-vllm
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Bryan•2mo ago
Hey, thanks for responding. So, on a self-hosted vLLM instance you can add a "guided_json" or "guided_regex" parameter to the payload, which constrains the output of the LLM to only generate tokens that match the constraints. This allows you to get valid JSON in the exact format you request, consistently, 100% of the time, even from small models.
I did check the documentation and have not seen that this parameter is supported on the RunPod version of vLLM (as far as I can tell; I may have missed it).
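To illustrate, here's a minimal sketch of what that kind of request looks like against a self-hosted vLLM OpenAI-compatible server (the URL, model name, and schema are just placeholders):

```python
# Minimal sketch of a guided-decoding request against a self-hosted vLLM
# OpenAI-compatible server. URL, model name, and schema are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# JSON Schema the model output must conform to.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Extract: Bryan is 34 years old."}],
    # vLLM-specific extra parameter; the standard OpenAI API does not define it.
    extra_body={"guided_json": schema},
)
print(response.choices[0].message.content)  # valid JSON matching the schema
```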
Bryan•2mo ago
GitHub
GitHub - noamgat/lm-format-enforcer: Enforce the output format (JSO...
Enforce the output format (JSON Schema, Regex etc) of a language model - noamgat/lm-format-enforcer
Bryan•2mo ago
This is another guided decoder, which is also packaged by default within vLLM
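If the vLLM version behind the server is new enough, the backend can (as I understand it) even be picked per request via the extra parameters. A rough sketch; the URL, model name, and regex are placeholders, and whether the parameter is accepted depends on the vLLM release:

```python
# Rough sketch: choosing the guided-decoding backend per request.
# "guided_decoding_backend" is a vLLM extra parameter; acceptance depends on
# the vLLM version behind the server. URL and model name are placeholders.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Give me a 5-digit US zip code."}],
    "guided_regex": r"\d{5}",
    "guided_decoding_backend": "lm-format-enforcer",  # or "outlines"
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```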
Madiator2011•2mo ago
Hmm, can't find any info about that, but you can always submit an issue on the worker's GitHub page
Bryan•2mo ago
Ah yeah, that's a good idea, I can do that as well. In the meantime, here's the documentation in the vLLM docs about this feature (it's in the "extra" parameters): https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#extra-parameters-for-chat-api Side note: if you're not using this feature, I highly encourage you to check it out. Very useful for getting better LLM output.
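As another quick illustration of those "extra" parameters (server URL and model name are placeholders here), guided_choice limits the output to one of a fixed set of strings:

```python
# Sketch of the "guided_choice" extra parameter from the linked vLLM docs:
# it constrains the output to one of a fixed list of strings.
# Server URL and model name are placeholders for a self-hosted instance.
import requests

payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Is this review positive or negative? 'Great product!'"}
    ],
    "guided_choice": ["positive", "negative"],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])  # "positive" or "negative"
```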
Madiator2011•2mo ago
you could try sending a request like in the examples
Bryan•2mo ago
This is on a whole other level from that: 7B models tend to screw up the output schema even if you provide them with a bunch of examples. Using this method makes it nearly impossible for even small LLMs to output invalid JSON.
nerdylive•2mo ago
Wow, didn't know that exists, it's cool. Have you tried deploying on the vLLM worker and using that JSON output thing?
Bryan•2mo ago
Yeah, it's super useful honestly, though it can be a bit buggy with regex/schema incompatibilities. Some regex features are not fully supported and will crash vLLM, so you have to experiment and find things that work. But when they work, it works very well.
I did try it, but I'm running into a different issue where I'm getting a 404 when trying to request completions... I think it's my HttpClient though, not RunPod's fault. I'm not using the standard OpenAI library (because I added some other features) and that might be biting me in the butt at the moment 😅
@nerdylive @Papa Madiator Fixed my HttpClient issue and tried out guided_json with RunPod serverless -- GOOD NEWS: it works! I will still open an issue on the vllm_worker GitHub, if only to recommend that they update the documentation to note that it DOES support guided output constraints.
Right now I'm running into a different issue though: vLLM does not stop generating on Llama 3 until it hits the max_token limit. I know this was a known issue with vLLM which got corrected in a recent release; not sure if RunPod serverless is still using an older version.
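For reference, a rough sketch of what the working serverless call looks like on my end, assuming the worker's OpenAI-compatible route (the /openai/v1 base URL format, endpoint ID, and model name are placeholders; check the worker-vllm README for the exact form):

```python
# Rough sketch of a guided_json call against a RunPod serverless vLLM endpoint,
# assuming the worker's OpenAI-compatible route. The base URL format, endpoint
# ID, API key, and model name below are placeholders/assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
    api_key="<RUNPOD_API_KEY>",
)

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Answer in JSON: what color is the sky?"}],
    max_tokens=128,  # a cap still helps while the Llama 3 stop-token issue is around
    extra_body={"guided_json": schema},  # vLLM extra parameter, not standard OpenAI
)
print(response.choices[0].message.content)
```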
Alpay Ariyak•2mo ago
Instead of stable, try the dev image. It has the latest vLLM version
Bryan•2mo ago
Hey thanks, yeah I did end up figuring that one out a few hours ago -- stable actually has it too now, so either will work if you create a new endpoint manually, but the vLLM "quick start" option in the serverless endpoint setup seems to have the older version of vLLM.
Onto a new hurdle now 😄 Now I'm getting blank responses in my inference client when calling a Llama 3 70B model on one of my endpoints. It works when I send curl requests. Too tired to troubleshoot tonight; this is tomorrow's problem to solve.