Update worker-vllm to vLLM 0.5.0

vLLM just got bumped to 0.5.0, with significant features ready for production. @Alpay Ariyak FP8 is very significant, but so are speculative decoding and prefix caching. From the release notes:
- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference speed gets roughly a 1.5x boost.
- OpenAI Vision API support was added. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative decoding and automatic prefix caching are also ready for testing, with plans to turn them on by default in upcoming releases.
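For anyone who wants to try these before worker-vllm catches up, here is a minimal sketch of what the new options look like in vLLM 0.5.0's offline Python API. The model name is a placeholder, and whether FP8, prefix caching, and speculative decoding can all be combined in a single engine is an assumption on my part, not something the release notes promise:

```python
# Rough sketch, assuming vLLM 0.5.0 and a GPU with access to the placeholder model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model name
    quantization="fp8",            # FP8 weight quantization (experimental in 0.5.0)
    enable_prefix_caching=True,    # automatic prefix caching for shared prompt prefixes
    # Speculative decoding (also experimental) would be enabled along these lines,
    # with a small draft model that shares the main model's tokenizer:
    # speculative_model="<small-draft-model>",
    # num_speculative_tokens=5,
    # use_v2_block_manager=True,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
out = llm.generate(["Summarize what FP8 quantization does."], params)
print(out[0].outputs[0].text)
```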
Solution
Alpay Ariyak · 4w ago
For sure, already in progress!
Casper. · 4w ago
Nice to hear it's already in progress! Let me know when it's ready; I'd love to test it out.