Update worker-vllm to vLLM 0.5.0
vLLM just got bumped to 0.5.0, with significant features ready for production. @Alpay Ariyak
FP8 is very significant, but so are speculative decoding and prefix caching.
- FP8 support is ready for testing. By quantizing a portion of the model weights to 8-bit floating point, inference gets roughly a 1.5x speed boost.
- Added OpenAI Vision API support. Currently only LLaVA and LLaVA-NeXT are supported.
- Speculative Decoding and Automatic Prefix Caching are also ready for testing; the plan is to turn them on by default in upcoming releases. Rough sketches of trying these out are below.
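
A minimal sketch (not the worker-vllm integration itself) of what turning on FP8, automatic prefix caching, and speculative decoding looks like with the vLLM 0.5.0 offline Python API. The model names, draft model, and token counts are placeholder assumptions:

```python
# Sketch of the new 0.5.0 engine options; model names and numbers are
# placeholders, not worker-vllm defaults.
from vllm import LLM, SamplingParams

# FP8 weight quantization plus automatic prefix caching.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    quantization="fp8",          # experimental in 0.5.0, needs a recent NVIDIA GPU
    enable_prefix_caching=True,  # automatic prefix caching
)

# Speculative decoding with a small draft model (a separate engine instance
# here just to keep the example simple).
spec_llm = LLM(
    model="facebook/opt-6.7b",               # placeholder target model
    speculative_model="facebook/opt-125m",   # placeholder draft model
    num_speculative_tokens=5,
    use_v2_block_manager=True,               # needed for speculative decoding in 0.5.x
)

params = SamplingParams(temperature=0.0, max_tokens=64)
print(llm.generate(["Hello, my name is"], params)[0].outputs[0].text)
```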
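And a rough sketch of an OpenAI Vision API request against a vLLM OpenAI-compatible server already running a LLaVA model; the base URL, model name, and image URL are placeholders, and the server itself needs the LLaVA-specific image flags and chat template from the vLLM docs:

```python
# Sketch of an OpenAI-style vision request; assumes a vLLM server is already
# serving a LLaVA checkpoint at the placeholder base_url below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # placeholder LLaVA checkpoint
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/example.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```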
Solution
For sure, already in progress!
Nice to hear it's already in progress! Let me know when it's ready, I'd love to test it out.