RunPod
Created by jd24 on 4/23/2024 in #⚡|serverless
How does the vLLM serverless worker support the OpenAI API contract?
Many thanks for the explanation. With this info I believe it would be possible to develop a custom worker with an OpenAI contract. According to my maths, Mixtral 8x7B loaded with any of the options that vLLM supports requires 90+ GB of VRAM, which exceeds the maximum VRAM available on the serverless platform (even using 2 GPUs). Also, be aware that loading quantized models (for example q4) allows us to rely on smaller, cheaper, and more readily available hardware.
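For reference, my back-of-envelope maths (assuming roughly 46.7B total parameters for Mixtral 8x7B and counting only the weights, not KV cache or activation overhead):

```python
# Rough VRAM estimate for Mixtral 8x7B weights only
# (ignores KV cache and activation overhead).
PARAMS_BILLION = 46.7  # approximate total parameter count

def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    # 1e9 params * N bytes/param is roughly N GB per billion params
    return params_billion * bytes_per_param

print(f"fp16 (2 bytes/param): ~{weight_vram_gb(PARAMS_BILLION, 2.0):.0f} GB")  # ~93 GB
print(f"q4 (0.5 bytes/param): ~{weight_vram_gb(PARAMS_BILLION, 0.5):.0f} GB")  # ~23 GB
```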
I'm browsing the vLLM worker repo as well as the runpod-python library, but I don't see where the hacky magic happens.
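To be clear about what I'm looking for, I imagined something roughly like the sketch below (not the actual worker code): a plain runpod-python handler that accepts an OpenAI-style chat payload under job["input"] and returns an OpenAI-shaped response, where generate() is just a stand-in for the real model call. I can't see where the worker does the equivalent translation.

```python
import time
import uuid

import runpod  # runpod-python SDK


def generate(prompt: str) -> str:
    """Stand-in for the real model call (vLLM engine, etc.)."""
    return f"(echo) {prompt}"


def handler(job):
    # Assume the client sends an OpenAI-style chat.completions body as the job input.
    body = job["input"]
    prompt = body["messages"][-1]["content"]
    text = generate(prompt)
    # Shape the reply like an OpenAI chat.completions response.
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": body.get("model", "unknown"),
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }


runpod.serverless.start({"handler": handler})
```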
@Alpay Ariyak could you point us to a repo where the hacky solution for vLLM was implemented?
What I want to achieve is a RunPod worker (similar to the vLLM one) but for Ollama (with streaming support), since that tool allows loading quantized models that otherwise can't fit into the available GPU VRAM. For example, I couldn't load Mixtral 8x7B on serverless because via vLLM it takes too much VRAM (it works only with fp16 params). Something like the sketch below is roughly what I have in mind.
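A minimal sketch, assuming `ollama serve` is already running inside the container and the model tag is just an example q4 build; it uses a generator handler so yielded chunks are streamed back to the client:

```python
import json

import requests
import runpod

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes `ollama serve` runs inside the container


def handler(job):
    """Generator handler: yielded chunks are streamed back to /stream clients."""
    inp = job["input"]
    payload = {
        "model": inp.get("model", "mixtral:8x7b-instruct-v0.1-q4_K_M"),  # example q4 tag
        "prompt": inp["prompt"],
        "stream": True,
    }
    with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # Ollama streams newline-delimited JSON objects
            if chunk.get("done"):
                break
            yield chunk.get("response", "")


# return_aggregate_stream lets non-streaming /run calls get the concatenated output too
runpod.serverless.start({"handler": handler, "return_aggregate_stream": True})
```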