Charixfox
RunPod
Created by jvm-cb on 6/27/2024 in #⚡|serverless
Maximum queue size
It sounds like the confusion is over terms:

"Running" vs "Idle" -> A worker only costs money while it is Running.

"Active" vs "Max" -> An Active worker is "always on shift" and so effectively always Running, but costs 40% less. Max workers are how many total might possibly be brought in to work. Max minus Active = temp workers, and they cost nothing unless they are Running.

When there is nothing to process - no queue at all - no worker is Running, so there is no worker cost. When the queue has ANYTHING in it, a worker will run - and cost money - to process the next thing in the queue, up to the max number of workers.

If you intend to have a non-empty queue at all times, you should have enough Active workers to handle the normal load of the queue at the lowest cost. Bigger loads will then pull in temp workers, up to the Max count, to work through the queue faster until it goes back down.
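A rough sketch of the math, with made-up prices; only the rule that Active workers cost ~40% less but bill around the clock, while temp workers bill only while Running, comes from the explanation above:

```python
# Hypothetical pricing illustration. The hourly rates are placeholders;
# only the 40% Active discount and the billing rules come from the explanation above.
FLEX_PRICE_PER_HR = 1.00                          # temp (flex) worker, billed only while Running
ACTIVE_PRICE_PER_HR = FLEX_PRICE_PER_HR * 0.60    # Active worker: 40% cheaper, billed 24/7

def monthly_cost(active_workers, temp_running_hours):
    hours_per_month = 24 * 30
    return (active_workers * ACTIVE_PRICE_PER_HR * hours_per_month
            + temp_running_hours * FLEX_PRICE_PER_HR)

# One Active worker covering the baseline, plus 200 hours of spike traffic on temp workers:
print(monthly_cost(active_workers=1, temp_running_hours=200))   # 632.0
# The same load handled entirely by temp workers (720 baseline hours + 200 spike hours):
print(monthly_cost(active_workers=0, temp_running_hours=920))   # 920.0
```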
57 replies
RunPod
Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
It does. I blame vLLM.
10 replies
RunPod
Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
So: 1, 2, 4, 8, 16, 32, and 64.
10 replies
RunPod
Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
vLLM specifically says that 64 divided by the GPU count must leave no remainder.
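Since 64 = 2^6, the GPU counts that divide it evenly are exactly the ones listed above. A quick check (the head count of 64 is taken from this thread):

```python
# Valid tensor-parallel sizes: the GPU count must divide the head count (64 here) evenly.
NUM_HEADS = 64  # assumption from this thread
valid_gpu_counts = [n for n in range(1, NUM_HEADS + 1) if NUM_HEADS % n == 0]
print(valid_gpu_counts)  # [1, 2, 4, 8, 16, 32, 64]
```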
10 replies
RunPod
Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
True enough. Then again, can we really say that any FOSS solution is 'production ready' right now?
58 replies
RunPod
Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
As in, there is no turnkey, "Just Type This" solution for deploying Aphrodite Engine on serverless.
58 replies
RunPod
Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
On last-gen 48GB cards, for example, it's 40% active runtime (actively processing requests).
58 replies
RunPod
Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
If all else fails, just run the numbers to see whether serverless will be better for your use case. There's a level of active time beyond which pods become more cost-efficient.
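For example, a rough break-even check (the prices below are placeholders; the real rates depend on the GPU, but the shape of the math is the same):

```python
# Rough break-even between serverless and a dedicated pod. Placeholder prices;
# substitute the actual rates for the GPU you're considering.
POD_PRICE_PER_HR = 0.50          # hypothetical always-on pod price
SERVERLESS_PRICE_PER_HR = 1.25   # hypothetical serverless flex price, billed only while Running

# The pod costs the same regardless of utilization; serverless scales with active time.
break_even = POD_PRICE_PER_HR / SERVERLESS_PRICE_PER_HR
print(f"Pods become cheaper above ~{break_even:.0%} active time")  # ~40% with these numbers
```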
58 replies
RunPod
Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
Until vLLM supports more quant formats, you'll need an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or if Aphrodite-Engine ever works on serverless, that will be an option too.
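The thread doesn't name the tool, but one way to do this on a Jupyter pod is the AutoAWQ library; a sketch with placeholder model paths:

```python
# Sketch: producing a 4-bit AWQ quant with AutoAWQ (pip install autoawq).
# The paths are placeholders; the source model and output directory are up to you.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "org/source-model"   # hypothetical unquantized HF model
quant_path = "source-model-awq"   # where the quantized weights are written

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Standard 4-bit AWQ settings; vLLM can then load the result with quantization="awq".
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```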
58 replies
RunPod
Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
I'll give that a try. Ideally I'll find up-to-date info on how to quantize it to an AWQ that isn't paywalled. Thank you!
17 replies
RunPod
Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
Which is why it's so big.
17 replies
RunPod
Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
I'm using the unquantized model, not a GGUF.
17 replies
RunPod
Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
Sadly there's no AWQ 4-bit quant of the model, only GGUF, which vLLM doesn't support. I'd make a quant myself if I could figure out how to do it successfully.
17 replies
RunPod
Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
Baking in the f16 model would create a 170GB image, which Docker Hub won't support, so I'm not sure how I'd get it onto the worker. I'm willing to try network storage, though I'd need documentation on how to set that up properly and access the model on it from a cold-start worker.
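If the network storage route works out, the usual pattern is to keep the weights on the volume and point the worker at them. A minimal sketch, assuming the volume is attached to the endpoint and mounted at /runpod-volume (check the current RunPod docs for the exact mount point):

```python
# Minimal sketch of a serverless handler loading vLLM weights from a network volume
# instead of baking them into the image. The /runpod-volume mount point and the
# model path are assumptions; verify against the current RunPod documentation.
import runpod
from vllm import LLM, SamplingParams

# Loaded once per cold start, outside the handler, so warm requests reuse it.
llm = LLM(model="/runpod-volume/models/my-model")  # hypothetical path on the volume

def handler(event):
    prompt = event["input"]["prompt"]
    outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
    return {"text": outputs[0].outputs[0].text}

runpod.serverless.start({"handler": handler})
```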
17 replies