Charixfox
RunPod
•Created by octopus on 9/20/2024 in #⚡|serverless
Llama-70B 3.1 execution and queue delay time much larger than 3.0. Why?
You can find sources online that indicate 3.0 averages around three times as fast as 3.1. So while 3.1 is more accurate, 3.0 is speedier.
2 replies
RunPod
•Created by peteryoung2484 on 9/13/2024 in #⚡|serverless
Is there a way to speed up the reading of external disks(network volume)?
Other threads say to expect network volume loads to take longer, so bake the model into the image if you can.
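For what it's worth, a minimal sketch of what "baking it in" can look like at build time, assuming the model comes from Hugging Face and `huggingface_hub` is installed in the build environment. The repo id and target directory below are placeholders:
```python
# download_model.py - run during `docker build` (e.g. RUN python download_model.py)
# so the weights land in an image layer instead of on a network volume.
from huggingface_hub import snapshot_download

# Placeholder repo id and target directory - substitute your own.
MODEL_REPO = "meta-llama/Meta-Llama-3.1-70B-Instruct"
TARGET_DIR = "/models/llama-3.1-70b"

# Gated repos will also need a token=... argument here.
snapshot_download(repo_id=MODEL_REPO, local_dir=TARGET_DIR)
```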
49 replies
RunPod
•Created by NERDDISCO on 8/9/2024 in #⚡|serverless
Slow network volume
I split the model layers just fine, but one of the stock layers on the worker-vllm image is just shy of 13GB when built, so I'll be poking at that for a bit.
64 replies
RunPod
•Created by NERDDISCO on 8/9/2024 in #⚡|serverless
Slow network volume
Is that a viable option for such a large model? I was under the impression it only scaled well for smaller models.
64 replies
RunPod
•Created by NERDDISCO on 8/9/2024 in #⚡|serverless
Slow network volume
Even more oddly, sometimes it will load two segments at 1-4 s/it, then the next at 38 s/it, and then the next three at 50-60 s/it each. It's very inconsistent. The 4+ minute load is a cold start that I'm paying for every second of, and it can happen whenever the container is destroyed immediately after a run. While the container is doing that cold start, any requests routed to it time out on the client side.
64 replies
RunPod
•Created by NERDDISCO on 8/9/2024 in #⚡|serverless
Slow network volume
s/it, out of seven segments. Full load of the model takes at least 280 seconds in those instances, but about 21 seconds in other geographical areas.
64 replies
RunPod
•Created by NERDDISCO on 8/9/2024 in #⚡|serverless
Slow network volume
Hopefully this gets resolved soon. I'm using a network volume to hold a 65GB model for specific GPU hardware, and 40+ s/it to load it is not good at all.
64 replies
RunPod
•Created by naaviii on 8/29/2024 in #⚡|serverless
Urgent: Issue with Runpod vllm Serverless Endpoint
I ran into the same problem. After finding this thread I swapped to the dev image mentioned above, and things are no longer crashing outright.
22 replies
RunPod
•Created by blabbercrab on 7/5/2024 in #⚡|serverless
Serverless is timing out before full load
Ah. I have no advice on that one unfortunately.
39 replies
RunPod
•Created by blabbercrab on 7/5/2024 in #⚡|serverless
Serverless is timing out before full load
If it's not getting them into memory, that's a problem I'm not sure about. If it's not getting them onto disk from the download before dying, that one I've seen and solved by using network storage, as long as the download resumes. Then it can work on the downloads until it dies, pick up where it left off the next time it tries, and eventually finish without ever having to download them again afterward.
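For the record, the rough shape of what worked for me. This is a hedged sketch assuming the weights come from Hugging Face and the network volume is mounted at the usual serverless path; the repo id and path are placeholders:
```python
# Pull weights onto the persistent network volume so a killed worker
# can pick up the download instead of starting from scratch.
from huggingface_hub import snapshot_download

MODEL_REPO = "some-org/some-70b-model"         # placeholder repo id
VOLUME_DIR = "/runpod-volume/models/some-70b"  # placeholder network-volume path

# Files that already finished downloading are skipped on the next attempt,
# so repeated cold starts converge on a complete copy instead of re-downloading.
snapshot_download(repo_id=MODEL_REPO, local_dir=VOLUME_DIR)
```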
39 replies
RunPod
•Created by jvm-cb on 6/27/2024 in #⚡|serverless
Maximum queue size
It sounds like the confusion is over terms:
"Running" vs "Idle" -> A worker only costs while it is Running.
"Active" vs "Max" -> An active worker is "Always on shift" and so effectively always Running, but costs 40% less. Max workers are how many total might possibly be brought in to work. Max minus Active = Temp Workers, and they also are not costing anything unless they are Running.
When there is nothing to process - no queue at all - there is no worker Running, so no cost for the worker(s).
When the queue has ANYTHING in it, a worker will run - and cost money - to process the next thing in queue, up to the max number of workers.
If you intend to have a non-empty queue at all times, you should have enough "Active" workers to handle the normal load of the queue and cost the least. Then bigger loads will pull in "Temp workers" up to the Max count to handle the queue faster until it goes down.
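Back-of-the-envelope version of that tradeoff. The rates here are made-up placeholders, not RunPod's actual pricing; the point is only the shape of the math:
```python
# Toy cost comparison: always-on "Active" workers vs. on-demand "Temp" workers.
# Rates are hypothetical; Active is modeled at the 40% discount mentioned above.
TEMP_RATE_PER_HR = 1.00                        # hypothetical on-demand worker rate
ACTIVE_RATE_PER_HR = TEMP_RATE_PER_HR * 0.60   # "always on shift", 40% cheaper

HOURS_IN_MONTH = 730

def monthly_cost(active_workers: int, temp_worker_hours: float) -> float:
    """Active workers bill for every hour of the month; Temp workers
    only bill for the hours they actually spend Running."""
    return (active_workers * ACTIVE_RATE_PER_HR * HOURS_IN_MONTH
            + temp_worker_hours * TEMP_RATE_PER_HR)

# One always-on worker for the baseline load, plus ~50 burst hours on Temp workers:
print(monthly_cost(active_workers=1, temp_worker_hours=50))
```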
57 replies
RunPod
•Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
It does. I blame vLLM.
10 replies
RunPod
•Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
So: 1, 2, 4, 8, 16, 32, and 64.
10 replies
RunPod
•Created by octopus on 6/25/2024 in #⚡|serverless
Distributing model across multiple GPUs using vLLM
vLLM specifically says 64 / (GPU count) must leave no remainder.
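If it helps, a one-liner to enumerate the GPU counts that satisfy that rule, assuming the 64 here is the model's attention-head count that vLLM checks divisibility against:
```python
# Valid tensor-parallel GPU counts when the divisibility target is 64:
# the heads have to split evenly across the GPUs.
valid_gpu_counts = [n for n in range(1, 65) if 64 % n == 0]
print(valid_gpu_counts)  # [1, 2, 4, 8, 16, 32, 64]
```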
10 replies
RunPod
•Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
True enough. Though can we really say that any FOSS solution is 'production ready' right now?
58 replies
RunPod
•Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
As in, there is no turnkey, "Just Type This" solution for deploying Aphrodite Engine on serverless.
58 replies
RunPod
•Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
On last-gen 48GB cards, for example, the break-even is around 40% active runtime (actively processing requests).
58 replies
RunPod
•Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
If all else fails, just run the numbers to see if serverless will be better for your use cases. There's an amount of active time where pods become more cost efficient.
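A hedged sketch of "running the numbers" - both rates here are invented placeholders, so plug in the real per-second serverless price and per-hour pod price for the card you're after:
```python
# Toy break-even calculation between serverless and a dedicated pod.
# Both rates are placeholders, not actual RunPod pricing.
SERVERLESS_RATE_PER_SEC = 0.0006   # hypothetical flex-worker price per second
POD_RATE_PER_HR = 1.20             # hypothetical 24/7 pod price per hour

def serverless_cost_per_hr(active_fraction: float) -> float:
    """Serverless only bills for the seconds a worker is actually Running."""
    return active_fraction * 3600 * SERVERLESS_RATE_PER_SEC

# Above this active fraction, the always-on pod is cheaper.
break_even = POD_RATE_PER_HR / (3600 * SERVERLESS_RATE_PER_SEC)
print(f"break-even at ~{break_even:.0%} active time")
for frac in (0.10, 0.40, 0.80):
    print(f"{frac:.0%} active: ${serverless_cost_per_hr(frac):.2f}/hr vs ${POD_RATE_PER_HR:.2f}/hr pod")
```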
58 replies
RunPod
•Created by Armyk on 5/30/2024 in #⚡|serverless
GGUF in serverless vLLM
Until vLLM supports more quant formats, you'll need an AWQ, SqueezeLLM, or GPTQ quant of the model. I used a Jupyter pod to make an AWQ of the model I wanted. Or, if Aphrodite Engine ever works on serverless, that will be an option too.
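For reference, the rough shape of what that looked like on the Jupyter pod - a sketch assuming the AutoAWQ library with its usual 4-bit settings; the model and output paths are placeholders:
```python
# Rough sketch of producing an AWQ quant on a Jupyter pod (assumes `pip install autoawq`).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

MODEL_PATH = "some-org/some-model"        # placeholder source model
QUANT_PATH = "/workspace/some-model-awq"  # placeholder output directory

# Common 4-bit AWQ settings; adjust if the model needs something else.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Calibrate and quantize, then write out a directory vLLM can load with --quantization awq.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(QUANT_PATH)
tokenizer.save_pretrained(QUANT_PATH)
```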
58 replies
RunPod
•Created by Charixfox on 5/21/2024 in #⚡|serverless
Speed up cold start on large models
I'll give that a try. Ideally I'll find up-to-date, non-paywalled info on how to quantize it to AWQ. Thank you!
17 replies