With an LLM on RunPod, is there a cost like other providers (e.g., per token), and what if it's serverless?
Hi, we want to run an LLM on RunPod, but I'm concerned about running serverless since it's pretty slow and we need the LLM to be pretty much instant. The other thing is we don't want to run a GPU all the time, since that ends up costing a lot. Can someone out there give me some advice please?
6 Replies
With serverless you are only charged while your worker is responding to requests. There is a limit on input size: with RUN (the async endpoint) the maximum is 10MB, and with RUNSYNC (the sync endpoint) the maximum is 20MB. That really only comes into play when you're passing large media files like images or video in and out, and for those people usually pass S3 bucket references instead. For an LLM, 10MB is a LOT of text.
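To make the async/sync distinction concrete, here's a minimal sketch of calling both endpoints with Python's requests library. The API key, endpoint ID, and the `{"prompt": ...}` payload shape are placeholders; your worker's actual input schema may differ.

```python
import requests

API_KEY = "YOUR_RUNPOD_API_KEY"   # placeholder
ENDPOINT_ID = "your-endpoint-id"  # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
PAYLOAD = {"input": {"prompt": "Summarize serverless billing in one sentence."}}

# Async: /run returns a job ID immediately; poll /status/{id} for the result.
run = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
    headers=HEADERS, json=PAYLOAD, timeout=30,
)
job_id = run.json()["id"]
status = requests.get(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}",
    headers=HEADERS, timeout=30,
)
print(status.json())

# Sync: /runsync holds the connection open and returns the result directly.
sync = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers=HEADERS, json=PAYLOAD, timeout=120,
)
print(sync.json())
```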
Serverless can be just as fast as a pod, or faster. Unlike with pods, you can have many workers, each with their own GPU, responding to your requests. If you need to scale, you can add active workers that run all the time.
It depends on what your payloads look like, but in general serverless is designed to scale to production-level request volumes, while pods are more for individuals running workloads for their own direct use. The best thing about RunPod serverless is that you can scale from zero: if you don't get any requests, you don't pay anything.
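For reference, this is roughly what runs inside a serverless worker using the runpod Python SDK; the model-loading lines are placeholders for whatever LLM you deploy. The key idea is that the model is loaded once at worker startup, outside the handler, so only the first cold request pays that cost.

```python
import runpod

# Load the model once at worker startup (outside the handler) so each
# request only pays inference time, not load time.
# model = load_my_llm()  # placeholder for your actual model setup

def handler(job):
    """Called once per queued request; job["input"] is the JSON you POSTed."""
    prompt = job["input"].get("prompt", "")
    # output = model.generate(prompt)  # placeholder inference call
    output = f"echo: {prompt}"
    return {"output": output}

# Hand the handler to the RunPod serverless runtime.
runpod.serverless.start({"handler": handler})
```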
If I personally ran a pod, I'd worry that I might get distracted and forget it was running, or fall asleep and wake up to a larger-than-expected bill. Since you aren't charged for idle time with serverless workers, that can't happen with them.
Thanks buddy! We were running a serverless endpoint for transcription using Whisper, but it takes a while to start up. I'll deploy an LLM in serverless and see how it goes. Thanks again for your advice! I like that serverless doesn't cost us anything for idle time, since these GPUs are super expensive right now!!!
Are you using the Faster Whisper model? Also, how are you loading the model? Downloading the model at runtime adds excess startup delay; it's better to build it into the image. Even a network volume can add startup delay. Building the model into the image, and making sure the worker doesn't try to download it anyway, lets you get the full benefit of FlashBoot, which can speed up subsequent requests.
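One common way to bake the weights in is a small download script run at image build time (e.g., from a Dockerfile RUN step), so the files are already on disk when the worker starts. This is a sketch assuming the faster-whisper package; the "large-v3" model name and /models cache path are assumptions, so adjust them to your setup.

```python
# download_model.py - run during "docker build" so the weights are baked
# into the image instead of being fetched at worker startup.
from faster_whisper import WhisperModel

# Instantiating the model triggers the download into the given cache dir.
# "large-v3" and /models are assumptions; use whatever your handler loads.
WhisperModel("large-v3", device="cpu", compute_type="int8",
             download_root="/models")
print("Model cached in /models")
```

At runtime the handler can then load from the same download_root (ideally with local_files_only=True) so it never tries to reach the network again.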
Yeah, FlashBoot helps with subsequent requests, but it only kicks in after the first request, so if you want to make sure a worker is always warm, use an active worker.
Yes, this might be the issue, not sure yet! I tried running meta-llama/Meta-Llama-3.1-8B but it's just not working. Has anyone else got this one working?