Is GPU Cloud suitable for deploying LLMs, or only for training?
I'm pretty new to RunPod. I have already built 4 endpoints on Serverless and it's pretty straightforward for me, but I don't understand whether GPU Cloud is also suitable for pure LLM inference via API for chatbot purposes, or whether it's only for training models and saving weights. The main question: can I also deploy my LLM for inference on GPU Cloud for production? And where do I find the API I should make calls to? I'm asking because I find Serverless very unstable for production, or maybe it's my fault: whenever a worker starts, it downloads the model weights again, which sometimes weigh 100GB+ and take 5-15 minutes, so after a user makes a query he has to wait up to 15 minutes for a response from Serverless while the worker first downloads the weights from HuggingFace and then runs inference.
15 Replies
GPU Cloud does not scale as well as Serverless and should typically not be used for production applications; Serverless is better suited for production applications that need to serve multiple concurrent users. Serverless is fine for what you are doing, you have just built your worker incorrectly. You should either build the model into your Docker image or store it on a network volume, but you should definitely not be downloading it inside the worker itself.
Understood, thanks. Between these two options, building the model into the Docker image or storing it on a network volume, which one is better?
I have added a network volume, but it doesn't seem to work.
Inside the Docker image, because then you are not restricted to a specific region and have higher GPU availability, and it's also very slow to load large models from the network volume disk.
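For reference, a minimal sketch of what baking the weights in at build time can look like (the base image tag, model name, and paths here are just placeholders, adjust them to your own setup):

# Dockerfile (sketch) - base image tag is only an example
FROM runpod/base:0.6.2-cuda12.2.0
RUN pip install --no-cache-dir runpod transformers torch huggingface_hub

# Pull the full model snapshot into an image layer at build time,
# so workers never download 100GB+ from HuggingFace on cold start
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-Instruct-v0.2', local_dir='/models/mistral')"

COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]

# handler.py then loads from the baked-in path, never from the Hub
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("/models/mistral", local_files_only=True)

With that, a cold start only has to load the weights from the container disk instead of pulling them over the network.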
Awesome, thanks, then this is the way to go. One more question, about FlashBoot: should I always use it in order to reduce cold starts to 2s, even for a big 70B LLM, or does it have some restrictions and possible issues?
It doesn't always reduce cold starts to 2s and is also only really beneficial if you have a constant flow of requests
Better to keep it on imo, and set your max workers to 3. It's better to have more max workers; RunPod has some issues with max workers at 1, because that's more of a development setting.
U don't pay for workers unless they go active from a request
Other users using the GPU and causing it to become throttled when you set max workers to 1 isn't a RunPod issue, it's a user issue, because RunPod defaults it to 3, not 1.
Yeah, maybe a user education issue tho.
on best practice
Agree. I guess when most ppl start off with RunPod they don't expect to get throttled / sometimes I find that RunPod has a weird initialization with a max of 1.
Best practice is to use the sane defaults and not be a moron by changing it 😂
xD fair. But I can see why ppl change it. When I first started, I was running out of endpoints because everything was set to 3 (I only had a max of 10 at the time). And then I changed it to 1, thinking I just need one active endpoint, like a Lambda function. It's not really discussed in the docs that the GPUs get throttled.
So a lot of people will assume it's like a Lambda - it's kinda marketed, I feel, as a Lambda with a GPU.
Wish the docs said that a max of 2 workers also gets you the 5 workers on the side. I think flash said 1 is considered development, 2 or more is considered prod.
I think too many people use Serverless to save money on occasional inference instead of using it to scale out their production applications
A max of 2 to 5 all give you 5, but it's not really 5; the max is still honoured, so if you have 2, only 2 will run at once, the other 3 are just there to help with throttling.
Yup yup
One more question. I believe you have experience with HF transformers and LLMs: do you know what command to put into the Dockerfile to get the weights pre-downloaded while building the Docker image, so that they can later be loaded in the endpoint with .from_pretrained? Or am I looking in the wrong direction?
I thought of just downloading the model repo and then using the .from_pretrained method to load the weights from the local folder, but it looks like they have different extensions or something; it doesn't work and I still haven't found a reliable solution :(
# Dockerfile
RUN git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

# Python (endpoint code)
from transformers import AutoModelForCausalLM
model_path = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(
    f"./{model_path.split('/')[1]}/",
    local_files_only=True)
And I'm getting the error SafetensorError: Error while deserializing header: HeaderTooLarge
I'm not sure you can run code like that? Like that f-string split. That's Python code, right? lol. Idk, I'm not a Docker expert
But you can put that in a bash script or something
Another method is to clone this locally to your computer
and then do a COPY folderwhereyoucopied /dockerlocationfolder
that is also a valid method
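For completeness: the HeaderTooLarge error above usually means the .safetensors files in the cloned folder are only git-lfs pointer stubs (a few hundred bytes each) because git-lfs wasn't set up when the repo was cloned inside the image. If you stay with the git clone approach, installing git-lfs first should fix it (the paths below are only examples); otherwise snapshot_download as sketched earlier avoids git entirely.

# Dockerfile - install git-lfs before cloning so the real weight files
# are fetched instead of pointer stubs
RUN apt-get update && apt-get install -y git git-lfs && git lfs install
RUN git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 /models/Mistral-7B-Instruct-v0.2

# Python (endpoint code) - load strictly from the local folder
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "/models/Mistral-7B-Instruct-v0.2",
    local_files_only=True)

A quick sanity check after the build is that the .safetensors files are several GB each rather than a few hundred bytes.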