Maintenance - only a Community Cloud issue?
Hey there!
I just started a new pod and noticed this maintenance window.
Is this only a thing on community cloud or also on secure cloud?
I'm trying to find a way to host a model 24/7 for a chat bot and a three-day maintenance window ain't gonna cut it.
Looking forward to feedback.
hello
wait, yeah these are on both secure and community cloud i think
Really? But how can you run a production system with a three-day downtime?
i dont know what to suggest, those are mandatory right. but if its possible, try to create another pod for HA
In what way mandatory?
I feel stupid, but HA?
Sorry, high availability
like creating some other pods for backup
something like that
its not that frequent ig:
https://docs.runpod.io/hosting/maintenance-and-reliability
like software upgrades, server maintenance
But why that? One pod alone is already as expensive as the GCP server I'm running. And GCP doesn't shut it down for maintenance. I wanted to switch because runpod is more convenient and I prefer to support smaller services instead of the big clouds. But if I end up paying 2x what I pay for GCP, then it doesn't make sense.
But I don't want that. Never touch a running system. I don't want someone fumbling with a server that works.
That's why I'm so confused.
You don't just update a server just to update a server.
Whaat 2x?
Yup.
really how is that possible
My GCP costs me $250 per month for 24/7 uptime. I've deployed Ollama with llama2 there and it runs just fine. The smallest community cloud here would be lower than that ($0.14/h), but even the smallest secure cloud would be the same price ($0.34/h is roughly $250/month).
So, if i need two, that's $500 each month.
yeah i cant do anything about that, but if you're unhappy with the maintenance i guess you can ask support in the webchat or by email about it
Ah, don't worry. It wasn't a complaint, I'm just confused.
I gotta send them a message I guess. Maybe there is a way to opt out?
yeah that makes me wonder too
yeah sure, opt out like what?
Opt out of touching the server I pay for.
like stopping it?
or removing it
on maintenance it wont be billed either btw
Well, that might be true. But I would be losing users and that costs me money.
I need 24/7 uptime.
I'm running a chatbot that's used worldwide.
i see yeah that uptime is important
try the webchat sir, might give you other info about this
Yup, did that right away when you said it. I didn't see there was one.
They are offline at the moment, so we'll have to wait for an answer.
ah
maintenance is for both secure and community, sometimes we need to update drivers etc
I think your best bet is to use serverless?
Serverless with a minimum of like ex. 2 active workers, and max of 10 workers
that way you can also scale as necessary
+ also that way pods will autocycle if something goes wrong
What are active workers in that context? Continuously incurring cost or only when I actually call the endpoint?
Is # workers == # parallel calls possible?
Continuously incurring cost, but at a 40% discount.
Yes, so u can have 0 min workers, 10 max workers
if u only wanna pay for up time
just there will be cold start times
To spin up a gpu etc
Yeah, that's the issue. I need answers within seconds.
Yea so u can have
min worker of 1 or 2
with a max of 10 for ex
That way u always have something running
But because of cold starts, they wouldn't be ready in time I assume?
Parallel function calls to ur handler are possible
can read their docs for example, but u can have a concurrency modifier for how many requests a worker can handle. need to be careful not to blow up the gpu memory tho
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
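Roughly what that looks like with the runpod Python SDK — a minimal sketch, assuming the concurrency_modifier hook works the way the linked handler.py uses it; the handler body and the hard-coded limit are placeholders:
```python
import runpod

def handler(job):
    # placeholder: whatever actually runs your model (vLLM, OpenLLM, ...)
    prompt = job["input"]["prompt"]
    return {"output": f"echo: {prompt}"}

def concurrency_modifier(current_concurrency):
    # hard-coded "safe" amount, as discussed above; tune to your GPU memory
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```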
On first call it would be slow, but u can adjust an 'idle time' a worker sits in before shutting down
so maybe the first call is slow for new workers
but it can stay online for X seconds or minutes
however u define and avoid more cold start
Ud only be paying cold start if u exceed ur minimum workers
That might be an idea.
And new workers need to spin up to go active
https://github.com/ashleykleynhans/runpod-api - a collection of Python scripts for calling the RunPod GraphQL API
If u want to get rlly into it
u can dynamically set the minimum workers on an endpoint
using a different server
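Something along these lines, I think — a rough sketch only; the saveEndpoint mutation and the workersMin/workersMax field names are my assumptions based on that repo, so double-check them there (the real mutation may also require more fields than shown):
```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "your-endpoint-id"  # placeholder

# Assumed mutation/field names (saveEndpoint, workersMin, workersMax);
# verify against the ashleykleynhans/runpod-api scripts before relying on this.
query = """
mutation {
  saveEndpoint(input: { id: "%s", workersMin: 2, workersMax: 10 }) {
    id
    workersMin
    workersMax
  }
}
""" % ENDPOINT_ID

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": RUNPOD_API_KEY},
    json={"query": query},
)
print(resp.json())
```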
Yeah, I guess keeping two workers active, with that 40% discount, would still be cheaper than an on-demand server.
Serverless GPUs for AI Inference and Training - deploy your ML models to production without worrying about infrastructure or scale.
I could do it time-based depending on when most of my users are online.
their pricing for serverless
Yeah
If u have some sort of backend keeping track of stuff
im sure u could get away with one active
and if u see an increase in users
scale the minimum workers to 2 or 3
and have those do parallel concurrency requests
Do you think those cold boot times are realistic?
it rlly rlly depends
For production workloads, they have a cache mechanism called FlashBoot
where supposedly i heard someone else get pretty good times
i guess their FlashBoot is just to load the model into memory and stuff faster
basically if u have an active worker, more max workers, more requests coming in
ive heard these are more ideal
for the caching mechanism obvs
i have never hit such a production workload cause i just use it for one off tasks and projects i have running
Is a worker a different pod or are all workers running on the same pod?
a worker is a different pod
I'm running a chat bot with 100s of users and I do have parallel access there already.
Telegram: AI Companions - a collection of AI companions that you can use 24/7.
I mean, u can just modify the concurrency modifier then like my code shows
and have a worker take on more requests in parallel
this shows it
and also another thread
https://discord.com/channels/912829806415085598/1200525738449846342
with a more primitive example when i was trying to initially get it working before
Oh nice, I have to look into that.
yea prob, i just never wrote code to dynamically max out based off of gpu memory, which i think is the best way to handle it
but u could if u knew for ex
a normal safe amt
just hard code it
Yeah, GPU mem is a thing ... but if each worker has its own pod, that should be fine?
the concurrency modifier supposedly can be dynamically modified
yea
i guess depends
how ur code handles ur gpu
memory getting blown up lol
i think most gpus just crash the system
but some libraries prevent it from happening
i had some instances where im running like three llms
and i blew up the memory
hmm
i mean it sounds like
if ur doing it already
with a pod
it isnt an issue
at the end the day, ur right its just a pod
My production system is on GCP at the moment, but I'm playing around with runpod. Running one pod only though, on-demand.
got it
Another option to also look into is fly.io
but i do think runpod's serverless offering is what i like the most
and ease of testing on a pod
Never heard of that, I gotta look into that.
but fly.io might also be a good gpu alternative provider
they also do dynamic scaling on requests i think? at least for their normal cpu stuff
havent tried their gpu offering yet
also more expensive than runpod, prob cheaper than gcp
But fly.io is more established i think
Are those the smallest GPUs they got?
You'd be surprised.
yea xD why i havent moved to them yet
I pay $260 per month for GCP, but it's 24/7.
I actually haven't found a cheaper alternative.
runpod prob is the best i think in terms of cost
but hmmm, has its quirks
dang
what kinda gpu is running for gcp to get that low
Look at the secure cloud on-demand offer on runpod, same price.
NVidia T4, has been fine so far.
The GPU is funnily enough not my problem so far.
yeaaa lol. i guess cause i use serverless
It's the model, I want to upgrade to llama3 and in doing so I started looking into the easiest way to run it.
thats rlly where i think runpod's best offering rn is as a differentiator
That's how I came to runpod.
I have to look into serverless I guess.
my cost rn i keep much lower than other providers since i dont need to pay 24/7
but if u got users, and only doing 260 a month
that pretty good
Yeah, I need 24/7, there is no other way.
yea, i think try with serverless, min: 1, max: 10
maybe stress test or shadow mode run it
could be worth it
gl gl!
Thank you so much for all that input! ❤️
How does it work though? When I try to use this vLLM quick deploy, there are no templates I can choose from. The list is just empty. Same when I try 'New Endpoint' - am I dumb or is the site buggy?
its optional right
just fill in what is required or what you need to configure then
kk, lemme try
alright
did it work
Nah, no clue how this works. I've deployed something (https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS - which works with vLLM) and then tried this request tab and the request is just stuck, sitting there, but all 5 workers are running, happily ticking away the money.
whaat
cancel the request bro
who this works? what do you mean by that?
Sorry, 'how'.
How this works.
I don't know how this serverless interface works.
well i think vllm works with openai's library
wait leme find the docs
OpenAI compatibility | RunPod Documentation
The vLLM Worker is compatible with OpenAI's API, so you can use the same code to interact with the vLLM Worker as you would with OpenAI's API.
try using that
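For what it's worth, the doc's pattern boils down to pointing the regular openai client at the endpoint — a minimal sketch, where the endpoint ID, API key and model name are placeholders:
```python
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"          # placeholder
RUNPOD_API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

# The vLLM worker exposes an OpenAI-compatible route under /openai/v1
client = OpenAI(
    api_key=RUNPOD_API_KEY,
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

response = client.chat.completions.create(
    model="Undi95/Llama3-Unholy-8B-OAS",  # the model the endpoint was deployed with
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)
```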
That's what I assumed the request on the website does, so that I don't even have to write code.
If that already doesn't work, do you expect my code to work?
wait did you setup network storage with that?
i think its downloading the model if im not wrong? it needs to store them in network storage
I guess there is none by default. Funny. D:
Yeah, you're right of course. That won't work.
Lemme add some.
hmm yeah, maybe by default they should recommend setting up a network volume when creating the vllm endpoint
And then I have no clue which data center to choose, because they have different GPUs there, and I don't even understand how that's connected.
My workers have all kinds of GPUs.
yeah nice
pick any
btw i'd suggest cancelling the request and trashing the running workers first
I'll just pick the US one because they've probably got most there.
okay ok
any works
Cancelled. How can I trash the workers?
click on the green one
its like a button
then theres trash logo
press it and confirm
Ah, found it.
here's one if you wanna test using curl later, but i'd suggest using the python code from the docs
yes
The workers automatically get moved to stale actually, and new ones initialize. So at least that one is done for me.
ok after you setup the network volume
try sending 1 request first, and see the log
Waiting for them to come up now.
yeh great
they moving region i think
I assume so too.
it should be downloading first, not sure never used vllm template before hahah
I ended up clicking Romania, because this one had the A4500 in it but since the workers get reset, that actually didn't matter. I could have chosen any data center.
later, after 1 is running, please send me the logs, i want to see them too
yeaah
It has to, yeah. That model is not just randomly available.
which do you prefer or just test around the gpus
what gpu did you use in gcp btw?
NVidia T4, the smallest one they got.
Oh
is that like 16gb ram here?
Yes, iirc.
hahahah
I created a 30GB volume, not sure how that can be.
oof
so fast
how big is your model hahah
The model is like 5GB.
hmm im clueless about what the template puts in there
but if you want to check it out you can create a gpu pod (not necessary)
or checkout the template's github
Yeah, not a fan of non-transparent templates. Had I just booted up a vanilla Ubuntu on AWS, I would already be done.
https://github.com/runpod-workers/worker-vllm - the RunPod worker template for serving LLM endpoints, powered by vLLM
no really, this doesnt look like 5gb
isnt it true that you need to download all of em to get them working?
You might be right. I'm used to models where they got one file per model version. But this is split into multiple files, you need all of them.
Ok, 30GB ain't gonna cut it.
looks like 32gb+
I need to add like a 100GB volume or so, because I'm sure unpacked it's even bigger.
nah nah try 40 first
Alright, let's try again.. haha
later if it runs out of space you can scale it up again hahah
Brb
Now we're stuck in initializing.
btw, I've shared this convo in the bloke discord where I got the pointer to come over here. We're trying to figure out runpod over there: https://discord.com/channels/1111983596572520458/1112353569077735595
It's still in init. I'll just create a fresh endpoint, this time with the correct config.
Ah, there is a container disk config right there actually. I've set that to 50 now.
Ah wait, that's the blank template ...
I'll try vLLM of course.
That's the right one. Default there is 30GB, just shy of the correct size.
I've set it to 50 GB now.
I'm not in that server yet so I can't see that channel link
Okok
I'm basically doing the same in Python:
Not much success though:
Checking the logs now.
btw: The cold start time is of course useless. 90s haha
Woah the logs so blurry pls copy and paste the text instead
Maybe only the first time boot ah
What's the logs like
wow, that screenshot is useless. I wonder why
I thought I'll try it through the request interface as well, but also, no success.
I think the right way for vllm is from the openai endpoints
Not the run or runsync
Ohhh, I didn't realize 'run' is the 'run' endpoint, not just 'run the query'. Bad naming.
What's the difference?
What's next
That's it.
Isn't there a running worker
You can press it and press logs
Ha, indeed.
That UI is... not easy to navigate.
Hmm I don't get what you mean, but it should be handled by the worker, and the openai support is a kinda new thing so I have to check the docs
The logs are kinda bad yeah
Just started another run via Python:
Only the last line is the latest request.
HTTP 502
And super helpful logs.
Kek
A little searching helps
The 'job has missing fields input' thingy is from the web's run request right?
Not from your python file?
No, it's my python request.
Only the last one
Or else the inputs the problem
at 13:41
There is no input.
That's why hahah
Try input then
That's what I did.
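That tracks — the /run and /runsync endpoints expect the payload wrapped in an "input" object. A minimal sketch (endpoint ID and API key are placeholders, and the exact input fields the vLLM worker accepts are defined in its README, so "prompt" here is an assumption):
```python
import requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder
RUNPOD_API_KEY = "YOUR_RUNPOD_API_KEY"  # placeholder

# /runsync blocks until the job finishes; /run returns a job id immediately,
# which you then poll via /status/<job_id>.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {                          # <- the "input" wrapper the error was about
        "prompt": "Hello, who are you?",
    }
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    json=payload,
    timeout=300,
)
print(resp.json())
```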
Try using openai's package instead bro
Hey Daddy!
Hello papa madiator
Fascinating.
Ah it works or what
Bruh local llm?
It does. Now I have to figure out what it costs and if it's fast enough.
No lol. That's the serverless endpoint.
Ohh
Finally
Use the openai package bro *3
The code is just in a file called playground.py.
Ohhh
And more importantly, how it handles parallel requests.
And how to use the model I want to use, which is not compatible with vLLM actually.
Sure, try using threads hahahah or callbacks if there are any
Use the openai package
Custom handler
Well, a little bit of coding is fun
Good luck
gz on the test setup, and thanks for linking @!x.com/dominicfrei i hope you will write a nice markdown with the example setup commands/screenshots
Well, we're not quite there yet.
@nerdylive @Wolfsauge I've now added that serverless vLLM thing to my bot, but that's how the answers look. Since I'm going through the OpenAI API I wonder, what can I do about that?
cc @drycoco
Not quite there yet but I'm working on it. But maybe you can help me with that question. I sometimes get 2min cold boots. What's your advice on workers (active, etc.) and how to make sure that cold boots never take long?
Do I need to keep active workers at 1? How does the serverless vLLM template handle parallel requests?
Also, should I create new threads for those questions? We've been drifting away from the original topic quite a bit.
those are bugs which meta fixed after releasing the models
it's in the config files.. generation_config.json, tokenizer_config.json, etc
there were 13 fixes after the model release; one or more of them is required to fix the stray eot_id / eos tokens, which aren't properly recognized
however, also vllm-0.4.0 has similar symptoms, even when using the fixed model
the minimum vllm version which works for me and llama 3 is 0.4.1
to make it work you need both the fixes in the model files and the updated app, in this case vllm
last time i checked, the non-gated reuploads of the latest meta models by nousresearch https://huggingface.co/NousResearch and their derivatives sometimes have the issues in generation_config and tokenizer_config fixed, but in different ways than meta - but i don't want to complicate things unnecessarily
chances are your vllm is out of date or the model files you're using aren't "fixed enough" in this or the other way. and it's probably both
You always know everything, @Wolfsauge
Unfortunately, the serverless config (https://github.com/runpod-workers/worker-vllm) is vLLM 0.3, so what can we do about that?
i don't know. i believe there is no other way than updating to 0.4.1 for llama 3
But I don't have access to the underlying pod in serverless, I can only deploy templates provided by runpod (if I understand correctly).
@Papa Madiator Maybe you can clarify?
No, you can make your own templates too
Ah, cool! Can you point me to the documentation about how that works on runpod and how to get started?
Wdym?
How can I create my own template? It didn't 'click' in my head yet how to get started, at least not when posting that message. I've done some research on my own since then. Can I just build a new docker image locally, push it to DockerHub, and then all I need is the name of the image here?
Build the docker container, push it to a docker registry, then add the new docker image to a template
Great. Wasn't obvious to me that it's that easy.
Thank you!
I guess I have to try that next to get to my goal. haha
So, I got vLLM running locally now to test it out and see if that's an option. The results are really cool, but not the right approach for me.
100 completions in about 100s is amazing for the total result. But the individual completion is too slow. I wonder if tabbyAPI (with sequential requests) and multiple parallel workers might actually be better. I expect no more than 2-3 requests at the same time for now. And no more than 5 for a while (assumptions can be wrong of course).
Hi @Papa Madiator ,
This is a weird question, but what if I'm too GPU-poor on my laptop to handle the build and a test of a docker image (model included) for a llama3 70B quantized model, for example?
(not exactly picked randomly; it's currently my need :))
I end up popping an instance on GCP to build and test my image before pushing it to Docker Hub, making it accessible for RunPod. Is this the correct workflow?
Is there any chance we can exit the running container from inside the pod and reach the underlying Linux where we can docker build . [...] && docker push?
GPU-poor? meaning your hardware GPU isnt enough to build a docker image?
building doesnt need gpu power except if you run the container locally
also another alternative you can also test on runpod too
No you can't get access to the main linux host
Hi,
Thanks for your prompt answer.
Yes, "GPU-poor" because my laptop has a really bad consumer-class GPU.
You might need GPU access even during build time if, for example, you download a model from HF with vllm. But even if there's no need at build time, I will need it at run time for the test, and as mentioned earlier, unfortunately, I can't do that locally.
So, what is the workflow you advise? I'm not sure I fully grasp your second point: 'also another alternative you can also test on runpod too'?
Thanks a lot
well you build the image, then deploy it to runpod, thats what i meant haha
ahah
not really effective i guess but for using the gpu i think thats the only way?
downloading with vllm needs gpu huh? never knew that
does that works for you?
I will rephrase it to be more accurate. I'm using the HF tools to download the model within my image (AutoTokenizer.from_pretrained, AutoModelForCausalLM.from_pretrained), and they raise an error if they can't find a GPU. To be more specific, it's actually quantization_config.py which complains about not finding a GPU for AWQ. I don't think the GPU is involved at any point during downloading; I guess it's just a check to avoid unnecessary model downloading if you don't have a GPU. But I admit I have been lazy, and I will find a way to overcome that point (by downloading the model with another tool) ^^
Oh yeah correct, if you're using that code
it checks for gpu
try using git on hf maybe? i have no experience downloading huggingface models without that code, sorry
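Another option that should sidestep the GPU check entirely is huggingface_hub's snapshot_download, which just fetches the repo files without instantiating the model — a minimal sketch, with the repo id and target dir as placeholders:
```python
from huggingface_hub import snapshot_download

# Downloads all repo files (safetensors shards, tokenizer/config jsons, ...)
# without loading the model, so no GPU is needed at build time.
snapshot_download(
    repo_id="some-org/some-llama3-70b-awq",  # placeholder repo id
    local_dir="/models/llama3-70b-awq",      # placeholder target dir inside the image
    # token="hf_...",                        # only needed for gated repos
)
```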
Thanks for the advice. I will give it a try.
Just to summarize my understanding of the workflow:
1. Build the image locally (in my case: vllm, the model (50 GB) and the required runpod handler)
2. Push it to dockerhub
3. Deploy a pod/serverless endpoint on runpod and give it a try
4. If the output is not as expected, restart from Step 1
Am I correct?
Yep yep
like that
wow nice formatting
Personally, what I would do just as advice:
https://discord.com/channels/912829806415085598/1194695853026328626
1) Start up a Pytorch Template from Runpod with a GPU pod
2) Manually go through the steps and record it all on the side
3) From there, becomes much easier to confirm if it works
4) Then push to dockerhub
5) Then confirm that your template works by downloading it on the Pod
The reason why is b/c then you can use the pytorch template as a base, and it comes with a lot of stuff in the background
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless
Can look at how I did it here; I even modified it so that I overwrite Runpod's start.sh script, so that I can launch the jupyter server without a password
Otherwise for modified templates with a jupyter notebook server, it will usually ask you for a token, which is a bit annoying.
Is pushing large images (> 10 GB) to Docker Hub painful (really slow) just for me (and my bad upload internet connection), or is it a pain in the workflow for everyone?
Oof
Hey try building in docker cloud or depot.dev
Docker cloud build, search that up
Or github runners if possible too
Thanks, Justin,
I checked your repository before asking my question here.
For context, I need to create a working Docker image containing a llama3 70B quantized model. Currently (as other users mentioned in this channel), it's not financially viable due to the delay time (around 120s) using the vLLM template. At $0.0025/s that's about a $0.30 cost just to spin up the serverless endpoint. The runpod docs and repositories affiliated with runpod mention that including the model inside the image reduces the delay time (some users on r/LocalLLaMA claim impressive figures using this technique). To do so and to test if it works, I was renting an instance on GCP (A100) to build and test my docker image, and I came here to ask if it was the right workflow.
All of that to say, I'm not sure how your repository (and your workflow) will fit my use case; wdyt?
Thanks
Showing my repository to show you the Dockerfile haha.
Just wondering, why not use like an API by anthropic or mixtral - or are you in prod, and that's why you're looking to optimize costs by self hosting?
Ahaha thanks, actually I used some part of your dockerfile
I think the repository I use can also use Llama3, even though not on the github readme
I was reading the OpenLLM issues
some people say that you can use the llama3
with a vllm backend
Obvs my repo is different than an officially supported runpod image tho
but just cause i like my methodology more
Also yes, including a llama3 model inside the image is much faster
the issue is that im not sure a 70B model is able to load easily
once images get to that size, to spin up docker images / build it becomes painful
Why I think using an API for LLMs tends to be better since they solve a lot of problems
Like ex. a llama2 70b model is above 100gb:
https://hub.docker.com/layers/justinwlin/llama2_70b_openllm/latest/images/sha256-c481f8dd51481daf995bfbe7075ee44bf0e1dc680cbda767cc97fcb7fd83f5a4?context=repo
at least when i tried to build it
This is a painful initialization time, so better for models this big to be stored in a network volume, if you really want to self-host your own llm
The idea is to validate that a specific quantized 70B llama3 is good enough for our use case (which implies non-technical teams playing with the model) before using it with vLLM in batching mode (therefore no serverless), with millions of prompts to pass every X.
Got it
You can probably just:
1) Start up a Pytorch template with a network volume attached
2) Download the 70B model from that GPU Pod (or use a CPU Pod since this is just for pure downloading and not GPU processing) - see the sketch after this list
3) in the future, you can always launch new pods all attached to the network volume
and they will all have access to the 70B model
that is what I would do
Make sure that whatever region you are using for the network volume has the correct GPU Pod that you want to test with available
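A rough sketch of steps 2 and 3 — assuming the attached network volume shows up at /workspace on the pod (that mount path and the repo id are assumptions on my part):
```python
from huggingface_hub import snapshot_download

# Step 2: run once from a pod (a CPU pod is fine) with the network volume attached.
# The files land on the volume, so any later pod attached to the same volume
# in that region sees them without re-downloading.
snapshot_download(
    repo_id="some-org/llama-3-70b-awq",          # placeholder repo id
    local_dir="/workspace/models/llama-3-70b",   # /workspace = assumed volume mount path
)

# Step 3: a later GPU pod attached to the same volume can then load straight
# from that path instead of the hub id, e.g. with vLLM:
#   from vllm import LLM
#   llm = LLM(model="/workspace/models/llama-3-70b", quantization="awq")
```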
I haven't tried it yet, but your solution just above was also recommended here: https://www.reddit.com/r/SillyTavernAI/comments/1app7gv/new_guides_and_worker_container_for/
ooo nice
They also mentioned that even with this solution, they still have some delay time (not 120s, but still). Some mention milliseconds of delay time when the model is included, but you might be right; size matters...
Do you remember the approximate build time for that?
No problem
I don't know, I use something called depot.dev for it, and had a 500gb cache to build it lol. Not sure it's worth it though, I wouldn't do it again for any image > 35GB in size
I really think that's why, for anything with more complex tasks for LLMs, I just use paid API services and go through a finetuning process, or I just use the available zeroshot model
It just isn't worth self-hosting an LLM unless I really dial it in, and it needs to be for a smaller model like a mistral7b
Thanks for the feedback π
That's what I'm currently considering. Building the image myself using tabbyAPI and https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2
I wish there was anyone out there offering a service that's running https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2 or https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS, but it seems like I have to host them myself.
Like any decent uncensored RP-capable model, it seems.
Or would you know of any?
I've heard of some websites offering hosted models for roleplay, try finding some AI chat roleplay websites