Maintenance - only a Community Cloud issue?

Hey there! I just started a new pod and noticed this maintenance window. Is this only a thing on community cloud or also on secure cloud? I'm trying to find a way to host a model 24/7 for a chat bot and a three day maintenance window ain't gonna cut it. πŸ™‚ Looking forward to feedback. 🫢
No description
215 Replies
nerdylive
nerdyliveβ€’8mo ago
hello wait, yeah these are on both secure and community cloud i think
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Really? But how can you run a production system with a three day downtime? πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
i don't know what to suggest, those are mandatory right? but if it's possible, maybe create another pod for HA
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
In what way mandatory? πŸ™‚ I feel stupid, but HA? πŸ™‚
nerdylive
nerdyliveβ€’8mo ago
Sorry, high availability: like creating some other pods as backup, something like that
nerdylive
nerdyliveβ€’8mo ago
like software upgrades, server maintenance
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
But, why that? One pod alone is already as expensive as the GCP server I'm running. And GCP doesn't shut it down for maintenance. πŸ˜„ I wanted to switch because runpod is more convenient and I prefer to support smaller services instead of the big clouds. But if I end up paying 2x what I pay for GCP, then it doesn't make sense. πŸ˜„ But I don't want that. Never touch a running system. I don't want someone to fumble with the server that works. πŸ˜„ That's why I'm so confused. You don't just update a server just to update a server. πŸ™‚
nerdylive
nerdyliveβ€’8mo ago
Whaat 2x?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yup.
nerdylive
nerdyliveβ€’8mo ago
really how is that possible
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
My GCP costs me $250 per month for 24/7 uptime. I've deployed Ollama with llama2 there and it runs just fine. The smallest community cloud here would be lower than that ($0.14/h), but even the smallest secure cloud would be the same price ($0.34/h β‰ˆ $250/month). So, if I need two, that's $500 each month.
nerdylive
nerdyliveβ€’8mo ago
yeah i can't do anything about that, but if you're unhappy with the maintenance i guess you can ask support in the webchat or via email about it
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ah, don't worry. It wasn't a complaint, just confused. πŸ™‚ I gotta send them a message I guess. Maybe there is a way to opt out?
nerdylive
nerdyliveβ€’8mo ago
yeah that makes me wonder too. sure, but opt out like what?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Opt out of touching the server I pay for. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
like stopping it? or removing it? btw it won't be billed during maintenance either
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Well, that might be true. But I would be losing users and that costs me money. πŸ˜‰ I need 24/7 uptime. I'm running a chatbot that's used worldwide.
nerdylive
nerdyliveβ€’8mo ago
i see, yeah that uptime is important. try the webchat sir, might give you other info about this
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yup, did that right away when you said it. I hadn't seen that there was one. πŸ˜„ They are offline at the moment, so we'll have to wait for an answer. πŸ™‚
nerdylive
nerdyliveβ€’8mo ago
ah
Madiator2011
Madiator2011β€’8mo ago
maintenance is for both secure and community, sometimes we need to update drivers etc
justin
justinβ€’8mo ago
I think your best bet is to use serverless? Serverless with a minimum of e.g. 2 active workers and a max of 10 workers. that way you can also scale as necessary + pods will autocycle if something goes wrong
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
What are active workers in that context? Continuously incurring cost or only when I actually call the endpoint? Is # workers == # parallel calls possible?
justin
justinβ€’8mo ago
Continuously incurring cost, but at a 40% discount. Yes, so u can have 0 min workers, 10 max workers if u only wanna pay for uptime, just there will be cold start times to spin up a gpu etc
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, that's the issue. I need answers within seconds. πŸ™‚
justin
justinβ€’8mo ago
Yea so u can have a min worker of 1 or 2 with a max of 10 for ex. That way u always have something running
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
But because of cold starts, they wouldn't be ready in time I assume?
justin
justinβ€’8mo ago
Parallel function calls to ur handler are possible, can read their docs for an example. u can have a concurrency modifier for how many requests a worker can handle. need to be careful not to blow up the gpu memory tho https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
GitHub
Runpod-OpenLLM-Pod-and-Serverless/handler.py at main Β· justinwlin/R...
A repo for OpenLLM to run pod. Contribute to justinwlin/Runpod-OpenLLM-Pod-and-Serverless development by creating an account on GitHub.
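For reference, the concurrency modifier justin mentions is wired up roughly like this in the RunPod Python SDK. This is a minimal sketch, not his exact handler, and the model call is a placeholder:

import runpod

MAX_CONCURRENCY = 4  # assumed safe ceiling for one worker; tune against your model's GPU memory

def adjust_concurrency(current_concurrency):
    # Called by the SDK to decide how many jobs this worker may run at once.
    return MAX_CONCURRENCY

async def handler(job):
    prompt = job["input"]["prompt"]
    # Placeholder: call your model here (vLLM, OpenLLM, ...) and return its output.
    return {"output": f"echo: {prompt}"}

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": adjust_concurrency,
})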
justin
justinβ€’8mo ago
The first call would be slow, but u can adjust an "idle time" a worker sits in before shutting down. so maybe the first call is slow for new workers, but it can stay online for X seconds or minutes, however u define it, and avoid more cold starts. Ud only be paying cold start if u exceed ur minimum workers
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That might be an idea.
justin
justinβ€’8mo ago
And new workers need to spin up to go active
justin
justinβ€’8mo ago
GitHub
GitHub - ashleykleynhans/runpod-api: A collection of Python scripts...
A collection of Python scripts for calling the RunPod GraphQL API - ashleykleynhans/runpod-api
justin
justinβ€’8mo ago
If u want to get rlly into it, u can dynamically set the minimum workers on an endpoint using a different server
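A rough sketch of that idea against the RunPod GraphQL API that the linked repo wraps. The mutation and field names (saveEndpoint, workersMin) are assumptions taken from that repo and should be checked against the current schema before relying on them:

import requests

RUNPOD_API_KEY = "XXXX"            # your RunPod API key
ENDPOINT_ID = "vllm-xxxxxxxxxxxx"  # the serverless endpoint to adjust

# Assumed mutation/field names; verify against the repo above or the GraphQL schema.
mutation = """
mutation {
  saveEndpoint(input: {id: "%s", workersMin: 2}) {
    id
    workersMin
  }
}
""" % ENDPOINT_ID

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": RUNPOD_API_KEY},
    json={"query": mutation},
)
print(resp.json())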
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, I guess keeping two workers active, if at 40% cost, would still be cheaper than an on-demand server.
justin
justinβ€’8mo ago
Serverless GPUs for AI Inference and Training
Serverless GPUs to deploy your ML models to production without worrying about infrastructure or scale.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I could do it time based depending on when most of my users are online. πŸ€”
justin
justinβ€’8mo ago
their pricing for serverless. Yeah, if u have some sort of backend keeping track of stuff im sure u could get away with one active, and if u see an increase in users, scale the minimum workers to 2 or 3 and have those do parallel concurrency requests
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Do you think those cold boot times are realistic?
StableDiffusion
210ms
271ms
justin
justinβ€’8mo ago
it rlly rlly depends. For production workloads they have a cache mechanism called fastboot, where supposedly i heard someone else get pretty good times. i guess their fastboot just loads the model into memory and stuff faster. basically if u have an active worker, more max workers, more requests coming in, ive heard these are more ideal for the caching mechanism. obvs i have never hit such a production workload cause i just use it for one-off tasks and projects i have running
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Is a worker a different pod or are all workers running on the same pod?
justin
justinβ€’8mo ago
a worker is a different pod
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I'm running a chat bot with 100s of users and I do have parallel access there already.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Telegram
AI Companions
This group contains a collection of AI companions that you can use 24/7. For questions, feedback and anything else, please message @ai_companions_admin :)
justin
justinβ€’8mo ago
I mean, u can just modify the concurrency modifier then, like my code shows, and have a worker take on more requests in parallel. this shows it, and also another thread https://discord.com/channels/912829806415085598/1200525738449846342 with a more primitive example from when i was initially trying to get it working
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Oh nice, I have to look into that.
justin
justinβ€’8mo ago
yea prob. i just never wrote code to dynamically max out based off of gpu memory, which is what i think the best way to handle it is. but if u knew a normal safe amt for ex, u could just hard code it
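A sketch of that "scale concurrency off free GPU memory" idea (nobody in the thread actually wrote this). The 2 GiB-per-request figure is an invented placeholder you would have to measure for your own model; the function could be passed as the concurrency_modifier in the earlier sketch instead of a hard-coded value:

import torch

def gpu_based_concurrency(current_concurrency):
    # Heuristic: allow one in-flight request per ~2 GiB of free VRAM, capped at 8.
    # The 2 GiB figure is a made-up placeholder; measure your own model's per-request usage.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    per_request = 2 * 1024**3
    return max(1, min(8, free_bytes // per_request))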
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, GPU mem is a thing ... but if each worker has its own pod, that should be fine?
justin
justinβ€’8mo ago
the concurrency modifier supposedly can be dynamically modified, yea. i guess it depends how ur code handles ur gpu memory getting blown up lol. i think most gpus just crash the system, but some libraries prevent it from happening. i had some instances where im running like three llms and i blew up the memory
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
hmm
justin
justinβ€’8mo ago
i mean it sounds like if ur doing it already with a pod it isnt an issue. at the end of the day, ur right, its just a pod
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
My production system is on GCP at the moment, but I'm playing around with runpod. Running one pod only though, on-demand.
justin
justinβ€’8mo ago
got it. Another option to look into is fly.io, but i do think runpod's serverless offering is what i like the most, plus the ease of testing on a pod
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Never heard of that, I gotta look into that. πŸ™‚
justin
justinβ€’8mo ago
but fly.io might also be a good gpu alternative provider. they also do dynamic scaling on requests i think? at least for their normal cpu stuff, havent tried their gpu offering yet. also more expensive than runpod, prob cheaper than gcp. But more established i think fly.io is
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
https://fly.io/docs/about/pricing/ Phew, that's expensive. πŸ˜„
Fly
Fly.io Resource Pricing
Documentation and guides from the team at Fly.io.
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Are those the smallest GPUs they got? You'd be surprised. πŸ˜„
justin
justinβ€’8mo ago
yea xD why i havent moved to them yet
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I pay $260 per month for GCP, but it's 24/7. I actually haven't found a cheaper alternative.
justin
justinβ€’8mo ago
runpod prob is the best i think in terms of cost, but hmmm, has its quirks. dang what kinda gpu is running for gcp to get that low πŸ‘οΈ
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Look at the secure cloud on-demand offer on runpod, same price. πŸ˜‰ NVIDIA T4, has been fine so far. The GPU is funnily enough not my problem so far. πŸ˜„
justin
justinβ€’8mo ago
yeaaa lol. i guess cause i use serverless
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
It's the model, I want to upgrade to llama3 and in doing so I started looking into the easiest way to run it.
justin
justinβ€’8mo ago
that's rlly where i think runpod's best offering rn is as a differentiator
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's how I came to runpod. I have to look into serverless I guess. πŸ˜„
justin
justinβ€’8mo ago
my cost rn i keep much lower than other providers since i dont need to pay 24/7. but if u got users, and only paying 260 a month, that's pretty good
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, I need 24/7, there is no other way.
justin
justinβ€’8mo ago
yea, i think try with serverless, min: 1, max: 10. maybe stress test or shadow mode run it. could be worth it 🀷 gl gl!
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Thank you so much for all that input! ❀️ 🫢
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
How does it work though? When I try to use this vLLM quick deploy, there are no templates I can choose from. The list is just empty. Same when I try 'New Endpoint' - am I dumb or is the site buggy? πŸ˜„
No description
No description
nerdylive
nerdyliveβ€’8mo ago
its optional right? just fill in what is required or what you need to configure then
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
kk, lemme try
nerdylive
nerdyliveβ€’8mo ago
alright did it work
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Nah, no clue how this works. πŸ˜„ I've deployed something (https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS - which works with vLLM) and then tried this request tab, and the request is just stuck, sitting there, but all 5 workers are running, happily ticking away the money. πŸ˜„
No description
No description
nerdylive
nerdyliveβ€’8mo ago
whaat cancel the request bro who this works? what do you mean by that?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Sorry, 'how'. How this works. I don't know how this serverless interface works.
nerdylive
nerdyliveβ€’8mo ago
well i think vllm works with openai's library. wait lemme find the docs
nerdylive
nerdyliveβ€’8mo ago
OpenAI compatibility | RunPod Documentation
The vLLM Worker is compatible with OpenAI's API, so you can use the same code to interact with the vLLM Worker as you would with OpenAI's API.
nerdylive
nerdyliveβ€’8mo ago
try using that
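What the docs describe boils down to pointing the official openai package at the endpoint's /openai/v1 base URL. A minimal sketch with placeholder endpoint ID, API key, and model name:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # your endpoint ID
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="Undi95/Llama3-Unholy-8B-OAS",  # whatever model the endpoint serves
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Reply with: Hello, World!"},
    ],
)
print(response.choices[0].message.content)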
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's what I assumed the request on the website does, so that I don't even have to write code. πŸ˜„ If that already doesn't work, do you expect my code to work?
nerdylive
nerdyliveβ€’8mo ago
wait, did you setup network storage with that? i think its downloading the model if im not wrong? it needs to store them in network storage
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I guess there is none by default. Funny. D:
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, you're right of course. That won't work. Lemme add some.
nerdylive
nerdyliveβ€’8mo ago
hmm yeah, maybe by default they should recommend setting up a network volume when creating a vllm endpoint
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
And then, I have no clue which data center to choose because they have different GPUs there, and I don't even understand how that's connected.
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
My workers have all kinds of GPUs.
nerdylive
nerdyliveβ€’8mo ago
yeah nice pick any
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
No description
No description
No description
No description
No description
nerdylive
nerdyliveβ€’8mo ago
btw i'd suggest cancelling the request and trashing the running workers first
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I'll just pick the US one because they've probably got the most there.
nerdylive
nerdyliveβ€’8mo ago
okay ok any works
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
cancelled. πŸ‘ How can I trash the workers?
nerdylive
nerdyliveβ€’8mo ago
click on the green one, its like a button, then theres a trash logo, press it and confirm
curl https://api.runpod.ai/v2/${YOUR_ENDPOINT_ID}/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: ${RUNPOD_API_KEY}" \
-d '{
"model": "openchat/openchat-3.5-1210",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Reply with: Hello, World!"
}
]
}'
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ah, found it.
No description
nerdylive
nerdyliveβ€’8mo ago
here's if you wanna test using curl later, but i'd suggest using the python code from the docs yes
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
The workers automatically get moved into stale actually and new ones initialize. So at least that one is done for me. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
ok after you setup the network volume try sending 1 request first, and see the log
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Waiting for them to come up now.
nerdylive
nerdyliveβ€’8mo ago
yeh great they moving region i think
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I assume so too.
nerdylive
nerdyliveβ€’8mo ago
it should be downloading first, not sure never used vllm template before hahah
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I ended up clicking Romania, because this one had the A4500 in it but since the workers get reset, that actually didn't matter. I could have chosen any data center.
nerdylive
nerdyliveβ€’8mo ago
later after 1 running please send me the logs i want to see them too yeaah
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
It has to, yeah. That model is not just randomly available. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
which do you prefer, or just test around the gpus? what gpu did you use in gcp btw?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
NVidia T4, the smallest one they got.
nerdylive
nerdyliveβ€’8mo ago
Oh is that like 16gb ram here?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yes, iirc.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
hahahah
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I created a 30GB volume, not sure how that can be.
nerdylive
nerdyliveβ€’8mo ago
oof so fast
nerdylive
nerdyliveβ€’8mo ago
how big is your model hahah
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
The model is like 5GB.
nerdylive
nerdyliveβ€’8mo ago
hmm im clueless about what the template puts in there, but if you want to check it out you can create a gpu pod (not necessary) or check out the template's github
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Yeah, not a fan of opaque templates. Had I just booted up a vanilla Ubuntu on AWS, I would already be done. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
nerdylive
nerdyliveβ€’8mo ago
no really, this doesn't look like 5 GB
No description
nerdylive
nerdyliveβ€’8mo ago
isnt it true that you need to download all of em to get them working?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
You might be right. I'm used to models where they have one file per model version. But this one is split into multiple files, you need all of them. Ok, 30GB ain't gonna cut it.
nerdylive
nerdyliveβ€’8mo ago
looks like 32gb+ 🀣
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I need to add like a 100GB volume or so, because I'm sure unpacked it's even bigger.
nerdylive
nerdyliveβ€’8mo ago
nah nah try 40 first
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Alright, let's try again.. haha
nerdylive
nerdyliveβ€’8mo ago
later if it runs out of space you can scale it up again hahah Brb
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Now we're stuck in initializing. πŸ˜„
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
btw, I've shared this convo in the bloke discord where I got the pointer to come over here. We're trying to figure out runpod over there: https://discord.com/channels/1111983596572520458/1112353569077735595 πŸ˜„ It's still in init. I'll just create a fresh endpoint, this time with the correct config. πŸ˜„
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ah, there is a container disk config right there actually. I've set that to 50 now.
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ah wait, that's the blank template ... I'll try vLLM of course.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's the right one. Default there is 30GB, just shy of the correct size. πŸ˜„
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I've set it to 50 GB now.
nerdylive
nerdyliveβ€’8mo ago
I'm not in the server yet so I can't see that channel link. Okok
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I'm basically doing the same in Python:
import requests

url: str = "https://api.runpod.ai/v2/vllm-zeul7en2ikchq9/openai/v1/chat/completions"

data = {
    "model": "Undi95/Llama3-Unholy-8B-OAS",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
}

headers = {
    "Authorization": "Bearer XXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "Content-Type": "application/json"
}

response = requests.post(url, json=data, headers=headers)

print(response.json())
Not much success though:
Connected to pydev debugger (build 233.15026.15)
{'error': 'Error processing the request'}

Process finished with exit code 0
Checking the logs now.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
btw: The cold start time is of course useless. πŸ˜„ 90s haha
No description
nerdylive
nerdyliveβ€’8mo ago
Woah the logs are so blurry, pls copy and paste the text instead. Maybe only the first time boot ah. What are the logs like?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
wow, that screenshot is useless. I wonder why
2024-05-04 22:37:23.829 [shns1kb4sswpl4] [error] Job has missing field(s): input.
2024-05-04 22:37:23.829 [shns1kb4sswpl4] [info] --- Starting Serverless Worker | Version 1.6.2 ---
2024-05-04 22:35:07.167 [shns1kb4sswpl4] [info] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298] ' }}{% endif %}
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298]
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298] '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298]
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298] {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 serving_chat.py:298] Using supplied chat template:
2024-05-04 22:35:06.656 [shns1kb4sswpl4] [info] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-04 22:35:06.126 [shns1kb4sswpl4] [info] INFO 05-04 13:35:06 model_runner.py:756] Graph capturing finished in 6 secs.
2024-05-04 22:35:00.381 [shns1kb4sswpl4] [info] INFO 05-04 13:35:00 model_runner.py:688] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
2024-05-04 22:35:00.381 [shns1kb4sswpl4] [info] INFO 05-04 13:35:00 model_runner.py:684] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-05-04 22:34:59.115 [shns1kb4sswpl4] [info] INFO 05-04 13:34:59 llm_engine.py:357] # GPU blocks: 1021, # CPU blocks: 2048
2024-05-04 22:34:56.786 [shns1kb4sswpl4] [info] INFO 05-04 13:34:56 weight_utils.py:257] Loading safetensors took 3.75s
2024-05-04 22:33:56.724 [shns1kb4sswpl4] [info] INFO 05-04 13:33:53 weight_utils.py:163] Using model weights format ['*.safetensors']
2024-05-04 22:33:56.724 [shns1kb4sswpl4] [info] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-04 22:33:56.724 [shns1kb4sswpl4] [info] INFO 05-04 13:33:49 llm_engine.py:87] Initializing an LLM engine with config: model='Undi95/Llama3-Unholy-8B-OAS', tokenizer='Undi95/Llama3-Unholy-8B-OAS', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir='/runpod-volume/huggingface-cache/hub', load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
2024-05-04 22:33:56.724 [shns1kb4sswpl4] [info] Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
I thought I'll try it through the request interface as well, but also, no success.
No description
nerdylive
nerdyliveβ€’8mo ago
I think the right way for vllm is through the openai endpoints, not the run or runsync ones
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ohhh, I didn't realize 'run' is the 'run' endpoint, not just 'run the query'. Bad naming. πŸ˜„ What's the difference?
nerdylive
nerdyliveβ€’8mo ago
What's next
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's it.
No description
nerdylive
nerdyliveβ€’8mo ago
Isn't there a running worker? You can press it and press logs
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ha, indeed. That UI is... not easy to navigate. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
Hmm I don't get what you mean, but it should be handled by the worker, and the openai support is a kinda new thing so have to check the docs. The logs are kinda bad yeah
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Just started another run via Python:
2024-05-04T13:37:23.829697879Z --- Starting Serverless Worker | Version 1.6.2 ---
2024-05-04T13:37:23.829747990Z {"requestId": null, "message": "Job has missing field(s): input.", "level": "ERROR"}
2024-05-04T13:38:34.395918737Z {"requestId": null, "message": "Job has missing field(s): input.", "level": "ERROR"}
2024-05-04T13:39:22.656474814Z {"requestId": "sync-aed84f4f-98b5-4829-90b6-88c831a1528d-e1", "message": "Finished running generator.", "level": "INFO"}
2024-05-04T13:39:22.719179137Z {"requestId": "sync-aed84f4f-98b5-4829-90b6-88c831a1528d-e1", "message": "Finished.", "level": "INFO"}
2024-05-04T13:41:22.288701791Z {"requestId": null, "message": "Failed to get job, status code: 502", "level": "ERROR"}
Only the last line is the latest request. HTTP 502 And super helpful logs.
nerdylive
nerdyliveβ€’8mo ago
Kek. A little searching helps. The "Job has missing field(s): input" thingy is from the web's run request right? Not from your python file?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
No, it's my python request. Only the last one
nerdylive
nerdyliveβ€’8mo ago
Or else the inputs the problem
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
at 13:41 There is no input. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
That's why hahah
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
import requests

url: str = "https://api.runpod.ai/v2/vllm-zeul7en2ikchq9/openai/v1/chat/completions"

data = {
    "model": "Undi95/Llama3-Unholy-8B-OAS",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"}
    ]
}

headers = {
    "Authorization": "Bearer XXXXXXXXXXXXXXXXXXXXXXXX",
    "Content-Type": "application/json"
}

response = requests.post(url, json=data, headers=headers)

print(response.json())
nerdylive
nerdyliveβ€’8mo ago
Try input then
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's what I did.
nerdylive
nerdyliveβ€’8mo ago
Try using openai's package instead bro
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Hey Daddy!
nerdylive
nerdyliveβ€’8mo ago
Hello papa madiator
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
C:\Users\busin\Documents\GitHub\ai_girlfriend\src\.venv\Scripts\python.exe C:\Users\busin\Documents\GitHub\ai_girlfriend\playground\playground.py
I'm not familiar with RunPod, and I don't have enough information to determine whether it's the best platform or not. Can you please provide more context or details about RunPod and what it does? This will help me better understand your question and provide a more accurate response.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I see what's going on here! As a conversational AI, I don't have any prior knowledge about RunPod, and I'm not biased towards any particular platform. I'm here

Process finished with exit code 0
Fascinating. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
Ah it works or what Bruh local llm?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
It does. Now I have to figure out what it costs and if it's fast enough. πŸ˜„ No lol. That's the serverless endpoint.
nerdylive
nerdyliveβ€’8mo ago
Ohh. Finally
nerdylive
nerdyliveβ€’8mo ago
Use the openai package bro *3
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
The code is just in a file called playground.py.
nerdylive
nerdyliveβ€’8mo ago
Ohhh
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
And more importantly, how it handles parallel requests. And how to use the model I want to use, which is actually not compatible with vLLM. πŸ˜„
nerdylive
nerdyliveβ€’8mo ago
Sure, try using threads hahahah, or callbacks if there are any. Use the openai package πŸ₯¦ Custom handler πŸ™ Well, a little bit of coding is fun. Good luck
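A quick way to try the "threads" suggestion and see how the endpoint copes with several requests at once; endpoint ID, API key, and model name are placeholders:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1/chat/completions"  # placeholder
HEADERS = {
    "Authorization": "Bearer <RUNPOD_API_KEY>",
    "Content-Type": "application/json",
}

def ask(i):
    payload = {
        "model": "Undi95/Llama3-Unholy-8B-OAS",
        "messages": [{"role": "user", "content": f"Question {i}: say hi"}],
    }
    start = time.time()
    r = requests.post(URL, json=payload, headers=HEADERS, timeout=300)
    return i, r.status_code, time.time() - start

# Fire 5 requests in parallel to see how the endpoint's concurrency holds up.
with ThreadPoolExecutor(max_workers=5) as pool:
    for i, status, elapsed in pool.map(ask, range(5)):
        print(f"request {i}: HTTP {status} in {elapsed:.1f}s")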
Wolfsauge
Wolfsaugeβ€’8mo ago
gz on the test setup, and thanks for linking @!x.com/dominicfrei i hope you will write a nice markdown with the example setup commands/screenshots
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Well, we're not quite there yet. πŸ˜‰
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
@nerdylive @Wolfsauge I've now added that serverless vLLM thing to my bot, but this is what the answers look like. Since I'm going through the OpenAI API I wonder, what can I do about that?
No description
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
cc @drycoco Not quite there yet but I'm working on it. But maybe you can help me with that question. I sometimes get 2min cold boots. What's your advice on workers (active, etc.) and how to make sure that cold boots never take long? πŸ™‚ Do I need to keep active workers at 1? How does the serverless vLLM template handle parallel requests? Also, should I create new threads for those questions? We've been drifting away from the original topic quite a bit. πŸ˜„
Wolfsauge
Wolfsaugeβ€’8mo ago
those are bugs which meta fixed after releasing the models. it's in the config files... generation_config.json, tokenizer_config.json, etc
Wolfsauge
Wolfsaugeβ€’8mo ago
13 fixes after model release, and one or more of them are required to fix the stray eot_id / eos tokens, which aren't properly recognized
No description
Wolfsauge
Wolfsaugeβ€’8mo ago
however, vllm-0.4.0 also has similar symptoms, even when using the fixed model. the minimum vllm version which works for me and llama 3 is 0.4.1. to make it work you need both the fixes in the model files and the updated app, in this case vllm, last time i've checked. the non-gated reuploads of the latest meta models by nousresearch https://huggingface.co/NousResearch and their derivatives sometimes have the issues in generation_config and tokenizer_config fixed, but in different ways than meta - but i don't want to complicate things unnecessarily. chances are your vllm is out of date or the model files you're using aren't "fixed enough" in one way or the other. and it's probably both
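One possible stop-gap while the worker still ships an older vLLM (an untested assumption, not something confirmed in this thread) is to pass the Llama 3 end-of-turn markers as explicit stop sequences on each request:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",  # placeholder endpoint ID
    api_key="<RUNPOD_API_KEY>",
)

response = client.chat.completions.create(
    model="Undi95/Llama3-Unholy-8B-OAS",
    messages=[{"role": "user", "content": "Hi there!"}],
    # Llama 3 end-of-turn / end-of-text markers; generation stops when the model emits them.
    stop=["<|eot_id|>", "<|end_of_text|>"],
)
print(response.choices[0].message.content)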
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
You always know everything, @Wolfsauge 😍 Unfortunately, the serverless config (https://github.com/runpod-workers/worker-vllm) is vLLM 0.3, so what can we do about that? πŸ€”
GitHub
GitHub - runpod-workers/worker-vllm: The RunPod worker template for...
The RunPod worker template for serving our large language model endpoints. Powered by vLLM. - runpod-workers/worker-vllm
Wolfsauge
Wolfsaugeβ€’8mo ago
i don't know. i believe there is no other way than updating to 0.4.1 for llama 3
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
But I don't have access to the underlying pod in serverless, I can only deploy templates provided in runpod (if I understand correctly). @Papa Madiator Maybe you can clarify? πŸ™‚
Madiator2011
Madiator2011β€’8mo ago
No, you can make your own templates too
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Ah, cool! Can you point me to the documentation about how that works on runpod and how to get started? πŸ™‚
Madiator2011
Madiator2011β€’8mo ago
Wdym?
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
How can I create my own template? It hadn't 'clicked' in my head yet how to get started, at least not when posting that message. I've done some research on my own since then. Can I just build a new docker image locally, push it to DockerHub, and then all I need is the name of the image here?
No description
Madiator2011
Madiator2011β€’8mo ago
Build the docker container, push it to a docker registry, then add the new docker image to a template
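That flow is just the standard Docker commands; the image name and tag below are placeholders:

# build locally (or on a beefier build box), then push to Docker Hub
docker build -t youruser/my-vllm-worker:0.1 .
docker push youruser/my-vllm-worker:0.1
# then enter "youruser/my-vllm-worker:0.1" as the container image when creating the RunPod template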
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
Great. Wasn't obvious to me that it's that easy. πŸ˜„ Thank you! I guess I have to try that next to get to my goal. haha So, I got vLLM running locally now to test it out and see if that's an option. The results are really cool, but not the right approach for me. πŸ˜„
============================= test session starts =============================
collecting ... collected 1 item

test_model_wrapper.py::test_complete

======================== 1 passed in 100.25s (0:01:40) ========================
PASSED [100%]

Time to set up: 0.1114599 seconds


Runs: [31.6107654, 31.6132964, 34.2333528, 34.2130582, 34.2042625, 34.2340227, 34.2474756, 34.2105221, 34.2394911, 34.2595923, 34.2379803, 34.2343132, 34.2297294, 34.2636501, 34.223144, 34.2007176, 34.2261261, 34.1895347, 34.2410051, 34.2694363, 34.2138979, 34.219749, 34.2380913, 34.2763124, 34.2519246, 34.24931, 34.2471614, 34.2769107, 42.0093234, 42.0015118, 42.0404314, 42.0552968, 42.0382447, 42.0084274, 42.0031958, 42.054302, 42.002341, 42.0075319, 42.0172066, 47.9992168, 47.953934, 47.9645648, 48.0265022, 48.0249268, 47.951652, 48.0254267, 47.9690977, 49.3527739, 49.3451308, 50.7696567, 50.767339, 50.7534204, 50.775435, 50.759712, 50.7630917, 50.7589791, 50.7768211, 53.4804957, 53.4305316, 56.1890333, 56.1694433, 56.1409132, 56.1523421, 56.1788844, 56.1737424, 56.1204063, 56.1047533, 56.1247003, 56.1293441, 56.1116698, 56.1414604, 57.3100886, 57.2900195, 57.3443251, 58.5790726, 59.776073, 59.6932115, 59.7322922, 59.7565171, 59.7780871, 59.7480857, 60.7950622, 63.1008595, 62.9985082, 63.0248589, 63.0549046, 64.3303572, 64.3385164, 64.3284183, 68.3025694, 68.2875625, 69.6675595, 69.6902716, 70.5738979, 70.6244101, 70.5820796, 71.4233821, 78.2565978, 81.2923646, 99.9321581]


Average run time: 49.985901578 seconds
============================= test session starts =============================
collecting ... collected 1 item

test_model_wrapper.py::test_complete

======================== 1 passed in 100.25s (0:01:40) ========================
PASSED [100%]

Time to set up: 0.1114599 seconds


Runs: [31.6107654, 31.6132964, 34.2333528, 34.2130582, 34.2042625, 34.2340227, 34.2474756, 34.2105221, 34.2394911, 34.2595923, 34.2379803, 34.2343132, 34.2297294, 34.2636501, 34.223144, 34.2007176, 34.2261261, 34.1895347, 34.2410051, 34.2694363, 34.2138979, 34.219749, 34.2380913, 34.2763124, 34.2519246, 34.24931, 34.2471614, 34.2769107, 42.0093234, 42.0015118, 42.0404314, 42.0552968, 42.0382447, 42.0084274, 42.0031958, 42.054302, 42.002341, 42.0075319, 42.0172066, 47.9992168, 47.953934, 47.9645648, 48.0265022, 48.0249268, 47.951652, 48.0254267, 47.9690977, 49.3527739, 49.3451308, 50.7696567, 50.767339, 50.7534204, 50.775435, 50.759712, 50.7630917, 50.7589791, 50.7768211, 53.4804957, 53.4305316, 56.1890333, 56.1694433, 56.1409132, 56.1523421, 56.1788844, 56.1737424, 56.1204063, 56.1047533, 56.1247003, 56.1293441, 56.1116698, 56.1414604, 57.3100886, 57.2900195, 57.3443251, 58.5790726, 59.776073, 59.6932115, 59.7322922, 59.7565171, 59.7780871, 59.7480857, 60.7950622, 63.1008595, 62.9985082, 63.0248589, 63.0549046, 64.3303572, 64.3385164, 64.3284183, 68.3025694, 68.2875625, 69.6675595, 69.6902716, 70.5738979, 70.6244101, 70.5820796, 71.4233821, 78.2565978, 81.2923646, 99.9321581]


Average run time: 49.985901578 seconds
100 completions in about 100s is amazing for the total result. But the individual completion is too slow. I wonder if tabbyAPI (with sequential requests) and multiple parallel workers might actually be better. I expect no more than 2-3 requests at the same time for now. And no more than 5 for a while (assumptions can be wrong of course).
rien19ma
rien19maβ€’8mo ago
Hi @Papa Madiator , This is a weird question, but what if I'm too GPU-poor on my laptop to handle the build and a test of a docker image (model included) for a llama3 70B quantized model, for example? (not exactly picked randomly; it's currently my need :)) I'm ending up popping an instance on GCP to build and test my image before pushing it to Docker Hub, making it accessible for RunPod. Is this the correct workflow? Is there any chance we can exit the running container from inside the POD and reach the underlying Linux where we can docker build . [...] && docker push?
nerdylive
nerdyliveβ€’8mo ago
GPU-poor? meaning your hardware GPU isn't enough to build a docker image? building doesn't need gpu power, except when you run the container locally. also, as another alternative, you can test on runpod too. No, you can't get access to the main linux host
rien19ma
rien19maβ€’8mo ago
Hi, Thanks for your prompt answer. Yes, "GPU-poor" because my laptop has a really bad consumer-class GPU. You might need GPU access even during build time if, for example, you download a model from HF with vllm. But even if you don't need it at build time, I will need it at run time for the test, and as mentioned earlier, unfortunately, I can't do that locally. So, what is the workflow you advise? I'm not sure I fully grasp your second point: "also another alternative you can also test on runpod too"? Thanks a lot πŸ™‚
nerdylive
nerdyliveβ€’8mo ago
well you build the image, then deploy it to runpod, that's what i meant haha
rien19ma
rien19maβ€’8mo ago
ahah
nerdylive
nerdyliveβ€’8mo ago
not really effective i guess, but for using the gpu i think thats the only way? downloading with vllm needs a gpu huh? never knew that. does that work for you?
rien19ma
rien19maβ€’8mo ago
I will rephrase it to be more accurate. I'm using the HF tools to download the model within my image (AutoTokenizer.from_pretrained, AutoModelForCausalLM.from_pretrained), and they raise an error if they can't find a GPU. To be more specific, it's actually quantization_config.py which complains about not finding a GPU for AWQ. I don't think the GPU is involved at any point during downloading; I guess it's just a check to avoid unnecessary model downloading if you don't have a GPU. But I admit I have been lazy, and I will find a way to overcome that point (by downloading the model with another tool) ^^
nerdylive
nerdyliveβ€’8mo ago
Oh yeah correct, if you're using that code it checks for a gpu. try using git on hf maybe? i have no experience downloading huggingface models without that code, sorry πŸ™‚
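For what it's worth, one way to fetch the model files without touching a GPU at all (a sketch, not something tested in this thread) is huggingface_hub's snapshot_download, which only downloads files and never instantiates the model; the repo ID and target directory are placeholders:

from huggingface_hub import snapshot_download

# Downloads the repo's files (weights, tokenizer, configs) without loading the model,
# so no GPU is needed at image build time.
snapshot_download(
    repo_id="<hf-user>/<llama3-70b-quant-repo>",
    local_dir="/models/llama3-70b",
)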
rien19ma
rien19maβ€’8mo ago
Thanks for the advice. I will give it a try. Just to summarize my understanding of the workflow: 1. Build the image locally (in my case: vllm, the model (50 GB) and the required runpod handler) 2. push it to dockerhub 3. deploy a pod/serverless endpoint on runpod and give it a try 4. if the output is not as expected, restart from step 1. Am I correct?
nerdylive
nerdyliveβ€’8mo ago
Yep yep like that wow nice formatting
justin
justinβ€’8mo ago
Personally, what I would do just as advice: https://discord.com/channels/912829806415085598/1194695853026328626 1) Start up a Pytorch Template from Runpod with a GPU pod 2) Manually go through the steps and record them all on the side 3) From there, it becomes much easier to confirm it works 4) Then push to dockerhub 5) Then confirm that your template works by downloading it on the Pod. The reason why is b/c then you can use the pytorch template as a base, and it comes with a lot of stuff in the background. https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless Can look at how I did it here (see the sketch below). I even modified it so that I overwrite Runpod's start.sh script, so that I can launch the jupyter server without a password. Otherwise, for modified templates with a jupyter notebook server, it will usually ask you for a token, which is a bit annoying.
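A minimal Dockerfile sketch of that approach, assuming a RunPod PyTorch base image (the tag is a placeholder, check Docker Hub for current ones) and a handler.py along the lines of the one linked above:

# Base tag is a placeholder; pick a current one from hub.docker.com/r/runpod/pytorch.
FROM runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

# Install the serving stack you recorded while testing manually on the pod.
RUN pip install --no-cache-dir runpod vllm

# Copy in the serverless handler and make it the entrypoint
# (this replaces the base image's start.sh, the script justin mentions overriding).
COPY handler.py /handler.py
CMD ["python", "-u", "/handler.py"]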
rien19ma
rien19maβ€’8mo ago
Is pushing large images (> 10 GB) to Dockerhub painful (really slow) just for me (and my bad upload connection), or is it a pain in the workflow for everyone?
nerdylive
nerdyliveβ€’8mo ago
Oof. Hey, try building in docker cloud or depot dev. Docker cloud build, search that up. Or github runners if possible too
rien19ma
rien19maβ€’8mo ago
Thanks, Justin, I checked your repository before asking my question here. For context, I need to create a working Docker image containing a llama3 70B quantized model. Currently (as other users mentioned in this channel), it's not financially viable due to the delay time (around 120s) using the vLLM template. This implies, at $0.0025/s, a $0.30 cost just to spin up the serverless endpoint. In the runpod docs and in repositories affiliated with runpod, it's mentioned that including the model inside the image reduces the delay time (some users on r/LocalLLaMA claim impressive figures using this technique). To do so, and to test if it works, I was renting an instance on GCP (A100) to build and test my docker image, and I came here to ask if that was the right workflow πŸ™‚ All of that to say, I'm not sure how your repository (and your workflow) fits my use case; wdyt? Thanks πŸ™‚
justin
justinβ€’8mo ago
Showing my repository just to show you the Dockerfile haha. Just wondering, why not use an API by anthropic or mixtral - or are you in prod and looking to optimize costs by self hosting?
rien19ma
rien19maβ€’8mo ago
Ahaha thanks actually I used some part of your dockerfile πŸ™‚
justin
justinβ€’8mo ago
I think the repository I use can also run Llama3, even though it's not in the github readme. I was reading the OpenLLM issues, some people say you can use llama3 with a vllm backend. Obvs my repo is different from an officially supported runpod image tho, but just cause i like my methodology more. Also yes, including a llama3 model inside the image is much faster. the issue is that im not sure a 70B model is able to load easily; once images get to that size, spinning up / building docker images becomes painful. Why I think using an API for LLMs tends to be better, since they solve a lot of problems. Like ex. a llama2 70b model is above 100gb: https://hub.docker.com/layers/justinwlin/llama2_70b_openllm/latest/images/sha256-c481f8dd51481daf995bfbe7075ee44bf0e1dc680cbda767cc97fcb7fd83f5a4?context=repo at least when i tried to build it. This is a painful initialization time, so it's better for models this big to be stored in a network volume, if you really want to self-host your own llm
rien19ma
rien19maβ€’8mo ago
The idea is to validate that a specific quantized 70B llama3 is good enough for our use case (which implies non-technical teams playing with the model) before using it with vLLM in batching mode (therefore no serverless), with millions of prompts to pass every X.
justin
justinβ€’8mo ago
Got it. You can probably just: 1) Start up a Pytorch template with a network volume attached 2) Download the 70B model from that GPU Pod (or use a CPU Pod, since this is pure downloading and not GPU processing - see the sketch below) 3) In the future, you can always launch new pods attached to the network volume and they will all have access to the 70B model πŸ™‚ that is what I would do. Make sure that whatever region you are using for the network volume has the GPUs you want to test with available
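A sketch of step 2, run inside the pod; on GPU pods the attached network volume is typically mounted at /workspace, and the repo ID and target folder are placeholders:

# run inside the pod; the attached network volume is typically mounted at /workspace
pip install -U "huggingface_hub[cli]"
huggingface-cli download <hf-user>/<llama3-70b-quant-repo> --local-dir /workspace/models/llama3-70b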
rien19ma
rien19maβ€’8mo ago
I haven't try yet, but your solution just above was also recommended here: https://www.reddit.com/r/SillyTavernAI/comments/1app7gv/new_guides_and_worker_container_for/
Reddit
From the SillyTavernAI community on Reddit: New guides and worker c...
Explore this post and more from the SillyTavernAI community
justin
justinβ€’8mo ago
ooo nice
rien19ma
rien19maβ€’8mo ago
They also mentioned that even with this solution, they still have some delay time (not 120s of it, but still). Some mention ms-level delay times when the model is included, but you might be right; size matters...
rien19ma
rien19maβ€’8mo ago
Do you remember the ~build time for that ?
No description
nerdylive
nerdyliveβ€’8mo ago
No problem
justin
justinβ€’8mo ago
I don't know, I use something called depot.dev for it, and had a 500gb cache to build it lol. Not sure it's worth it though, I wouldn't do it again for any image > 35GB in size
justin
justinβ€’8mo ago
Depot
Depot
The fastest way to build Docker images.
justin
justinβ€’8mo ago
I really think that's why, for more complex LLM tasks, I just use paid API services and go through a finetuning process, or I just use an available zeroshot model. It just isn't worth self-hosting an LLM unless I really dial it in, and it needs to be a smaller model like a mistral 7b
rien19ma
rien19maβ€’8mo ago
Thanks for the feedback πŸ™‚
!x.com/dominicfrei
!x.com/dominicfreiOPβ€’8mo ago
That's what I'm currently considering: building the image myself using tabbyAPI and https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2 I wish there was anyone out there offering a service that runs https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2 or https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS, but it seems like I have to host them myself. πŸ˜„ Like any decent uncensored RP-capable model, it seems. Or would you know of any?
nerdylive
nerdyliveβ€’8mo ago
I've heard of some websites offering hosted models for roleplay. try finding some AI chat roleplay websites