Maintenance - only a Community Cloud issue?
Hey there!
I just started a new pod and noticed this maintenance window.
Is this only a thing on community cloud or also on secure cloud?
I'm trying to find a way to host a model 24/7 for a chat bot and a three-day maintenance window ain't gonna cut it.
Looking forward to feedback.
hello
wait, yeah these are on both secure and community cloud i think
Really? But how can you run a production system with a three-day downtime?
i dont know what to suggest, those are mandatory right. but if its possible, try to create another pod for HA
In what way mandatory?
I feel stupid, but HA?
Sorry, high availability
like creating some other pods for backup
something like that
its not that frequent ig:
https://docs.runpod.io/hosting/maintenance-and-reliability
like software upgrades, server maintenance
But why that? One pod alone is already as expensive as the GCP server I'm running. And GCP doesn't shut it down for maintenance. I wanted to switch because runpod is more convenient and I prefer to support smaller services instead of the big clouds. But if I end up paying 2x what I pay for GCP, then it doesn't make sense.
But I don't want that. Never touch a running system. I don't want someone fumbling with a server that works.
That's why I'm so confused.
You don't just update a server just to update a server.
Whaat 2x?
Yup.
really how is that possible
My GCP costs me $250 per month for 24/7 uptime. I've deployed Ollama with llama2 there and it runs just fine. The smallest community cloud here would be lower than that ($0.14/h), but even the smallest secure cloud would be the same price ($0.34/h is roughly $250/month).
So, if i need two, that's $500 each month.
yeah i cant do anything about that, but if you're unhappy with the maintenance i guess you can ask support in the webchat or by email about it
Ah, don't worry. It wasn't a complaint, I'm just confused.
I gotta send them a message I guess. Maybe there is a way to opt out?
yeah that makes me wonder too
yeah sure, opt out like what?
Opt out of touching the server I pay for.
like stopping it?
or removing it
on maintenance it wont be billed either btw
Well, that might be true. But I would be losing users and that costs me money.
I need 24/7 uptime.
I'm running a chatbot that's used worldwide.
i see yeah that uptime is important
try the webchat sir, might give you other info about this
Yup, did that right away when you said it. I didn't see there was one.
They are offline at the moment, so we'll have to wait for an answer.
ah
maintenance is for both secure and community, sometimes we need to update drivers etc
I think your best bet is to use serverless?
Serverless with a minimum of like ex. 2 active workers, and max of 10 workers
that way you can also scale as necessary
+ also that way pods will autocycle if something goes wrong
What are active workers in that context? Continuously incurring cost or only when I actually call the endpoint?
Is # workers == # parallel calls possible?
Continuously incurring cost, but at a 40% discount.
Yes, so u can have 0 min workers, 10 max workers
if u only wanna pay for up time
just there will be cold start times
To spin up a gpu etc
Yeah, that's the issue. I need answers within seconds.
Yea so u can have
min worker of 1 or 2
with a max of 10 for ex
That way u always have something running
But because of cold starts, they wouldn't be ready in time I assume?
Parallel function calls to ur handler are possible
can read their docs for example, but u can have a concurrency modifier for how many requests a worker can handle. need to be careful not to blow up the gpu memory tho
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless/blob/main/handler.py
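Roughly what that looks like with the runpod Python SDK — a minimal sketch, assuming the concurrency_modifier hook works the way the linked handler.py uses it; the handler body and the hard-coded limit are placeholders:
```python
import runpod

def handler(job):
    # placeholder: whatever actually runs your model (vLLM, OpenLLM, ...)
    prompt = job["input"]["prompt"]
    return {"output": f"echo: {prompt}"}

def concurrency_modifier(current_concurrency):
    # hard-coded "safe" amount, as discussed above; tune to your GPU memory
    return 4

runpod.serverless.start({
    "handler": handler,
    "concurrency_modifier": concurrency_modifier,
})
```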
On first call it would be slow, but u can adjust an 'idle time' a worker sits in before shutting down
so maybe the first call is slow for new workers
but it can stay online for X seconds or minutes
however u define and avoid more cold start
Ud only be paying cold start if u exceed ur minimum workers
That might be an idea.
And new workers need to spin up to go active
https://github.com/ashleykleynhans/runpod-api - a collection of Python scripts for calling the RunPod GraphQL API
If u want to get rlly into it
u can dynamically set the minimum workers on an endpoint
using a different server
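Something along these lines, I think — a rough sketch only; the saveEndpoint mutation and the workersMin/workersMax field names are my assumptions based on that repo, so double-check them there (the real mutation may also require more fields than shown):
```python
import os
import requests

RUNPOD_API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = "your-endpoint-id"  # placeholder

# Assumed mutation/field names (saveEndpoint, workersMin, workersMax);
# verify against the ashleykleynhans/runpod-api scripts before relying on this.
query = """
mutation {
  saveEndpoint(input: { id: "%s", workersMin: 2, workersMax: 10 }) {
    id
    workersMin
    workersMax
  }
}
""" % ENDPOINT_ID

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": RUNPOD_API_KEY},
    json={"query": query},
)
print(resp.json())
```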
Yeah, I guess keeping two workers active, with that 40% discount, would still be cheaper than an on-demand server.
Serverless GPUs for AI Inference and Training - deploy your ML models to production without worrying about infrastructure or scale.
I could do it time-based depending on when most of my users are online.
their pricing for serverless
Yeah
If u have some sort of backend keeping track of stuff
im sure u could get away with one active
and if u see an increase in users
scale the minimum workers to 2 or 3
and have those do parallel concurrency requests
Do you think those cold boot times are realistic?
it rlly rlly depends
For production workloads, they have a cache mechanism called FlashBoot
where supposedly i heard someone else get pretty good times
i guess their FlashBoot is just to load the model into memory and stuff faster
basically if u have an active worker, more max workers, more requests coming in
ive heard these are more ideal
for the caching mechanism obvs
i have never hit such a production workload cause i just use it for one off tasks and projects i have running
Is a worker a different pod or are all workers running on the same pod?
a worker is a different pod
I'm running a chat bot with 100s of users and I do have parallel access there already.
Telegram: AI Companions - a collection of AI companions that you can use 24/7.
I mean, u can just modify the concurrency modifier then like my code shows
and have a worker take on more requests in parallel
this shows it
and also another thread
https://discord.com/channels/912829806415085598/1200525738449846342
with a more primitive example when i was trying to initially get it working before
Oh nice, I have to look into that.
yea prob, i just never wrote code to dynamically max out based off of gpu memory, which i think is the best way to handle it
but u could if u knew for ex
a normal safe amt
just hard code it
Yeah, GPU mem is a thing ... but if each worker has its own pod, that should be fine?
the concurrency modifier supposedly can be dynamically modified
yea
i guess depends
how ur code handles ur gpu
memory getting blown up lol
i think most gpus just crash the system
but some libraries prevent it from happening
i had some instances where im running like three llms
and i blew up the memory
hmm
i mean it sounds like
if ur doing it already
with a pod
it isnt an issue
at the end the day, ur right its just a pod
My production system is on GCP at the moment, but I'm playing around with runpod. Running one pod only though, on-demand.
got it
Another option to also look into is fly.io
but i do think runpod's serverless offering is what i like the most
and ease of testing on a pod
Never heard of that, I gotta look into that.
but fly.io might also be a good gpu alternative provider
they also do dynamic scaling on requests i think? at least for their normal cpu stuff
havent tried their gpu offering yet
also more expensive than runpod, prob cheaper than gcp
But fly.io is more established i think
Are those the smallest GPUs they got?
You'd be surprised.
yea xD why i havent moved to them yet
I pay $260 per month for GCP, but it's 24/7.
I actually haven't found a cheaper alternative.
runpod prob is the best i think in terms of cost
but hmmm, has its quirks
dang
what kinda gpu is running for gcp to get that low
Look at the secure cloud on-demand offer on runpod, same price.
NVidia T4, has been fine so far.
The GPU is funnily enough not my problem so far.
yeaaa lol. i guess cause i use serverless
It's the model, I want to upgrade to llama3 and in doing so I started looking into the easiest way to run it.
thats rlly where i think runpod's best offering rn is as a differentiator
That's how I came to runpod.
I have to look into serverless I guess.
my cost rn i keep much lower than other providers since i dont need to pay 24/7
but if u got users, and only doing 260 a month
that pretty good
Yeah, I need 24/7, there is no other way.
yea, i think try with serverless, min: 1, max: 10
maybe stress test or shadow mode run it
could be worth it
gl gl!
Thank you so much for all that input! ❤️
How does it work though? When I try to use this vLLM quick deploy, there are no templates I can choose from. The list is just empty. Same when I try 'New Endpoint' - am I dumb or is the site buggy?
its optional right
just fill in what is required or what you need to configure then
kk, lemme try
alright
did it work
Nah, no clue how this works. I've deployed something (https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS - which works with vLLM) and then tried this request tab and the request is just stuck, sitting there, but all 5 workers are running, happily ticking away the money.
whaat
cancel the request bro
who this works? what do you mean by that?
Sorry, 'how'.
How this works.
I don't know how this serverless interface works.
well i think vllm works with openai's library
wait leme find the docs
OpenAI compatibility | RunPod Documentation
The vLLM Worker is compatible with OpenAI's API, so you can use the same code to interact with the vLLM Worker as you would with OpenAI's API.
try using that
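For what it's worth, the doc's pattern boils down to pointing the regular openai client at the endpoint — a minimal sketch, where the endpoint ID, API key and model name are placeholders:
```python
from openai import OpenAI

ENDPOINT_ID = "your-endpoint-id"          # placeholder
RUNPOD_API_KEY = "YOUR_RUNPOD_API_KEY"    # placeholder

# The vLLM worker exposes an OpenAI-compatible route under /openai/v1
client = OpenAI(
    api_key=RUNPOD_API_KEY,
    base_url=f"https://api.runpod.ai/v2/{ENDPOINT_ID}/openai/v1",
)

response = client.chat.completions.create(
    model="Undi95/Llama3-Unholy-8B-OAS",  # the model the endpoint was deployed with
    messages=[{"role": "user", "content": "Hello, who are you?"}],
)
print(response.choices[0].message.content)
```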
That's what I assumed the request on the website does, so that I don't even have to write code.
If that already doesn't work, do you expect my code to work?
wait did you setup network storage with that?
i think its downloading the model if im not wrong? it needs to store them in network storage
I guess there is none by default. Funny. D:
Yeah, you're right of course. That won't work.
Lemme add some.
hmm yeah, maybe by default they should recommend setting up a network volume when creating the vllm endpoint
And then I have no clue which data center to choose, because they have different GPUs there, and I don't even understand how that's connected.
My workers have all kinds of GPUs.
yeah nice
pick any
btw i'd suggest cancelling the request and trashing the running workers first
I'll just pick the US one because they've probably got most there.
okay ok
any works
Cancelled. How can I trash the workers?
click on the green one
its like a button
then theres trash logo
press it and confirm
Ah, found it.
here's one if you wanna test using curl later, but i'd suggest using the python code from the docs
yes
The workers automatically get moved to stale actually, and new ones initialize. So at least that one is done for me.
ok after you setup the network volume
try sending 1 request first, and see the log
Waiting for them to come up now.
yeh great
they moving region i think
I assume so too.
it should be downloading first, not sure never used vllm template before hahah
I ended up clicking Romania, because this one had the A4500 in it but since the workers get reset, that actually didn't matter. I could have chosen any data center.
later, after 1 is running, please send me the logs, i want to see them too
yeaah
It has to, yeah. That model is not just randomly available.
which do you prefer or just test around the gpus
what gpu did you use in gcp btw?
NVidia T4, the smallest one they got.
Oh
is that like 16gb ram here?
Yes, iirc.
hahahah
I created a 30GB volume, not sure how that can be.
oof
so fast
how big is your model hahah
The model is like 5GB.
hmm im clueless about what the template puts in there
but if you want to check it out you can create a gpu pod (not necessary)
or checkout the template's github
Yeah, not a fan of non-transparent templates. Had I just booted up a vanilla Ubuntu on AWS, I would already be done.
https://github.com/runpod-workers/worker-vllm - the RunPod worker template for serving LLM endpoints, powered by vLLM
no really, this doesnt look like 5gb
isnt it true that you need to download all of em to get them working?
You might be right. I'm used to models where they got one file per model version. But this is split into multiple files, you need all of them.
Ok, 30GB ain't gonna cut it.
looks like 32gb+
I need to add like a 100GB volume or so, because I'm sure unpacked it's even bigger.
nah nah try 40 first
Alright, let's try again.. haha
later if it runs out of space you can scale it up again hahah
Brb
Now we're stuck in initializing.
btw, I've shared this convo in the bloke discord where I got the pointer to come over here. We're trying to figure out runpod over there: https://discord.com/channels/1111983596572520458/1112353569077735595
It's still in init. I'll just create a fresh endpoint, this time with the correct config.
Ah, there is a container disk config right there actually. I've set that to 50 now.
Ah wait, that's the blank template ...
I'll try vLLM of course.
That's the right one. Default there is 30GB, just shy of the correct size.
I've set it to 50 GB now.
I'm not in that server yet so I can't see that channel link
Okok
I'm basically doing the same in Python:
Not much success though:
Checking the logs now.
btw: The cold start time is of course useless. 90s haha
Woah the logs so blurry pls copy and paste the text instead
Maybe only the first time boot ah
What's the logs like
wow, that screenshot is useless. I wonder why
I thought I'll try it through the request interface as well, but also, no success.
I think the right way for vllm is from the openai endpoints
Not the run or runsync
Ohhh, I didn't realize 'run' is the 'run' endpoint, not just 'run the query'. Bad naming.
What's the difference?
What's next
That's it.
Isn't there a running worker
You can press it and press logs
Ha, indeed.
That UI is... not easy to navigate.
Hmm I don't get what you mean, but it should be handled by the worker, and the openai support is a kinda new thing so I have to check the docs
The logs are kinda bad yeah
Just started another run via Python:
Only the last line is the latest request.
HTTP 502
And super helpful logs.
Kek
A little searching helps
The 'job has missing fields input' thingy is from the web's run request right?
Not from your python file?
No, it's my python request.
Only the last one
Or else the inputs the problem
at 13:41
There is no input.
That's why hahah
Try input then
That's what I did.
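That tracks — the /run and /runsync endpoints expect the payload wrapped in an "input" object. A minimal sketch (endpoint ID and API key are placeholders, and the exact input fields the vLLM worker accepts are defined in its README, so "prompt" here is an assumption):
```python
import requests

ENDPOINT_ID = "your-endpoint-id"        # placeholder
RUNPOD_API_KEY = "YOUR_RUNPOD_API_KEY"  # placeholder

# /runsync blocks until the job finishes; /run returns a job id immediately,
# which you then poll via /status/<job_id>.
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

payload = {
    "input": {                          # <- the "input" wrapper the error was about
        "prompt": "Hello, who are you?",
    }
}

resp = requests.post(
    url,
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    json=payload,
    timeout=300,
)
print(resp.json())
```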
Try using openai's package instead bro
Hey Daddy!
Hello papa madiator
Fascinating.
Ah it works or what
Bruh local llm?
It does. Now I have to figure out what it costs and if it's fast enough.
No lol. That's the serverless endpoint.
Ohh
Finally
Use the openai package bro *3
The code is just in a file called playground.py.
Ohhh
And more importantly, how it handles parallel requests.
And how to use the model I want to use, which is not compatible with vLLM actually.
Sure, try using threads hahahah or callbacks if there are any
Use the openai package
Custom handler
Well, a little bit of coding is fun
Good luck
gz on the test setup, and thanks for linking @!x.com/dominicfrei i hope you will write a nice markdown with the example setup commands/screenshots
Well, we're not quite there yet.
@nerdylive @Wolfsauge I've now added that serverless vLLM thing to my bot, but that's how the answers look. Since I'm going through the OpenAI API I wonder, what can I do about that?
cc @drycoco
Not quite there yet but I'm working on it. But maybe you can help me with that question. I sometimes get 2min cold boots. What's your advice on workers (active, etc.) and how to make sure that cold boots never take long?
Do I need to keep active workers at 1? How does the serverless vLLM template handle parallel requests?
Also, should I create new threads for those questions? We've been drifting away from the original topic quite a bit.
those are bugs which meta fixed after releasing the models
it's in the config files.. generation_config.json, tokenizer_config.json, etc
there were 13 fixes after the model release; one or more of them is required to fix the stray eot_id / eos tokens, which aren't properly recognized
however, also vllm-0.4.0 has similar symptoms, even when using the fixed model
the minimum vllm version which works for me and llama 3 is 0.4.1
to make it work you need both the fixes in the model files and the updated app, in this case vllm
last time i checked, the non-gated reuploads of the latest meta models by nousresearch https://huggingface.co/NousResearch and their derivatives sometimes have the issues in generation_config and tokenizer_config fixed, but in different ways than meta - but i don't want to complicate things unnecessarily
chances are your vllm is out of date or the model files you're using aren't "fixed enough" in this or the other way. and it's probably both
You always know everything, @Wolfsauge
Unfortunately, the serverless config (https://github.com/runpod-workers/worker-vllm) is vLLM 0.3, so what can we do about that?
i don't know. i believe there is no other way than updating to 0.4.1 for llama 3
But I don't have access to the underlying pod in serverless, I can only deploy templates provided by runpod (if I understand correctly).
@Papa Madiator Maybe you can clarify?
No, you can make your own templates too
Ah, cool! Can you point me to the documentation about how that works on runpod and how to get started?
Wdym?
How can I create my own template? It didn't 'click' in my head yet how to get started, at least not when posting that message. I've done some research on my own since then. Can I just build a new docker image locally, push it to DockerHub, and then all I need is the name of the image here?
Build the docker container, push it to a docker registry, then add the new docker image to a template
Great. Wasn't obvious to me that it's that easy.
Thank you!
I guess I have to try that next to get to my goal. haha
So, I got vLLM running locally now to test it out and see if that's an option. The results are really cool, but not the right approach for me.
100 completions in about 100s is amazing for the total result. But the individual completion is too slow. I wonder if tabbyAPI (with sequential requests) and multiple parallel workers might actually be better. I expect no more than 2-3 requests at the same time for now. And no more than 5 for a while (assumptions can be wrong of course).
Hi @Papa Madiator ,
This is a weird question, but what if I'm too GPU-poor on my laptop to handle the build and a test of a docker image (model included) for a llama3 70B quantized model, for example?
(not exactly picked randomly; it's currently my need :))
I end up popping an instance on GCP to build and test my image before pushing it to Docker Hub, making it accessible for RunPod. Is this the correct workflow?
Is there any chance we can exit the running container from inside the pod and reach the underlying Linux where we can docker build . [...] && docker push?
GPU-poor? meaning your hardware GPU isnt enough to build a docker image?
building doesnt need gpu power except if you run the container locally
also another alternative you can also test on runpod too
No you can't get access to the main linux host
Hi,
Thanks for your prompt answer.
Yes, "GPU-poor" because my laptop has a really bad consumer-class GPU.
You might need GPU access even during build time if, for example, you download a model from HF with vllm. But even if there's no need at build time, I will need it at run time for the test, and as mentioned earlier, unfortunately, I can't do that locally.
So, what is the workflow you advise? I'm not sure I fully grasp your second point: 'also another alternative you can also test on runpod too'?
Thanks a lot
well you build the image, then deploy it to runpod, thats what i meant haha
ahah
not really effective i guess but for using the gpu i think thats the only way?
downloading with vllm needs gpu huh? never knew that
does that works for you?
I will rephrase it to be more accurate. I'm using the HF tools to download the model within my image (AutoTokenizer.from_pretrained, AutoModelForCausalLM.from_pretrained), and they raise an error if they can't find a GPU. To be more specific, it's actually quantization_config.py which complains about not finding a GPU for AWQ. I don't think the GPU is involved at any point during downloading; I guess it's just a check to avoid unnecessary model downloading if you don't have a GPU. But I admit I have been lazy, and I will find a way to overcome that point (by downloading the model with another tool) ^^
Oh yeah correct, if you're using that code
it checks for gpu
try using git on hf maybe? i have no experience downloading huggingface models without that code, sorry
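Another option that should sidestep the GPU check entirely is huggingface_hub's snapshot_download, which just fetches the repo files without instantiating the model — a minimal sketch, with the repo id and target dir as placeholders:
```python
from huggingface_hub import snapshot_download

# Downloads all repo files (safetensors shards, tokenizer/config jsons, ...)
# without loading the model, so no GPU is needed at build time.
snapshot_download(
    repo_id="some-org/some-llama3-70b-awq",  # placeholder repo id
    local_dir="/models/llama3-70b-awq",      # placeholder target dir inside the image
    # token="hf_...",                        # only needed for gated repos
)
```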
Thanks for the advice. I will give it a try.
Just to summarize my understanding of the workflow:
1. Build the image locally (in my case: vllm, the model (50 GB) and the required runpod handler)
2. Push it to dockerhub
3. Deploy a pod/serverless endpoint on runpod and give it a try
4. If the output is not as expected, restart from Step 1
Am I correct?
Yep yep
like that
wow nice formatting
Personally, what I would do just as advice:
https://discord.com/channels/912829806415085598/1194695853026328626
1) Start up a Pytorch Template from Runpod with a GPU pod
2) Manually go through the steps and record it all on the side
3) From there, becomes much easier to confirm if it works
4) Then push to dockerhub
5) Then confirm that your template works by downloading it on the Pod
The reason why is b/c then you can use the pytorch template as a base, and it comes with a lot of stuff in the background
https://github.com/justinwlin/Runpod-OpenLLM-Pod-and-Serverless
Can look at how I did it here; I even modified it so that I overwrite Runpod's start.sh script, so that I can launch the jupyter server without a password
Otherwise for modified templates with a jupyter notebook server, it will usually ask you for a token, which is a bit annoying.
Is pushing large images (> 10 GB) to Docker Hub painful (really slow) just for me (and my bad upload internet connection), or is it a pain in the workflow for everyone?
Oof
Hey try building in docker cloud or depot.dev
Docker cloud build, search that up
Or github runners if possible too
Thanks, Justin,
I checked your repository before asking my question here.
For context, I need to create a working Docker image containing a llama3 70B quantized model. Currently (as other users mentioned in this channel), it's not financially viable due to the delay time (around 120s) using the vLLM template. At $0.0025/s that's about a $0.30 cost just to spin up the serverless endpoint. The runpod docs and repositories affiliated with runpod mention that including the model inside the image reduces the delay time (some users on r/LocalLLaMA claim impressive figures using this technique). To do so and to test if it works, I was renting an instance on GCP (A100) to build and test my docker image, and I came here to ask if it was the right workflow.
All of that to say, I'm not sure how your repository (and your workflow) will fit my use case; wdyt?
Thanks
Showing my repository to show you the Dockerfile haha.
Just wondering, why not use like an API by anthropic or mixtral - or are you in prod, and that's why you're looking to optimize costs by self hosting?
Ahaha thanks, actually I used some part of your dockerfile
I think the repository I use can also use Llama3, even though not on the github readme
I was reading the OpenLLM issues
some people say that you can use the llama3
with a vllm backend
Obvs my repo is different than an officially supported runpod image tho
but just cause i like my methodology more
Also yes, including a llama3 model inside the image is much faster
the issue is that im not sure a 70B model is able to load easily
once images get to that size, to spin up docker images / build it becomes painful
Why I think using an API for LLMs tends to be better since they solve a lot of problems
Like ex. a llama2 70b model is above 100gb:
https://hub.docker.com/layers/justinwlin/llama2_70b_openllm/latest/images/sha256-c481f8dd51481daf995bfbe7075ee44bf0e1dc680cbda767cc97fcb7fd83f5a4?context=repo
at least when i tried to build it
This is a painful initialization time, so better for models this big to be stored in a network volume, if you really want to self-host your own llm
The idea is to validate that a specific quantized 70B llama3 is good enough for our use case (which implies non-technical teams playing with the model) before using it with vLLM in batching mode (therefore no serverless), with millions of prompts to pass every X.
Got it
You can probably just:
1) Start up a Pytorch template with a network volume attached
2) Download the 70B model from that GPU Pod (or use a CPU Pod since this is just for pure downloading and not GPU processing) - see the sketch after this list
3) in the future, you can always launch new pods all attached to the network volume
and they will all have access to the 70B model
that is what I would do
Make sure that whatever region you are using for the network volume has the correct GPU Pod that you want to test with available
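A rough sketch of steps 2 and 3 — assuming the attached network volume shows up at /workspace on the pod (that mount path and the repo id are assumptions on my part):
```python
from huggingface_hub import snapshot_download

# Step 2: run once from a pod (a CPU pod is fine) with the network volume attached.
# The files land on the volume, so any later pod attached to the same volume
# in that region sees them without re-downloading.
snapshot_download(
    repo_id="some-org/llama-3-70b-awq",          # placeholder repo id
    local_dir="/workspace/models/llama-3-70b",   # /workspace = assumed volume mount path
)

# Step 3: a later GPU pod attached to the same volume can then load straight
# from that path instead of the hub id, e.g. with vLLM:
#   from vllm import LLM
#   llm = LLM(model="/workspace/models/llama-3-70b", quantization="awq")
```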
I haven't tried it yet, but your solution just above was also recommended here: https://www.reddit.com/r/SillyTavernAI/comments/1app7gv/new_guides_and_worker_container_for/
ooo nice
They also mentioned that even with this solution, they still have some delay time (not 120s, but still). Some mention milliseconds of delay time when the model is included, but you might be right; size matters...
Do you remember the approximate build time for that?
No problem
I don't know, I use something called depot.dev for it, and had a 500gb cache to build it lol. Not sure it's worth it though, I wouldn't do it again for any image > 35GB in size
I really think that's why, for anything with more complex tasks for LLMs, I just use paid API services and go through a finetuning process, or I just use the available zeroshot model
It just isn't worth self-hosting an LLM unless I really dial it in, and it needs to be for a smaller model like a mistral7b
Thanks for the feedback π
That's what I'm currently considering. Building the image myself using tabbyAPI and https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2
I wish there was anyone out there offering a service that's running https://huggingface.co/hjhj3168/Llama-3-8b-Orthogonalized-exl2 or https://huggingface.co/Undi95/Llama3-Unholy-8B-OAS, but it seems like I have to host them myself.
Like any decent uncensored RP-capable model, it seems.
Or would you know of any?
I've heard of some websites offering hosted models for roleplay, try finding some AI chat roleplay websites