RunPod · 8mo ago
Xangelix

Questions on large LLM hosting

1. I see mentions of keeping a model in a Network Volume to share between all endpoints. But if I already have my model inside a container image, wouldn't the model already be cached in that image? Which is faster for cold boots?
2. My workload is not consistent, so I understand FlashBoot is unlikely to help much, but is there any reason not to enable it? When I hover over it, it says to test output quality first: what does this mean, and why?
3. What is "container disk"? My models are already inside my image and they seem to load fine, so what is its purpose? Additional space to be used at runtime, e.g. if I download a model when the container starts?
6 Replies
XangelixOP · 8mo ago
4. An extra one: how much VRAM does vLLM typically use beyond the weights? I'm testing a model now whose weights only take 38 GB, but I'm getting OOM on 48 GB GPUs...
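For context on why usage exceeds the weights: vLLM preallocates a KV cache on top of the weights (plus some activation and CUDA-graph overhead), sized by layer count, attention heads, and maximum context length. A rough back-of-envelope sketch, using hypothetical example numbers rather than the actual model in this thread:

```python
# Rough estimate of vLLM VRAM beyond model weights.
# Total usage is typically weights + preallocated KV cache + activations/CUDA graphs.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, max_tokens, dtype_bytes=2):
    """Bytes for the KV cache: 2 tensors (K and V) per layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * max_tokens * dtype_bytes

# Hypothetical 70B-class model in fp16 with grouped-query attention.
weights_gib = 38
cache_gib = kv_cache_bytes(
    num_layers=80, num_kv_heads=8, head_dim=128, max_tokens=32768
) / 1024**3

print(f"weights: {weights_gib} GiB, KV cache at 32k tokens: {cache_gib:.1f} GiB")
```

With numbers like these, a 38 GiB model can plausibly blow past 48 GiB once the cache and runtime overhead are added, which is why shrinking the max context length is a common fix.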
XangelixOP · 8mo ago
(screenshot attachment, no description)
XangelixOP · 8mo ago
Could this be fixed with ENFORCE_EAGER=1?
nerdylive · 8mo ago
1. Correct. 2. Search it up; it sometimes helps, but try it yourself to be sure. 3. Yes.
XangelixOP · 8mo ago
This didn't seem to lower it enough. Is this message a typo? Wouldn't I want to raise GPU_MEMORY_UTILIZATION if I'm getting OOM? https://discord.com/channels/912829806415085598/1211740161948524564/1212674202465869864
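For reference, GPU_MEMORY_UTILIZATION in vLLM is the fraction of total VRAM the engine is allowed to claim (weights plus preallocated KV cache), so raising it only helps while there is unclaimed VRAM left; once weights plus cache exceed the card, the usual levers are shrinking the context window or disabling CUDA graphs. A sketch of endpoint settings, assuming the env var names used by RunPod's vLLM worker (verify against the current docs):

```shell
# Hypothetical endpoint environment settings; names assumed from RunPod's
# vLLM worker, check the current docs before relying on them.
GPU_MEMORY_UTILIZATION=0.95   # let vLLM claim up to 95% of total VRAM
MAX_MODEL_LEN=8192            # smaller context window => smaller KV cache
ENFORCE_EAGER=1               # skip CUDA graph capture, saving some VRAM
```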
nerdylive · 8mo ago
Try sending logs in text format next time, yeah. Not sure, I've never used HF's on a deployment, but running the model can take more than the weights alone.