Questions on large LLM hosting
1
I see mentions of keeping a model in a Network Volume to share it between all endpoints. But if I already have my model inside my container image-- wouldn't it already be cached in that image? Which would be faster for cold boots?
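For what it's worth, "baking the model in" usually just means downloading the weights at image build time so they land in an image layer. A minimal sketch of that build step, assuming huggingface_hub and a hypothetical repo id and path (invoked from a Dockerfile RUN step):

```python
# download_model.py -- run at image build time (e.g. RUN python download_model.py)
# so the weights end up cached in an image layer. Repo id and path are hypothetical.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="my-org/my-model",     # hypothetical model repo
    local_dir="/models/my-model",  # path baked into the image
)
```

Whether that image layer or a Network Volume cold-boots faster is the platform question; the sketch only shows the bake-in side.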
2
My workload is not consistent, so I understand FlashBoot is unlikely to help much-- but is there any reason not to enable it? When I hover over it, it says to test output quality first-- what does this mean, and why?
3
What is "container disk"? My models are already inside my image and they seem to load fine-- so what is the purpose of this? Additional space to be used at runtime-- like if I was downloading a model when the container starts?
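To make the "space at runtime" reading concrete, here is a minimal sketch of a startup download that lands on the container disk, assuming huggingface_hub; the mount path and repo id are hypothetical:

```python
# start.py -- runs when the container boots; the download lands on the
# container disk, which therefore has to be sized to hold the weights.
import os
import shutil

from huggingface_hub import snapshot_download

SCRATCH = "/models"  # hypothetical container-disk path
os.makedirs(SCRATCH, exist_ok=True)

# Check how much container disk is available before pulling weights.
free_gb = shutil.disk_usage(SCRATCH).free / 1e9
print(f"{free_gb:.0f} GB free on container disk")

snapshot_download(
    repo_id="my-org/my-model",       # hypothetical model repo
    local_dir=f"{SCRATCH}/my-model",
)
```

If the weights are already in the image, you only need enough container disk for logs, caches, and temp files.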
4 (an extra one)
How much VRAM does vLLM typically use beyond the weights? I'm testing a model now that only uses 38 GB in weights, but I'm getting OOM on 48 GB GPUs...
Could this be fixed with ENFORCE_EAGER=1?
1. Correct
2. Search it up; it does sometimes, but to make sure, try it yourself.
3. Yes
This didn't seem to lower it enough. Is this message a typo? Wouldn't I want to raise GPU_MEMORY_UTILIZATION if I'm getting OOM? https://discord.com/channels/912829806415085598/1211740161948524564/1212674202465869864
Yeah, try sending logs in text format next time.
Not sure, I've never used HF's on a deployment, but running the model can take more than just the weights.
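For reference on the OOM question: beyond the weights, vLLM also holds activation memory, CUDA graphs (skipped when eager mode is enforced), and a pre-allocated KV cache, so total usage is always more than the weight size. A minimal sketch of the same knobs through vLLM's Python API, assuming the endpoint's ENFORCE_EAGER / GPU_MEMORY_UTILIZATION env vars map onto these engine arguments (model id and values are hypothetical):

```python
from vllm import LLM

# Hypothetical settings for a ~38 GB-weights model on a 48 GB card.
llm = LLM(
    model="my-org/my-38gb-model",  # hypothetical model id
    enforce_eager=True,            # skip CUDA graph capture, saving some VRAM
    gpu_memory_utilization=0.95,   # fraction of VRAM vLLM may claim for
                                   # weights + KV cache (raising it = more room)
    max_model_len=8192,            # cap context length so the KV cache fits
)

print(llm.generate("Hello")[0].outputs[0].text)
```

So yes, raising GPU_MEMORY_UTILIZATION gives vLLM more headroom, while lowering max_model_len shrinks the KV cache it has to fit.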