RunPod · 2mo ago
anders

All pods unavailable | help needed for future proof strategy

Region eu-se-1 has all pods unavailable for serverless. I need to protect against this because of SLA commitments, and it's hard because I literally don't know how or where to read about this. On Monday a 1000-2000 USD/month need is expected, so I would love some help. Maybe I am stupid, but I will have to look for alternatives; I'm of course a bit stressed. Hope you guys figure it out, and/or can help me avoid and monitor this problem in the future. Yes, I can set up an endpoint on all clouds, but I would truly need to set active workers to avoid this issue, which defeats the purpose of serverless, unless I can predict the future and set active workers before others do. I don't want to have to program that algorithm.
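A rough sketch of the kind of failover logic this would take (the region list, endpoint IDs, and the shape of the health payload are all placeholders for illustration, not real RunPod data):

```python
# Sketch: pick the first region whose endpoint reports usable workers.
# Endpoint IDs and the health-status dict shape are made up for this example.

REGION_ENDPOINTS = [
    ("eu-se-1", "endpoint-id-a"),
    ("eu-ro-1", "endpoint-id-b"),
    ("us-east", "endpoint-id-c"),
]

def pick_endpoint(health_by_endpoint):
    """Return (region, endpoint_id) for the first region with workers, else None.

    health_by_endpoint maps endpoint id -> {"workers": {"idle": n, "running": m}}.
    """
    for region, endpoint_id in REGION_ENDPOINTS:
        workers = health_by_endpoint.get(endpoint_id, {}).get("workers", {})
        if workers.get("idle", 0) > 0 or workers.get("running", 0) > 0:
            return region, endpoint_id
    return None  # every region is down: time to alert a human

# Example: eu-se-1 reports no workers, so we fall back to eu-ro-1.
health = {
    "endpoint-id-a": {"workers": {"idle": 0, "running": 0}},
    "endpoint-id-b": {"workers": {"idle": 3, "running": 0}},
}
assert pick_endpoint(health) == ("eu-ro-1", "endpoint-id-b")
```

The health data itself would have to come from polling each endpoint, which is exactly the monitoring burden being complained about here.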
8 Replies
flash-singh · 2mo ago
Let me try to help here. eu-se-1 is running into network packet loss issues and we have disabled it to avoid users getting charged for downtime. Are you using network storage, and is that why you need eu-se-1? If so, can you elaborate on how you save/load from storage? We are actively trying to find solutions for endpoints that need network storage: how to load balance them to other regions without data loss.
nerdylive · 2mo ago
Generally we just download the files, set up the folders and files via a pod, and run scripts/apps from the serverless endpoint, or load the models from the endpoint with the scripts/apps baked into the image. Is that what you need? Copying to another region to create redundancy works, I guess, but if the availability isn't great enough then, well... it's not great.
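The pattern described above, in a minimal sketch: the pod pre-populates the network volume, and the serverless worker only reads from it. The mount path and model directory name are assumptions, not taken from the thread:

```python
# Sketch of the volume-as-storage pattern: a pod writes models/scripts to the
# network volume; the serverless handler just checks for and loads them.
# The mount path is an assumption (pods and serverless may mount differently).
import os

VOLUME = "/runpod-volume"                       # assumed serverless mount point
MODEL_DIR = os.path.join(VOLUME, "models", "my-model")  # hypothetical layout

def model_is_ready(path=MODEL_DIR):
    """True only if the pod already placed non-empty model files on the volume."""
    return os.path.isdir(path) and any(os.scandir(path))
```

This is also why a region outage hurts: the worker can only see the volume in its own region.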
flash-singh · 2mo ago
Availability is good across regions; in this case a single region is affected, related to an outage and issues with their ISP. We are trying to figure out the best ways to provide multi-region endpoints even with network storage. So if the network storage is empty, do you have a way to fill it, or is it the source of truth?
nerdylive · 2mo ago
Oh, won't that be slow unless there's some great connection service for it? It depends; some have backups, like models on Hugging Face and scripts to download them, or maybe a script to set up a venv and install pip packages (which are stored in network storage too). Also, on the vLLM worker I queued one request and it works just fine; it launched a worker, but it is still loading a huge model so it took a bit long (running). But it launched a new worker after running for about 2-3 minutes. Is that normal? @flash-singh
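The "backup" idea mentioned here can be sketched as treating the network volume as a cache, with Hugging Face as the fallback source. The downloader below is a stand-in; a real worker might use something like `huggingface_hub.snapshot_download` in its place:

```python
# Sketch: use the cached copy on the volume if present, otherwise re-download.
# download_fn is a placeholder for the real fetch (e.g. a Hugging Face download).
import os

def ensure_model(model_dir, download_fn):
    """Return "cache" if model files already exist, else download and return "download"."""
    if os.path.isdir(model_dir) and any(os.scandir(model_dir)):
        return "cache"
    os.makedirs(model_dir, exist_ok=True)
    download_fn(model_dir)  # fills the empty volume from the backup source
    return "download"
```

With this shape, an endpoint in another region can start from an empty volume, at the cost of one slow first fill, which is the trade-off being debated above.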
flash-singh · 2mo ago
Our vLLM deploy requires the model to be downloaded while it's running. This is something we are working on improving, so that downloading the model is done in the INIT step and doesn't cost anything.
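To illustrate what moving the download into init changes, a sketch with a stand-in loader (not the actual vLLM worker code): the slow load happens once at import time, so the handler itself returns quickly for every request on that worker.

```python
# Sketch of init-time loading. load_model stands in for the multi-minute
# model download; in a real worker it would run before the handler loop starts.

LOAD_COUNT = {"n": 0}

def load_model():
    LOAD_COUNT["n"] += 1   # pretend this is the slow download
    return object()

MODEL = load_model()       # runs once per worker, during init

def handler(job):
    # Per-request work only; the model is already in memory.
    return {"echo": job["input"], "loads": LOAD_COUNT["n"]}

# Two requests on the same worker still trigger only one load.
assert handler({"input": "a"})["loads"] == 1
assert handler({"input": "b"})["loads"] == 1
```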
nerdylive · 2mo ago
Wow, that's really great if you guys do that. But the launching of new workers after 2-3 minutes (for the same request), why is that?
flash-singh · 2mo ago
It's likely still downloading the model, and serverless, seeing the request is still pending, launches another worker. Downloading during init will fix this behaviour.
nerdylive · 2mo ago
I see, right, thanks. I'm going to sleep. Feel free to delete my message if this disrupts the original problem in this thread.