RunPod · 2mo ago
anders

All pods unavailable | help needed for future proof strategy

Region eu-se-1 has all pods unavailable for serverless. I need to protect against this because of SLA commitments, and it's hard because I literally don't know how or where to read about this. On Monday a 1000-2000 USD/month need is expected, so I would love some help. Maybe I am stupid, but I will have to look for alternatives; I'm of course a bit stressed. Hope you guys figure it out, and/or can help me avoid and monitor this problem in the future. Yes, I can set up an endpoint on all clouds, but I would truly need to set active workers to avoid this issue, which defeats the purpose of serverless, unless I can predict the future and set active workers before others do. I don't want to have to program that algorithm.
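A rough sketch of the kind of failover logic this would take (the region list, endpoint IDs, and the shape of the health payload are all placeholders for illustration, not real RunPod data):

```python
# Sketch: pick the first region whose endpoint reports usable workers.
# Endpoint IDs and the health-status dict shape are made up for this example.

REGION_ENDPOINTS = [
    ("eu-se-1", "endpoint-id-a"),
    ("eu-ro-1", "endpoint-id-b"),
    ("us-east", "endpoint-id-c"),
]

def pick_endpoint(health_by_endpoint):
    """Return (region, endpoint_id) for the first region with workers, else None.

    health_by_endpoint maps endpoint id -> {"workers": {"idle": n, "running": m}}.
    """
    for region, endpoint_id in REGION_ENDPOINTS:
        workers = health_by_endpoint.get(endpoint_id, {}).get("workers", {})
        if workers.get("idle", 0) > 0 or workers.get("running", 0) > 0:
            return region, endpoint_id
    return None  # every region is down: time to alert a human

# Example: eu-se-1 reports no workers, so we fall back to eu-ro-1.
health = {
    "endpoint-id-a": {"workers": {"idle": 0, "running": 0}},
    "endpoint-id-b": {"workers": {"idle": 3, "running": 0}},
}
assert pick_endpoint(health) == ("eu-ro-1", "endpoint-id-b")
```

The health data itself would have to come from polling each endpoint, which is exactly the monitoring burden being complained about here.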
8 Replies
flash-singh · 2mo ago
Let me try to help here. eu-se-1 is running into network packet loss issues and we have disabled it to avoid users getting charged for downtime. Are you using network storage, and is that why you need eu-se-1? If so, can you elaborate on how you save/load from storage? We are actively trying to find solutions for endpoints that need network storage: how to load balance them to other regions without data loss.
nerdylive · 2mo ago
Generally we just download the files, set up the folders and files via a pod, and run scripts/apps from the serverless endpoint, or load the models from the endpoint with the scripts/apps baked into the image. Is that what you need? Copying to another region to create redundancy works, I guess, but if the availability isn't great enough then, well... it's not great.
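The pattern described above, in a minimal sketch: the pod pre-populates the network volume, and the serverless worker only reads from it. The mount path and model directory name are assumptions, not taken from the thread:

```python
# Sketch of the volume-as-storage pattern: a pod writes models/scripts to the
# network volume; the serverless handler just checks for and loads them.
# The mount path is an assumption (pods and serverless may mount differently).
import os

VOLUME = "/runpod-volume"                       # assumed serverless mount point
MODEL_DIR = os.path.join(VOLUME, "models", "my-model")  # hypothetical layout

def model_is_ready(path=MODEL_DIR):
    """True only if the pod already placed non-empty model files on the volume."""
    return os.path.isdir(path) and any(os.scandir(path))
```

This is also why a region outage hurts: the worker can only see the volume in its own region.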
flash-singh · 2mo ago
Availability is good across regions; in this case a single region is affected, related to an outage and issues with their ISP. We are trying to figure out the best ways to provide multi-region endpoints even with network storage. So if the network storage is empty, do you have a way to fill it, or is it the source of truth?
nerdylive · 2mo ago
Oh, won't that be slow unless there's some great connection service for it? It depends; some have backups, like models on Hugging Face and scripts to download them, or maybe a script to set up a venv and install pip packages (which are stored in network storage too). Also, on the vLLM worker I queued one request and it works just fine; it launched a worker, but it is still loading a huge model so it took a bit long (running). But it launched a new worker after running for about 2-3 minutes. Is that normal? @flash-singh
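The "backup" idea mentioned here can be sketched as treating the network volume as a cache, with Hugging Face as the fallback source. The downloader below is a stand-in; a real worker might use something like `huggingface_hub.snapshot_download` in its place:

```python
# Sketch: use the cached copy on the volume if present, otherwise re-download.
# download_fn is a placeholder for the real fetch (e.g. a Hugging Face download).
import os

def ensure_model(model_dir, download_fn):
    """Return "cache" if model files already exist, else download and return "download"."""
    if os.path.isdir(model_dir) and any(os.scandir(model_dir)):
        return "cache"
    os.makedirs(model_dir, exist_ok=True)
    download_fn(model_dir)  # fills the empty volume from the backup source
    return "download"
```

With this shape, an endpoint in another region can start from an empty volume, at the cost of one slow first fill, which is the trade-off being debated above.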
flash-singh · 2mo ago
Our vLLM deploy requires the model to be downloaded while it's running. This is something we are working on improving, so that downloading the model is done in the INIT step and doesn't cost anything.
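To illustrate what moving the download into init changes, a sketch with a stand-in loader (not the actual vLLM worker code): the slow load happens once at import time, so the handler itself returns quickly for every request on that worker.

```python
# Sketch of init-time loading. load_model stands in for the multi-minute
# model download; in a real worker it would run before the handler loop starts.

LOAD_COUNT = {"n": 0}

def load_model():
    LOAD_COUNT["n"] += 1   # pretend this is the slow download
    return object()

MODEL = load_model()       # runs once per worker, during init

def handler(job):
    # Per-request work only; the model is already in memory.
    return {"echo": job["input"], "loads": LOAD_COUNT["n"]}

# Two requests on the same worker still trigger only one load.
assert handler({"input": "a"})["loads"] == 1
assert handler({"input": "b"})["loads"] == 1
```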
nerdylive · 2mo ago
Wow, that's really great if you guys do that. But the launching of new workers after 2-3 minutes (for the same request), why is that?
flash-singh · 2mo ago
It's likely still downloading the model, and serverless, seeing the request is still pending, launches another worker. Downloading during init will fix this behaviour.
nerdylive · 2mo ago
I see, right, thanks. I'm going to sleep. Feel free to delete my message if this disrupts the original problem in this thread.