RunPod · 7mo ago
AC_pill

Model loadtime affected if PODs are running on the same server

I was trying to debug latency on my test PODs and found that PODs running on the same physical machine lag badly on IO access. After profiling, I got these results.

Example: initial test on POD
- Running on a single POD, model load time for a 6 GB model is 2 sec
- When I pulled 2 GPUs from the same server, model load time increased to 40 sec

Even inference is affected. RAM leaking?

On Serverless:
- The same GPU (4090) gets different inference and load times as well
- Loading takes 30 sec on some machines and 4 sec on others
- Inference is non-uniform as well: 20 sec on some and 10 sec on some

All are running the same Docker image and the same scripts with the same libraries.

Do we have any work in place to ensure uniformity on HW? Are we enforcing that servers have a separate SSD / NVMe for each GPU, with a separate pipe for IO access?

I need some idea of whether this is a persistent issue. I'm pretty sure the Mbps figures on the descriptors do not reflect reality at all.

EDIT: I'm using the US region now; on Global the problem is worse.
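For anyone comparing machines, one way to separate raw disk throughput from model deserialization is to time a plain sequential read of the weights file on each POD. Below is a minimal sketch, not RunPod tooling: `measure_read_throughput` and `write_test_file` are hypothetical helper names, and the chunk size is an arbitrary assumption. Repeat reads of the same file may be served from the OS page cache, so only the first read on a fresh POD is likely to reflect the disk.

```python
import os
import sys
import tempfile
import time


def measure_read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read a file sequentially; return (bytes_read, seconds, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so Python adds no extra buffer layer.
    # The OS page cache can still serve repeat reads from RAM.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s


def write_test_file(size_bytes, directory=None):
    """Create a throwaway file of random bytes and return its path."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(remaining, 1024 * 1024)
            f.write(os.urandom(n))
            remaining -= n
    return path


if __name__ == "__main__":
    # Usage (assumed): python read_bench.py /path/to/model/weights/file
    target = sys.argv[1]
    size, seconds, rate = measure_read_throughput(target)
    print(f"{size / 1e9:.2f} GB in {seconds:.2f} s ({rate:.0f} MB/s)")
```

Running this on a single POD and then again while a second POD on the same server is loading should show whether the slowdown is already visible at the disk-read level, before any model code runs.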
8 Replies
AC_pill (OP) · 7mo ago
Do we have any answers here?
nerdylive · 7mo ago
Hmm, maybe in the same region?
AC_pill (OP) · 7mo ago
I was using Global before and the problem was worse, and now GPUs in the same region are showing discrepancies as well. There is no uniformity in inference performance. Maybe a cap?
nerdylive · 7mo ago
Or else you may have to ask about this in a support ticket. Maybe it's on pods? I heard some pods have this problem.
AC_pill (OP) · 7mo ago
Is there no support here? This is extremely important to share on a board, so we can see whether the problem repeats.
Yeah, the problem is on both Serverless and PODs. I'm stress testing and it's clear now that it's a hardware issue.
nerdylive · 7mo ago
There is, actually, but they're not very active here because it's easier to report problems on their own platform.
Yeah, or a software cap maybe.
AC_pill (OP) · 7mo ago
Interesting, I'll post that there, but leave this open so other users can see it. I'm already seeing a lot of complaints about the same thing, so it's getting hard to push to production. Yes, a software cap on the Docker host.
nerdylive · 7mo ago
alright @haris