RunPod•2mo ago
jphipps

Network and Local Storage Performance

Hi, we are seeing very slow model loading on our Pods in the IS region. We also see a very slow sequential read time even after copying the same model into local storage. Model loading takes about 10x as long as it did for us on a different network. When we compare sequential read times, reads take about 3x as long on Runpod. Local storage is about 5s faster than network storage.
Our old network read
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 7.09554 s, 978 MB/s

real 0m7.283s
Runpod Network storage
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 20.3418 s, 341 MB/s

real 0m20.351s
user 0m2.008s
sys 0m12.863s
Runpod Local storage
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 17.5572 s, 395 MB/s

real 0m17.560s
user 0m2.279s
sys 0m15.259s
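As a quick sanity check on the figures above, the following sketch recomputes the throughput and slowdown from the byte count and elapsed times that dd printed, and derives the block size implied by the record count. It only uses numbers copied from the output in this post; the dictionary labels are just for illustration.

```python
# Figures copied from the dd output in the post above.
TOTAL_BYTES = 6_938_042_106
FULL_RECORDS = 13_550_863  # the "+1" in dd's output is one trailing partial record

elapsed_s = {
    "old network": 7.09554,
    "Runpod network": 20.3418,
    "Runpod local": 17.5572,
}

# Implied block size: the full records cover everything except the last partial block.
block_size = TOTAL_BYTES // FULL_RECORDS
partial_bytes = TOTAL_BYTES - FULL_RECORDS * block_size

# dd reports decimal megabytes (1 MB = 1,000,000 bytes).
throughput_mb_s = {name: TOTAL_BYTES / s / 1e6 for name, s in elapsed_s.items()}
slowdown = {name: s / elapsed_s["old network"] for name, s in elapsed_s.items()}

print(f"implied block size: {block_size} bytes, trailing partial record: {partial_bytes} bytes")
for name in elapsed_s:
    print(f"{name}: {throughput_mb_s[name]:.0f} MB/s, "
          f"{slowdown[name]:.2f}x the old read time")
```

The implied block size works out to 512 bytes, so each run issued roughly 13.5 million small reads. That may account for part of the high sys time in the Runpod runs (12.863s of 20.351s on network storage), although it does not by itself explain the gap between the two environments.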
8 Replies
Dj•2mo ago
Thank you for the report! I'll file this with engineering. They may ask, so I'll ask too - can you share the command you're using to generate these tests? I think I see dd and time?
jphippsOP•2mo ago
Sounds good, thanks. Here is the command:
time dd if=/tmp/icbinpXL_v6.safetensors of=/dev/null
It is the same command for the network storage, just with a different path.
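Since that dd invocation sets no bs option, it reads in dd's default 512-byte blocks, so part of the elapsed time may be per-call overhead rather than storage throughput. A minimal sketch for repeating the measurement at different block sizes is below. The function name `read_throughput` and the 1 MiB default are assumptions chosen for illustration, not something from this thread.

```python
import time


def read_throughput(path, block_size=1024 * 1024):
    """Read `path` sequentially and return (bytes_read, seconds, mb_per_s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 means each read() below maps to one read system call,
    # which is comparable to how dd reads one block at a time.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:  # empty bytes object signals end of file
                break
            total += len(chunk)
    seconds = time.perf_counter() - start
    # Decimal megabytes per second, the same unit dd prints.
    mb_per_s = total / seconds / 1e6 if seconds > 0 else float("inf")
    return total, seconds, mb_per_s
```

Calling it once with `block_size=512` and once with the default on the same file gives a rough idea of how much of the time is call overhead. A second read of the same file may be served from the operating system's page cache, so only the first read after a fresh copy or mount reflects the storage itself.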
brennen_runpod•2mo ago
Hey @jphipps - could you share which IS datacenter you are using? We have three DCs in that region 🙂 Also, what was the old network region/DC you were using?
jphippsOP•2mo ago
It is EUR-IS-1; the old one was US-TX-3.
Lahl•2mo ago
Hey there. Can we get an update on this? It's creating a major performance impact for our users.
brennen_runpod•2mo ago
I'm following up on this now - just to make sure I have the right information on your workload:
1. Copy model from Network Storage to Local Storage
2. Load model from Local Storage into GPU VRAM
3. Perform work
Step 1 is the one which is most impacted by EUR-IS-1, correct?
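To tell which step dominates, step 1 can be timed on its own. The sketch below times a plain file copy. The helper name `timed_copy` and the paths in the comment are hypothetical, and steps 2 and 3 are left out because the thread does not say which framework loads the model.

```python
import shutil
import time
from pathlib import Path


def timed_copy(src, dst):
    """Copy src to dst and return (bytes_copied, seconds)."""
    start = time.perf_counter()
    shutil.copyfile(src, dst)
    seconds = time.perf_counter() - start
    return Path(dst).stat().st_size, seconds


# Hypothetical usage; both paths are placeholders, not taken from the thread:
# size, seconds = timed_copy("/path/on/network/volume/model.safetensors",
#                            "/tmp/model.safetensors")
# print(f"{size / seconds / 1e6:.0f} MB/s")
```

The copy can return before all data is flushed to the local disk, so this mostly reflects the read side from network storage, which is the part in question here.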
Dj•2mo ago
@Lahl, @jphipps We'd love to understand this better for you.
jphippsOP•2mo ago
@brennen_runpod @Dj Sorry, I missed this message yesterday. Yes, that's right. The model will also sometimes need to swap between GPU and CPU, depending on its size. The read time seems to be very slow.
