jphipps
RunPod
Created by jphipps on 3/25/2025 in #⛅|pods
L40 Thermal throttling
We noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds jumps to 90-120 seconds.
Test run on pod 8hh03rby46hd8s: when power draw rises to ~300 W and SM usage to ~100%, the GPU clock drops from 2490 MHz to 1650 MHz.
- As soon as power draw falls back to its ~80-90 W baseline, the GPU clock returns to full speed.
We're getting about 65% of the GPU's expected performance. Example:
# nvidia-smi dmon
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
0 80 44 - 0 0 0 0 0 0 9001 2490
0 92 44 - 86 4 0 0 0 0 9001 2490
0 272 53 - 100 42 0 0 0 0 9001 2145
0 297 55 - 100 44 0 0 0 0 9001 1770
0 295 55 - 100 44 0 0 0 0 9001 1680
0 300 56 - 100 41 0 0 0 0 9001 1635
0 299 56 - 100 40 0 0 0 0 9001 1755
0 299 57 - 100 40 0 0 0 0 9001 1725
0 301 57 - 100 43 0 0 0 0 9001 1740
0 304 58 - 100 42 0 0 0 0 9001 1680
0 98 49 - 86 4 0 0 0 0 9001 2490
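A quick way to spot these throttling windows in a longer dmon capture is to parse the log and flag samples where SM utilization is high but the core clock has dropped below base. A minimal sketch in Python (the thresholds here are taken from the capture above, not from any NVIDIA spec):

```python
def find_throttled_samples(dmon_text, base_clock_mhz=2490, sm_busy_pct=90):
    """Return (pwr, sm, pclk) tuples for samples that look throttled:
    SM busy above sm_busy_pct but pclk below the expected base clock."""
    throttled = []
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the dmon header/comment lines
        fields = line.split()
        # dmon columns: gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
        pwr, sm, pclk = int(fields[1]), int(fields[4]), int(fields[11])
        if sm >= sm_busy_pct and pclk < base_clock_mhz:
            throttled.append((pwr, sm, pclk))
    return throttled
```

Running this over the capture above flags every ~300 W / 100% SM sample, which makes the correlation between power draw and the downclock easy to see at a glance.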
17 replies
Created by jphipps on 3/7/2025 in #⛅|pods
HTTP reverse proxy disconnects
Hi, I have noticed that when we send an API request to our Pod, we get a 524 response back if we send it through the reverse proxy and the inference job takes over ~20 seconds. This does not happen when we use direct TCP, but with that method we have to handle dynamic ports. This is happening on all of our pods running this job, mostly 4090s.
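A common client-side workaround for proxy timeouts on long inference jobs is to make the request asynchronous: the API accepts the job and returns immediately, and the client polls for the result, so no single HTTP request outlives the proxy's timeout window. A minimal polling helper (the `check_status` callable is a hypothetical stand-in for a status endpoint, not a RunPod API):

```python
import time

def poll_for_result(check_status, interval_s=1.0, timeout_s=300.0):
    """Poll check_status() until it returns a non-None result or we time out.

    check_status is any callable that returns the job result when done,
    or None while the job is still running (e.g. a short GET request to
    a hypothetical /status endpoint that the proxy can answer quickly).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check_status()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("job did not finish within timeout")
```

Each poll completes in well under the proxy's limit, so the 524 never fires regardless of how long the inference itself takes.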
2 replies
Created by jphipps on 3/5/2025 in #⛅|pods
Network and Local Storage Performance
Hi, we are noticing very slow performance loading our model on our Pods in the IS region, and also a very slow sequential read when we copy the same model into local storage. Model loading takes about 10x as long as it did for us on a different network, and sequential reads are about 3x slower on RunPod. Local storage is only about 5 s faster than network storage. Our old network read:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 7.09554 s, 978 MB/s

real 0m7.283s
RunPod network storage:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 20.3418 s, 341 MB/s

real 0m20.351s
user 0m2.008s
sys 0m12.863s
RunPod local storage:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 17.5572 s, 395 MB/s

real 0m17.560s
user 0m2.279s
sys 0m15.259s
12 replies
Created by jphipps on 3/4/2025 in #⛅|pods
4090 Power capped
Hi, I was testing an inference job on a 4090 pod and noticed it was running very slowly. When I checked the NVIDIA logs, I saw a "sw power cap" message once the card reached about 1/3 of its rated power (450 W). How do we get full performance from our 4090 GPU?
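Whether a software cap has been applied can be checked by comparing the current limit against the card's maximum in the output of `nvidia-smi -q -d POWER`. A small parser for that output (the sample text in the test is illustrative; exact field names vary slightly between driver versions):

```python
import re

def parse_power_limits(nvidia_smi_q_output):
    """Extract '<name> Power Limit : <value> W' pairs from the text
    printed by `nvidia-smi -q -d POWER`, returned as a name -> watts dict."""
    limits = {}
    for line in nvidia_smi_q_output.splitlines():
        m = re.match(r"\s*([\w ]*Power Limit)\s*:\s*([\d.]+)\s*W", line)
        if m:
            limits[m.group(1).strip()] = float(m.group(2))
    return limits
```

If the current/enforced limit comes back well below the max (e.g. ~150 W against a 450 W maximum, matching the 1/3 figure above), the card has been power-capped in software rather than by thermals.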
6 replies
Created by jphipps on 2/6/2025 in #⛅|pods
Network Storage question
Hi, I am looking to create several GPU pods that all share the same shared network storage. When I go to create a network storage, it looks like I have to deploy a new GPU pod that is always running. How do I create a storage that doesn't rely on a GPU pod being always on? I want to be able to turn off these pods when they are not being used and use the shared storage when they turn back on.
4 replies