jphipps
RunPod
Created by jphipps on 3/25/2025 in #⛅|pods
L40 Thermal throttling
We noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds jumps to 90-120 seconds.
Test run on pod 8hh03rby46hd8s: when power draw rises to ~300 W and SM usage to ~100%, the GPU clock drops from 2490 MHz to 1650 MHz.
- As soon as power draw falls back to its ~80-90 W baseline, the GPU clock returns to full speed.
We're getting about 65% of the GPU's expected performance. Example:
# nvidia-smi dmon
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
0 80 44 - 0 0 0 0 0 0 9001 2490
0 92 44 - 86 4 0 0 0 0 9001 2490
0 272 53 - 100 42 0 0 0 0 9001 2145
0 297 55 - 100 44 0 0 0 0 9001 1770
0 295 55 - 100 44 0 0 0 0 9001 1680
0 300 56 - 100 41 0 0 0 0 9001 1635
0 299 56 - 100 40 0 0 0 0 9001 1755
0 299 57 - 100 40 0 0 0 0 9001 1725
0 301 57 - 100 43 0 0 0 0 9001 1740
0 304 58 - 100 42 0 0 0 0 9001 1680
0 98 49 - 86 4 0 0 0 0 9001 2490
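A quick way to spot these throttling windows in a longer dmon capture is to parse the log and flag samples where SM utilization is high but the core clock has dropped below base. A minimal sketch in Python (the thresholds here are taken from the capture above, not from any NVIDIA spec):

```python
def find_throttled_samples(dmon_text, base_clock_mhz=2490, sm_busy_pct=90):
    """Return (pwr, sm, pclk) tuples for samples that look throttled:
    SM busy above sm_busy_pct but pclk below the expected base clock."""
    throttled = []
    for line in dmon_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip the dmon header/comment lines
        fields = line.split()
        # dmon columns: gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
        pwr, sm, pclk = int(fields[1]), int(fields[4]), int(fields[11])
        if sm >= sm_busy_pct and pclk < base_clock_mhz:
            throttled.append((pwr, sm, pclk))
    return throttled
```

Running this over the capture above flags every ~300 W / 100% SM sample, which makes the correlation between power draw and the downclock easy to see at a glance.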
17 replies
Created by jphipps on 3/7/2025 in #⛅|pods
HTTP reverse proxy disconnects
Hi, I have noticed that when we send an API request to our Pod, we get a 524 response back if we send it through the reverse proxy and the inference job takes over ~20 seconds. This does not happen when we use direct TCP, but with that method we have to handle dynamic ports. This is happening on all of our pods running this job, mostly 4090s.
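A common client-side workaround for proxy timeouts on long inference jobs is to make the request asynchronous: the API accepts the job and returns immediately, and the client polls for the result, so no single HTTP request outlives the proxy's timeout window. A minimal polling helper (the `check_status` callable is a hypothetical stand-in for a status endpoint, not a RunPod API):

```python
import time

def poll_for_result(check_status, interval_s=1.0, timeout_s=300.0):
    """Poll check_status() until it returns a non-None result or we time out.

    check_status is any callable that returns the job result when done,
    or None while the job is still running (e.g. a short GET request to
    a hypothetical /status endpoint that the proxy can answer quickly).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = check_status()
        if result is not None:
            return result
        time.sleep(interval_s)
    raise TimeoutError("job did not finish within timeout")
```

Each poll completes in well under the proxy's limit, so the 524 never fires regardless of how long the inference itself takes.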
2 replies
Created by jphipps on 3/5/2025 in #⛅|pods
Network and Local Storage Performance
Hi, we are noticing very slow performance loading our model on our Pods in the IS region, and also a very slow sequential read when we copy the same model into local storage. Model loading takes about 10x as long as it did for us on a different network, and sequential reads are about 3x slower on RunPod. Local storage is only about 5 s faster than network storage. Our old network read:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 7.09554 s, 978 MB/s

real 0m7.283s
RunPod network storage:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 20.3418 s, 341 MB/s

real 0m20.351s
user 0m2.008s
sys 0m12.863s
RunPod local storage:
13550863+1 records in
13550863+1 records out
6938042106 bytes (6.9 GB, 6.5 GiB) copied, 17.5572 s, 395 MB/s

real 0m17.560s
user 0m2.279s
sys 0m15.259s
12 replies
Created by jphipps on 3/4/2025 in #⛅|pods
4090 Power capped
Hi, I was testing an inference job on a 4090 pod and noticed it was running very slowly. When I checked the NVIDIA logs, I saw a "sw power cap" message once the card reached about 1/3 of its rated power (450 W). How do we get full performance from our 4090 GPU?
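Whether a software cap has been applied can be checked by comparing the current limit against the card's maximum in the output of `nvidia-smi -q -d POWER`. A small parser for that output (the sample text in the test is illustrative; exact field names vary slightly between driver versions):

```python
import re

def parse_power_limits(nvidia_smi_q_output):
    """Extract '<name> Power Limit : <value> W' pairs from the text
    printed by `nvidia-smi -q -d POWER`, returned as a name -> watts dict."""
    limits = {}
    for line in nvidia_smi_q_output.splitlines():
        m = re.match(r"\s*([\w ]*Power Limit)\s*:\s*([\d.]+)\s*W", line)
        if m:
            limits[m.group(1).strip()] = float(m.group(2))
    return limits
```

If the current/enforced limit comes back well below the max (e.g. ~150 W against a 450 W maximum, matching the 1/3 figure above), the card has been power-capped in software rather than by thermals.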
6 replies
Created by jphipps on 2/6/2025 in #⛅|pods
Network Storage question
Hi, I am looking to create several GPU pods that all share the same shared network storage. When I go to create a network storage, it looks like I have to deploy a new GPU pod that is always running. How do I create a storage that doesn't rely on a GPU pod being always on? I want to be able to turn off these pods when they are not being used and use the shared storage when they turn back on.
4 replies