jphipps
L40 Thermal throttling
We've noticed an occasional big slowdown when running our models: a 10-15 second calculation jumps to 90-120 seconds. We're getting about 65% of the performance of the desired GPU. Example:
Test run on pod: 8hh03rby46hd8s - when power draw goes to ~300W and SM usage to ~100%, the GPU clock drops from 2490MHz to 1650MHz
- as soon as power draw drops back to the ~80-90W base, the GPU clock returns to full speed
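One way to catch the throttle event as it happens is to poll nvidia-smi on the pod and flag samples where the SM clock falls under the expected boost. A minimal sketch — the 2490 MHz boost figure comes from the report above, the 95% threshold is an illustrative choice, and `clocks_throttle_reasons.active` is nvidia-smi's documented query field (reported as a hex bitmask; the SW power cap bit is 0x4):

```python
# On the pod itself you could poll once per second (needs a GPU):
#   nvidia-smi --query-gpu=clocks.sm,power.draw,clocks_throttle_reasons.active \
#              --format=csv,noheader,nounits -l 1

def parse_sample(csv_line: str):
    """Split one CSV line from the query above into (SM MHz, watts, throttle bitmask)."""
    sm_mhz, watts, reasons = (field.strip() for field in csv_line.split(","))
    return int(sm_mhz), float(watts), reasons

def is_throttled(sm_mhz: int, boost_mhz: int = 2490, tolerance: float = 0.95) -> bool:
    # Illustrative rule: flag any sample where the SM clock sits below 95% of boost.
    return sm_mhz < boost_mhz * tolerance
```

Logging the throttle-reasons bitmask alongside the clock makes it easy to tell a power cap (0x4) apart from a thermal slowdown when the drop happens.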
17 replies
HTTP reverse proxy disconnects
Hi, I have noticed that when we send an API request to our pod, we get a 524 response back if we go through the reverse proxy and the inference job takes over ~20 seconds. This does not happen when we use direct TCP, but with that method we have to handle dynamic ports. This is happening on all of our pods running this job, mostly 4090s.
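A 524 is the proxy's "origin timed out" status, so the usual workaround is to keep every HTTP request short: have the endpoint return a job ID immediately and let the client poll for the result, so no single request outlives the proxy's timeout window. A minimal in-process sketch of that pattern using only the standard library — all names here are illustrative, not a RunPod API:

```python
import threading
import uuid

jobs = {}  # job_id -> {"status": "running" | "done", "result": ...}

def submit(fn, *args):
    """Start the slow work in the background and return a job ID right away."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "running", "result": None}

    def run():
        jobs[job_id]["result"] = fn(*args)
        jobs[job_id]["status"] = "done"

    threading.Thread(target=run, daemon=True).start()
    return job_id

def poll(job_id):
    """Cheap status check; each call is a short request, well under any proxy timeout."""
    return jobs[job_id]["status"], jobs[job_id]["result"]
```

In a real service, `submit` and `poll` would sit behind two HTTP routes, with the job store in something durable (Redis, a database) rather than a module-level dict.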
2 replies
Network and Local Storage Performance
Hi, we are noticing very slow performance loading our model on our pods in the IS region. We are also seeing very slow sequential reads when we copy the same model into local storage. Model loading takes about 10x as long as it did for us on a different network, and sequential reads take about 3x as long on Runpod. Local storage is only about 5s faster than network storage.
[Benchmark screenshots attached: our old network read, Runpod network storage, Runpod local storage]
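To make the comparisons apples-to-apples, it helps to measure sequential read throughput the same way on every volume. A small stdlib-only sketch (block size and the MB/s unit are illustrative choices; note the OS page cache — only the first read of a file reflects the underlying storage):

```python
import time

def seq_read_mbps(path: str, block_size: int = 1 << 20) -> float:
    """Sequential read throughput in MB/s for a single file, read in 1 MiB blocks."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / (1024 * 1024) / elapsed
```

Running this against the same model file on the network volume and on local disk gives one number per volume that can be compared directly across providers.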
12 replies
4090 Power capped
Hi, I was testing an inference job on a 4090 pod and noticed it was running very slowly. When I checked the NVIDIA logs, I saw a "SW Power Cap" message once the GPU reached about a third of its rated power (450W). How do we get full performance out of our 4090 GPU?
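"SW Power Cap" means the driver is clamping clocks to keep the board under its enforced power limit, which can be set well below the 4090's 450 W maximum. Both limits can be read from nvidia-smi; a small sketch that parses the CSV output — `power.limit` and `power.max_limit` are nvidia-smi's documented query fields, and the 90% cutoff below is an illustrative choice:

```python
# On the pod itself (needs a GPU):
#   nvidia-smi --query-gpu=power.limit,power.max_limit --format=csv,noheader,nounits

def parse_power_limits(csv_line: str):
    """Split the CSV line from the query above into (enforced watts, board max watts)."""
    enforced, maximum = (float(field.strip()) for field in csv_line.split(","))
    return enforced, maximum

def is_power_capped(enforced: float, maximum: float) -> bool:
    # Illustrative rule: flag pods whose enforced limit sits well below the board maximum.
    return enforced < maximum * 0.9
```

If the enforced limit comes back around 150 W on a 450 W board, that would line up with the "about a third of rated power" observation above.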
6 replies
Network Storage question
Hi, I am looking to create several GPU pods that all share the same network storage. When I go to create a network storage volume, it looks like I have to deploy a new GPU pod that is always running. How do I create storage that doesn't rely on a GPU pod being always on? I want to be able to turn these pods off when they are not being used and reattach the shared storage when they come back on.
4 replies