same GPU, different machine -> different speed
The image shows two YOLO object detection runs with an identical setup (same batch size, image size, and number of epochs) on two different RunPod pods. The GPU was an RTX 4090 in both cases.
slow machine
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 On | 00000000:A1:00.0 Off | Off |
fast machine
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| 0 NVIDIA GeForce RTX 4090 On | 00000000:01:00.0 Off | Off |
Training was about 30% faster on the fast machine, and its power consumption was lower.
(1) Is this only due to the driver being newer?
(2) Would the effect be the same for an older GPU, like the A100?
6 Replies
Check the power limit in watts; one of the machines may be power capped.
But why the difference in speed?
Power-capped machines are slower because power capping cripples performance.
Check that the power isn't capped.
I have had power-capped 4090s before in the FR region on Community Cloud. I complained to RunPod to get a refund, and asked for the machine to be delisted and the host to be banned from RunPod, because it amounts to fraud.
Hm... but in my case the GPU that consumes less power is also the faster one. I would expect it to be slower. Also, the "red" GPU (slower speed, higher power consumption) was on Secure Cloud, while the "blue" one is on Community Cloud.
It's not about the consumption, it's about what the maximum power limit is. On a 4090 it should be 450 W.
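If it helps, here is a rough sketch of how you could read the limit on each pod instead of guessing from the plots. It assumes `nvidia-smi` is on the PATH inside the pod and that the driver reports the `power.limit` and `power.default_limit` query fields (some setups return `[N/A]` for these). The helper names (`parse_power_line`, `is_capped`, `query_gpus`) are made up for this example.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
FIELDS = ["name", "power.draw", "power.limit",
          "power.default_limit", "power.max_limit"]


def _to_float(value):
    """Convert a field to float, or None if the driver reports e.g. [N/A]."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_power_line(line):
    """Parse one CSV line (noheader, nounits) into a dict of watt values."""
    # Split from the right so a comma inside the GPU name would not break parsing.
    name, draw, limit, default, maximum = [p.strip() for p in line.rsplit(",", 4)]
    return {
        "name": name,
        "draw_w": _to_float(draw),
        "limit_w": _to_float(limit),
        "default_limit_w": _to_float(default),
        "max_limit_w": _to_float(maximum),
    }


def is_capped(info, tolerance_w=1.0):
    """True if the current limit is below the default limit, None if unknown."""
    if info["limit_w"] is None or info["default_limit_w"] is None:
        return None
    return info["limit_w"] < info["default_limit_w"] - tolerance_w


def query_gpus():
    """Run nvidia-smi and return one parsed dict per GPU."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [parse_power_line(line) for line in out.splitlines() if line.strip()]
```

Run `query_gpus()` on both pods and compare `limit_w` against `default_limit_w`. If the slow pod shows a limit well below its default, it is probably capped. If both show the same limit, the cap is likely not the explanation and something else on the host is.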
Okay, but I have a small dataset, and RAM use is below 10%. Probably the GPU cannot operate at its max because the epochs are too small? Or is max power consumption independent of VRAM use?
Regarding your comment about the max power: in my plots, a consumption of 70 W is about 10% of max power, meaning that 100% would be way above 450 W.
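To spell out that arithmetic, here is a small sketch. It assumes the percentage in the plot is measured against the GPU's power limit, which I have not verified for the logging tool. The 70 W and 10% figures are rough readings from my plots, and the function names are just for illustration.

```python
def implied_limit_w(draw_w, percent_of_limit):
    """Power limit implied by a draw in watts and its share of the limit in percent."""
    return draw_w * 100.0 / percent_of_limit


def percent_of_limit(draw_w, limit_w):
    """Share of the power limit, in percent, that a given draw represents."""
    return 100.0 * draw_w / limit_w


# Rough readings from the plots: 70 W shown as about 10%.
print(implied_limit_w(70, 10))              # 700.0 W, well above 450 W
# What 70 W would be if the limit really were 450 W.
print(round(percent_of_limit(70, 450), 1))  # about 15.6 %
```

So either the percentage plot uses a different reference than a 450 W limit, or my readings of the two plots are too rough to compare.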