L40 Thermal throttling

We've noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds takes 90-120 seconds.
Test run on pod 8hh03rby46hd8s:
- when power draw goes to ~300W and SM usage to ~100%, the GPU clock drops from 2490 MHz to the ~1650 MHz range
- as soon as power draw drops back to the ~80-90W baseline, the GPU clock goes back to full speed
We're getting 65% of the desired GPU performance. Example:
# nvidia-smi dmon
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
0 80 44 - 0 0 0 0 0 0 9001 2490
0 92 44 - 86 4 0 0 0 0 9001 2490
0 272 53 - 100 42 0 0 0 0 9001 2145
0 297 55 - 100 44 0 0 0 0 9001 1770
0 295 55 - 100 44 0 0 0 0 9001 1680
0 300 56 - 100 41 0 0 0 0 9001 1635
0 299 56 - 100 40 0 0 0 0 9001 1755
0 299 57 - 100 40 0 0 0 0 9001 1725
0 301 57 - 100 43 0 0 0 0 9001 1740
0 304 58 - 100 42 0 0 0 0 9001 1680
0 98 49 - 86 4 0 0 0 0 9001 2490
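To see which limiter kicks in during one of these slow runs, the throttle-reason fields can be logged alongside power and clocks (a minimal sketch; the field list and 1-second interval are just a suggestion):
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_slowdown --format=csv -l 1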
Poddy (2w ago)
@jphipps
Escalated To Zendesk
The thread has been escalated to Zendesk!
Jason (2w ago)
Any pod ID?
jphipps (OP, 2w ago)
pod: 8hh03rby46hd8s. @Jason any update?
Dj (2w ago)
I can forward the problematic node to the datacenter to look into. Sorry I hadn't seen this one sooner. @jphipps - The L40 GPU is capped at 300 watts; when you're at full capacity you're hitting the power limit. To make up for this, as you've noticed, the GPU slows down its clock speed. As soon as your usage goes back down, the GPU no longer has to accommodate a higher power draw and raises its clock speed. You're hitting the limits of this card, and by splitting the workload across more than one GPU you won't have this issue anymore.
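You can watch this while the job runs; -q -d takes a comma-separated list of report sections, and the PERFORMANCE section lists the active clock throttle reasons (labelled Clocks Event Reasons on newer drivers), where a power cap should show up as SW Power Cap rather than a thermal slowdown:
nvidia-smi -q -d POWER,PERFORMANCE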
Lahl (2w ago)
Hey @Dj, we're not experiencing these bottlenecks on Coreweave L40s.
Dj (2w ago)
Coreweave appears to give their customers 8 L40s with their bare metal product. The pod in question has one GPU connected.
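From inside the pod you can confirm what's attached; nvidia-smi -L simply lists the visible GPUs:
nvidia-smi -L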
Lahl (2w ago)
L40 specs:
- sustained power limit: 300W
- base clock: 2250 MHz
- boost clock: 2490 MHz
So it should boost to 2490 MHz, which it does, but then it should slow down to the base clock of 2250 MHz, not all the way down to the 1650 MHz range. The design limit of the card is to run at 2250 MHz sustained, so why are we being throttled?
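For reference, the board's own idea of its clocks can be dumped with the CLOCK report section; the output includes Default Applications Clocks and Max Clocks to compare against the datasheet:
nvidia-smi -q -d CLOCK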
Dj (2w ago)
In my testing of a random L40 in another datacenter, I can see the following:
GPU Power Readings
Power Draw : 36.32 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 1185 MHz
Max Customer Boost Clocks
Graphics : 2490 MHz
Which would indicate a sustained power limit of 300W and base/boost clocks that align with the datasheet for the L40. You can view the datasheet here (link to the pdf), or go here (link to NVIDIA's L40 homepage) and click "Download the NVIDIA L40 Product Brief".
Dj (2w ago)
This is also the case when I manually run a stress test on the device: it does its burst, then hits its power limit and settles at a lowered clock speed.
Dj (2w ago)
It has been a while since I needed to make a spreadsheet, so it's not super pretty, but it works. I ran the same stress test and logged every 500ms rather than every 1s (like above), which is a little easier to see. When the GPU believed it was under stress but didn't hit its power draw cap, it provided the speed you expect, but when it's under sustained load it really can't do that, and the longer the test ran the more you can see the clock decline. When the test is cancelled we see another burst related to finishing the job, and then it falls back down to the expected idle clock of 210 MHz.
Dj (2w ago)
You can run:
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.gr,clocks.sm,clocks.mem --format=csv -lms 500
to output data fit for similar visualization.
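If you also want the log to say why the clock is dropping, the same query can include the power limit and the software power-cap throttle flag (the redirect target gpu_log.csv is just an example name):
nvidia-smi --query-gpu=index,timestamp,power.draw,power.limit,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_power_cap --format=csv -lms 500 > gpu_log.csv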
riverfog7 (6d ago)
My experience is similar. Some GPUs are at dangerous levels (not throttling, but showing insanely high temps for a datacenter GPU).
Henky!! (6d ago)
Also account for memory heat; that can sometimes cause separate issues.
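Memory temperature has its own query field, though not every board exposes it; the dmon output above shows mtemp as "-", so it may just read N/A here:
nvidia-smi --query-gpu=index,temperature.gpu,temperature.memory --format=csv -l 5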
