L40 Thermal throttling
We've noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds jumps to 90-120 seconds.
We're getting ~65% of the expected performance of this GPU. Example:
Test run on pod 8hh03rby46hd8s - when power draw goes to ~300W and SM usage to ~100%, the GPU clock drops from 2490MHz to 1650MHz
- as soon as power draw drops back to the ~80-90W baseline, the GPU clock goes back to full speed
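In case it helps debugging, here is a minimal sketch for checking what NVML reports as the throttle reason while the slowdown is happening (assumes nvidia-ml-py / pynvml is installed and the L40 is GPU index 0):

import pynvml

# Minimal sketch: poll NVML once and decode why the clocks are being held back.
# Assumes nvidia-ml-py (pynvml) is installed and GPU index 0 is the L40 in question.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0               # NVML reports milliwatts
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # current SM clock, MHz
reasons = pynvml.nvmlDeviceGetCurrentClocksThrottleReasons(handle)      # bitmask

print(f"power: {power_w:.0f}W, SM clock: {sm_clock}MHz")
if reasons & pynvml.nvmlClocksThrottleReasonSwPowerCap:
    print("throttled: software power cap")
if reasons & pynvml.nvmlClocksThrottleReasonHwSlowdown:
    print("throttled: hardware slowdown (power brake or overheating)")
if reasons & pynvml.nvmlClocksThrottleReasonSwThermalSlowdown:
    print("throttled: software thermal slowdown")

pynvml.nvmlShutdown()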
@jphipps
The thread has been escalated to Zendesk!
any pod id?
pod: 8hh03rby46hd8s
@Jason any update?
I can forward the problematic node to the datacenter to look into. Sorry I hadn't seen this one sooner.
@jphipps - The L40 GPU is capped at 300 Watts; when you're at full capacity you're hitting the power limit. To compensate, as you've noticed, the GPU lowers its clock speed. As soon as your usage goes back down, the GPU no longer has to accommodate a higher power draw and raises its clock speed.
You're hitting the limits of this card, and by splitting the workload across more than one GPU you won't have this issue anymore.
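For what it's worth, you can read the enforced power limit straight from NVML to confirm the 300W cap - a minimal sketch with pynvml (nvidia-ml-py), GPU index 0 assumed:

import pynvml

# Sketch: read the power limits NVML enforces on the card (GPU index 0 assumed).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

enforced = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0   # mW -> W
lo, hi = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)

print(f"enforced power limit: {enforced:.0f}W")
print(f"allowed range: {lo / 1000.0:.0f}W - {hi / 1000.0:.0f}W")

pynvml.nvmlShutdown()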
hey @Dj - we're not experiencing these bottlenecks on Coreweave L40s
Coreweave appears to give their customers 8 L40s with their bare metal product. The pod in question has one GPU connected.
L40 specs:
- sustained power limit: 300W
- base clock: 2250MHz
- boost clock: 2490MHz
So it should boost to 2490MHz, which it does, but under sustained load it should settle at the base clock of 2250MHz, not fall all the way down to the 1650MHz range (1650/2490 ≈ 66%, which lines up with the ~65% performance we're seeing).
The card is designed to run at 2250MHz sustained, so why are we being throttled below that?
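For reference, a sketch to see where the card sits relative to those numbers at any given moment (pynvml again; the 2250/2490 figures are from the specs above, GPU index 0 assumed):

import pynvml

# Sketch: compare the live SM clock against the datasheet base/boost clocks.
BASE_MHZ, BOOST_MHZ = 2250, 2490   # from the L40 datasheet quoted above

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

current = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)       # MHz
max_clock = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz

print(f"current SM clock: {current}MHz (max reported: {max_clock}MHz)")
print(f"vs base {BASE_MHZ}MHz: {100 * current / BASE_MHZ:.0f}%")
print(f"vs boost {BOOST_MHZ}MHz: {100 * current / BOOST_MHZ:.0f}%")

pynvml.nvmlShutdown()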
In my testing of a random L40 in another datacenter, I can see readings that indicate a sustained power limit of 300W and base/boost clocks that align with the datasheet for the L40.
You can view the datasheet here (link to the pdf), or go here (link to nvidia's l40 homepage) and click "Download the NVIDIA L40 Product Brief".

This is also the case when I manually run a stress test on the device: it does its burst, then hits its power limit and settles at a lowered clock speed.

It has been a while since I needed to make a spreadsheet, so it's not super pretty, but it works. I ran the same stress test and logged every 500ms rather than every 1s (like above), which makes the pattern a little easier to see. When the GPU believed it was under stress but didn't hit its power draw cap, it provided the speed you expect; but when it's under sustained load it really can't do that, and the longer the test ran, the more you can see the clock decline. When the test is cancelled, we see another burst related to finishing the job, and then it falls back down to the expected idle clock of 210MHz.

To output data fit for a similar visualization, you can run:
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.gr,clocks.sm,clocks.mem --format=csv -lms 500
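
If you redirect that to a file, here's a rough sketch for plotting it (assuming pandas and matplotlib are available; the column names are what nvidia-smi emits for that query, and the gpu_log.csv filename is just an example):

import pandas as pd
import matplotlib.pyplot as plt

# Rough sketch: plot the nvidia-smi CSV log from the command above.
# Assumes it was redirected to gpu_log.csv; nvidia-smi emits headers like
# "power.draw [W]" and values with units ("295.23 W", "1650 MHz"), so we strip those.
df = pd.read_csv("gpu_log.csv", skipinitialspace=True)
df.columns = [c.split(" [")[0] for c in df.columns]   # drop " [W]" / " [MHz]" suffixes
for col in ("power.draw", "clocks.current.sm"):
    df[col] = df[col].str.extract(r"([\d.]+)", expand=False).astype(float)

fig, ax1 = plt.subplots()
ax1.plot(df.index, df["clocks.current.sm"], label="SM clock (MHz)")
ax1.set_ylabel("SM clock (MHz)")
ax1.set_xlabel("sample (500ms each)")
ax2 = ax1.twinx()
ax2.plot(df.index, df["power.draw"], color="tab:red", label="power (W)")
ax2.set_ylabel("power draw (W)")
plt.title("L40 clock vs power draw")
plt.show()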
My experience is similar. There are some GPUs at dangerous levels - not throttling, but showing insanely high temps for a datacenter GPU.
Also account for memory heat; that can sometimes cause separate issues.
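If anyone wants to check that on their pod, here's a small sketch for reading the temps over NVML. The memory-temperature field isn't exposed on every GPU, so treat that part as best-effort (pynvml, GPU index 0 assumed):

import pynvml

# Sketch: read the GPU core temp plus its thermal slowdown threshold, then
# attempt the memory temp via NVML field values - not every GPU exposes it.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

core = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
slowdown = pynvml.nvmlDeviceGetTemperatureThreshold(
    handle, pynvml.NVML_TEMPERATURE_THRESHOLD_SLOWDOWN)
print(f"core: {core}C (thermal slowdown threshold: {slowdown}C)")

fv = pynvml.nvmlDeviceGetFieldValues(handle, [pynvml.NVML_FI_DEV_MEMORY_TEMP])[0]
if fv.nvmlReturn == pynvml.NVML_SUCCESS:
    print(f"memory: {fv.value.uiVal}C")
else:
    print("memory temperature not exposed on this GPU")

pynvml.nvmlShutdown()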