L40 Thermal throttling

We've noticed an occasional big slowdown when running our models: a calculation that normally takes 10-15 seconds takes 90-120 seconds.
Test run on pod 8hh03rby46hd8s:
- when power draw goes to ~300W and SM usage to ~100%, the GPU clock drops from 2490 MHz to the ~1650 MHz range
- as soon as power draw drops back to the ~80-90W baseline, the GPU clock goes back to full speed
We're getting 65% of the desired GPU performance. Example:
# nvidia-smi dmon
# gpu pwr gtemp mtemp sm mem enc dec jpg ofa mclk pclk
0 80 44 - 0 0 0 0 0 0 9001 2490
0 92 44 - 86 4 0 0 0 0 9001 2490
0 272 53 - 100 42 0 0 0 0 9001 2145
0 297 55 - 100 44 0 0 0 0 9001 1770
0 295 55 - 100 44 0 0 0 0 9001 1680
0 300 56 - 100 41 0 0 0 0 9001 1635
0 299 56 - 100 40 0 0 0 0 9001 1755
0 299 57 - 100 40 0 0 0 0 9001 1725
0 301 57 - 100 43 0 0 0 0 9001 1740
0 304 58 - 100 42 0 0 0 0 9001 1680
0 98 49 - 86 4 0 0 0 0 9001 2490
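To see which limiter kicks in during one of these slow runs, the throttle-reason fields can be logged alongside power and clocks (a minimal sketch; the field list and 1-second interval are just a suggestion):
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.sw_thermal_slowdown,clocks_throttle_reasons.hw_slowdown --format=csv -l 1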
Poddy (2w ago)
@jphipps
Escalated To Zendesk
The thread has been escalated to Zendesk!
Jason (2w ago)
Any pod ID?
jphipps (OP, 2w ago)
pod: 8hh03rby46hd8s. @Jason any update?
Dj (2w ago)
I can forward the problematic node to the datacenter to look into. Sorry I hadn't seen this one sooner. @jphipps - The L40 GPU is capped at 300 watts; when you're at full capacity you're hitting the power limit. To make up for this, as you've noticed, the GPU slows down its clock speed. As soon as your usage goes back down, the GPU no longer has to accommodate a higher power draw and raises its clock speed. You're hitting the limits of this card, and by splitting the workload across more than one GPU you won't have this issue anymore.
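You can watch this while the job runs; -q -d takes a comma-separated list of report sections, and the PERFORMANCE section lists the active clock throttle reasons (labelled Clocks Event Reasons on newer drivers), where a power cap should show up as SW Power Cap rather than a thermal slowdown:
nvidia-smi -q -d POWER,PERFORMANCE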
Lahl (2w ago)
Hey @Dj, we're not experiencing these bottlenecks on Coreweave L40s.
Dj (2w ago)
Coreweave appears to give their customers 8 L40s with their bare metal product. The pod in question has one GPU connected.
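From inside the pod you can confirm what's attached; nvidia-smi -L simply lists the visible GPUs:
nvidia-smi -L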
Lahl (2w ago)
L40 specs:
- sustained power limit: 300W
- base clock: 2250 MHz
- boost clock: 2490 MHz
So it should boost to 2490 MHz, which it does, but then it should slow down to the base clock of 2250 MHz, not all the way down to the 1650 MHz range. The design limit of the card is to run at 2250 MHz sustained, so why are we being throttled?
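For reference, the board's own idea of its clocks can be dumped with the CLOCK report section; the output includes Default Applications Clocks and Max Clocks to compare against the datasheet:
nvidia-smi -q -d CLOCK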
Dj (2w ago)
In my testing of a random L40 in another datacenter, I can see the following:
GPU Power Readings
Power Draw : 36.32 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 1185 MHz
Max Customer Boost Clocks
Graphics : 2490 MHz
Which would indicate a sustained power limit of 300W and base/boost clocks that align with the datasheet for the L40. You can view the datasheet here (link to the pdf), or go here (link to NVIDIA's L40 homepage) and click "Download the NVIDIA L40 Product Brief".
Dj (2w ago)
This is also the case when I manually run a stress test on the device: it does its burst, then hits its power limit and settles at a lowered clock speed.
Dj (2w ago)
It has been a while since I needed to make a spreadsheet, so it's not super pretty, but it works. I ran the same stress test and logged every 500ms rather than every 1s (like above), which is a little easier to see. When the GPU believed it was under stress but didn't hit its power draw cap, it provided the speed you expect, but when it's under sustained load it really can't do that, and the longer the test ran the more you can see the clock decline. When the test is cancelled we see another burst related to finishing the job, and then it falls back down to the expected idle clock of 210 MHz.
Dj (2w ago)
You can run:
nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.gr,clocks.sm,clocks.mem --format=csv -lms 500
to output data fit for similar visualization.
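If you also want the log to say why the clock is dropping, the same query can include the power limit and the software power-cap throttle flag (the redirect target gpu_log.csv is just an example name):
nvidia-smi --query-gpu=index,timestamp,power.draw,power.limit,clocks.sm,temperature.gpu,clocks_throttle_reasons.sw_power_cap --format=csv -lms 500 > gpu_log.csv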
riverfog7 (6d ago)
My experience is similar. Some GPUs are at dangerous levels (not throttling, but showing insanely high temps for a datacenter GPU).
Henky!! (6d ago)
Also account for memory heat; that can sometimes cause separate issues.
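Memory temperature has its own query field, though not every board exposes it; the dmon output above shows mtemp as "-", so it may just read N/A here:
nvidia-smi --query-gpu=index,temperature.gpu,temperature.memory --format=csv -l 5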
