RunPod•8mo ago

Monitor GPU VRAM - Which GPU to check?

I am trying to monitor GPU VRAM usage in a serverless worker. To do this with pynvml I need to provide the index of the GPU. Is there a way to obtain the index of the GPU my worker is using? I did not see this in the environment variables. I do see RUNPOD_GPU_COUNT, but I'm not sure that helps. RunPod seems to monitor CPU and GPU stats already, since they show that information in the web interface. Does the RunPod Python module expose those stats, so we don't have to code our own? Below is a code snippet that reports VRAM usage as a percentage.

import pynvml
import time

# Initialize NVML
pynvml.nvmlInit()

handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # Assuming you have only one GPU

while True:
    # Get the memory information for the GPU
    memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)

    used_vram = memory_info.used // (1024 ** 2)  # Convert bytes to MB
    total_vram = memory_info.total // (1024 ** 2)  # Convert bytes to MB
    vram_usage_percentage = round((used_vram / total_vram) * 100)

    print(f'vram usage: {vram_usage_percentage}%')

    time.sleep(5)

Thanks! 🙂

16 Replies

EncyrptionOP•8mo ago

Maybe I could use GraphQL with PodTelemetry? Where's my GraphQL experts at? 😉

Jason•8mo ago

I've never used GraphQL before. Doesn't the index start from 0? I'm not clear yet: what kind of index are you looking for?

EncyrptionOP•8mo ago

I'm assuming that my worker is using the GPU at index 0. If there are multiple GPUs in the server, that might not be accurate: I might be on GPU 3 while another worker is using GPU 0. I am pretty sure I can get that info with GraphQL. I should be able to query by pod ID, and the result includes PodTelemetry, which contains CPU and GPU stats. I'm just struggling with the documentation for it.
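One way to avoid hard-coding index 0 is to loop over every device NVML reports and print its index, UUID, and VRAM usage. This is a minimal sketch, not something confirmed for RunPod serverless workers: `vram_percent` and `list_visible_gpus` are hypothetical helper names, and it assumes the pynvml package is installed and that NVML can see the GPU(s) the worker runs on.

```python
def vram_percent(used_bytes: int, total_bytes: int) -> int:
    """Return VRAM usage as a whole-number percentage."""
    if total_bytes <= 0:
        return 0
    return round(used_bytes / total_bytes * 100)


def list_visible_gpus() -> list:
    """Return index, UUID and VRAM usage for every GPU NVML can see."""
    import pynvml  # imported lazily so vram_percent works without a GPU

    pynvml.nvmlInit()
    try:
        gpus = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            uuid = pynvml.nvmlDeviceGetUUID(handle)
            if isinstance(uuid, bytes):  # some pynvml releases return bytes
                uuid = uuid.decode()
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append(
                {
                    "index": index,
                    "uuid": uuid,
                    "vram_percent": vram_percent(memory.used, memory.total),
                }
            )
        return gpus
    finally:
        # Always release NVML, even if a query above fails
        pynvml.nvmlShutdown()
```

Calling `list_visible_gpus()` inside the worker and printing each entry shows how many GPUs are actually visible. The UUID may be easier to match against other tooling than a bare index, since it identifies the physical device.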

Jason•8mo ago

Oh, can you figure out what the index is sorted by? Like, what's sorting the index? https://graphql-spec.runpod.io/#definition-PodTelemetry

EncyrptionOP•8mo ago

Yeah, I've seen that. I'm still looking for a good example of making a GraphQL request.

Jason•8mo ago

query pod($input: PodFilter) {
  pod(input: $input) {
    latestTelemetry {
      state
      time
      memoryUtilization
      averageGpuMetrics {
        id
        powerWatts
        memoryUtilization
        percentUtilization
      }
    }
  }
}

Sorry for the bad formatting. Use your own input.

EncyrptionOP•8mo ago

I would need to provide the pod id

Jason•8mo ago

yes correct

EncyrptionOP•8mo ago

So what do I do? Add podId: ${pod_id} to input?

Jason•8mo ago

{"input": {"podId": "MYPODID"}}

            "runtime": {
                "uptimeInSeconds": 135,
                "gpus": [
                    {
                        "id": "GPU-26e2eb9c-c0f5-9870-687c-28cdec1a68ea",
                        "gpuUtilPercent": 0,
                        "memoryUtilPercent": 0
                    }
                ]
            },
            "latestTelemetry": {
                "individualGpuMetrics": [
                    {
                        "id": "GPU-26e2eb9c-c0f5-9870-687c-28cdec1a68ea",
                        "temperatureCelcius": 33,
                        "percentUtilization": 0,
                        "memoryUtilization": 0,
                        "powerWatts": 74
                    }
                ],

It'll be something like this. Maybe you should use:

 latestTelemetry {
      individualGpuMetrics {
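For anyone looking for an example of sending this query from Python, here is a rough sketch using only the standard library. Several things are assumptions rather than confirmed in this thread: the endpoint URL and the `api_key` query parameter, the helper names `build_payload`, `extract_gpu_metrics` and `fetch_gpu_metrics`, and the response being wrapped in the usual GraphQL `data` envelope. The field names, including the `temperatureCelcius` spelling, are copied from the sample response above.

```python
import json
import urllib.request

# Assumption: RunPod's GraphQL endpoint; verify against the RunPod API docs.
RUNPOD_GRAPHQL_URL = "https://api.runpod.io/graphql"

POD_TELEMETRY_QUERY = """
query pod($input: PodFilter) {
  pod(input: $input) {
    latestTelemetry {
      individualGpuMetrics {
        id
        temperatureCelcius
        percentUtilization
        memoryUtilization
        powerWatts
      }
    }
  }
}
"""


def build_payload(pod_id: str) -> dict:
    """Build the JSON body: the query plus the podId variable."""
    return {
        "query": POD_TELEMETRY_QUERY,
        "variables": {"input": {"podId": pod_id}},
    }


def extract_gpu_metrics(body: dict) -> list:
    """Pull the per-GPU metrics list out of a response, or [] if absent."""
    pod = (body.get("data") or {}).get("pod") or {}
    telemetry = pod.get("latestTelemetry") or {}
    return telemetry.get("individualGpuMetrics") or []


def fetch_gpu_metrics(pod_id: str, api_key: str) -> list:
    """POST the query and return the per-GPU metrics for one pod."""
    request = urllib.request.Request(
        # Assumption: the API key is passed as a query parameter
        f"{RUNPOD_GRAPHQL_URL}?api_key={api_key}",
        data=json.dumps(build_payload(pod_id)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    return extract_gpu_metrics(body)
```

Something like `fetch_gpu_metrics("MYPODID", api_key)` should then return a list of dicts, one per GPU, each with an `id` such as `GPU-26e2eb9c-...` and its `memoryUtilization`.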

EncyrptionOP•8mo ago

That's great, thanks! I was going to send that data over the web socket but this is much better. I can just have the browser call this once a second and update CPU/GPU graph. 🙂

Jason•8mo ago

Nice, haha. Oh wait, what are you building? A CPU graph? 🤔

EncyrptionOP•8mo ago

Yeah, I think it is really coming along. Everything works; I just need to update the CPU/GPU graph and display the result media.

Jason•8mo ago

Wow, a ToonCrafter app, cool.

EncyrptionOP•8mo ago

ToonCrafter is just one in the market... I will likely try to add a lot of models before going live. My code builds the interface dynamically, so I should be able to add them pretty fast.

EncyrptionOP•8mo ago
