jherrm
RunPod
Created by jherrm on 4/10/2024 in #⛅|pods
"We have detected a critical error on this machine which may affect some pods." Can't backup data
No description
3 replies
RunPod
Created by jherrm on 3/16/2024 in #⛅|pods
torch.cuda.is_available() is False
Spinning up several H100s (burning money 😅) and no matter which official docker image I use, torch.cuda.is_available() is always False, which prevents me from actually using these GPUs.

I've tried the following docker images:
pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04
pytorch:2.0.1-py3.10-cuda11.8.0-devel-ubuntu22.04

Output of various commands:
print(torch.version.cuda)
12.1
print(torch.cuda.device_count())
7
print(torch.cuda.is_available())
False

nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
If the official images don't work, what am I supposed to do? Thanks for any help!
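
For anyone hitting the same mismatch (a non-zero device_count() alongside is_available() returning False), here is a minimal diagnostic sketch. It is not from the original post and it does not fix anything; it only collects the values shown above in one place and tries to surface the underlying CUDA initialization error. The helper name collect_cuda_diagnostics is hypothetical, and the assumption is that torch.cuda.init() raises an exception carrying the real error when the CUDA runtime cannot start.

```python
import os
import shutil


def collect_cuda_diagnostics():
    """Gather facts useful for debugging a False torch.cuda.is_available().

    Hypothetical helper for illustration; returns a plain dict.
    """
    report = {
        # An empty or wrong value here can hide GPUs from the CUDA runtime.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        # Whether the driver tooling is reachable inside the container.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }

    try:
        import torch
    except ImportError as exc:
        report["torch_import_error"] = str(exc)
        return report

    report["torch_version"] = torch.__version__
    report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["device_count"] = torch.cuda.device_count()
    report["is_available"] = torch.cuda.is_available()

    # Assumption: forcing initialization raises with the underlying CUDA
    # error message when the runtime cannot start.
    try:
        torch.cuda.init()
        report["init_error"] = None
    except Exception as exc:
        report["init_error"] = f"{type(exc).__name__}: {exc}"

    return report


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If init_error is populated, the message it contains is usually a better thing to share with support than the bare False, since it tends to say whether the problem is in the image or on the host machine.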
8 replies