feesta
RunPod
Created by feesta on 3/24/2025 in #⛅|pods
Cuda not connecting to image provisioned for GPU
Started a community pod with 1 GPU (4090) using the RunPod PyTorch image/template (runpod/pytorch:2.4.0-py3.11-cuda12.4). Immediately after starting the pod, the GPU is unavailable to PyTorch even though nvidia-smi seems to see it. This is happening about 20% of the time I start pods with this official container. No errors are thrown in the system or container logs.

root@5c367a0d4ea2:/# python -c "import torch; print(torch.cuda.is_available())"
/usr/local/lib/python3.11/dist-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
root@5c367a0d4ea2:/# nvidia-smi
Mon Mar 24 15:59:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0 Off |                  Off |
|  0%   26C    P8             11W /  450W |       2MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
(abridged due to message length)
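For reference, this is roughly the fail-fast check I run right after the pod comes up, since the availability flag alone can be misleading. It's only a sketch: the retry/exit policy is my own choice, not anything from the RunPod template, and it assumes torch is already installed (as it is in the runpod/pytorch image).

# check_cuda.py - sketch of a fail-fast CUDA check at pod startup.
# The exit-code convention is my own assumption, not RunPod behavior.
import subprocess
import sys

import torch


def cuda_ready() -> bool:
    """Return True only if torch can actually initialize the CUDA driver."""
    if not torch.cuda.is_available():
        return False
    try:
        # Force a real CUDA context, not just the availability flag.
        torch.zeros(1, device="cuda")
        return True
    except RuntimeError:
        return False


if __name__ == "__main__":
    # nvidia-smi seeing the card is not enough; compare both views.
    smi = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    print("nvidia-smi sees:", smi.stdout.strip() or "nothing")
    if cuda_ready():
        print("torch CUDA OK:", torch.cuda.get_device_name(0))
        sys.exit(0)
    print("torch cannot initialize CUDA; stopping so the pod can be recreated")
    sys.exit(1)

If this exits 1 I just terminate and recreate the pod rather than debugging the driver state in place.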
5 replies
RunPod
Created by feesta on 12/30/2024 in #⛅|pods
Error creating temporary lease
Error starting container. Happens repeatedly on this pod. (Host: q2jrr78mge01co)

2024-12-30T17:26:00Z create 60GB volume
2024-12-30T17:26:00Z create container runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04
2024-12-30T17:26:00Z error pulling image: Error response from daemon: error creating temporary lease: write /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db: no space left on device: unknown
2024-12-30T17:26:01Z start container for runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04: begin
2024-12-30T17:26:01Z error starting container: Error response from daemon: write /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db: no space left on device: unknown
3 replies
RunPod
Created by feesta on 11/17/2024 in #⛅|pods
"There are no longer any instances available with the requested specifications."
No description
3 replies
RunPod
Created by feesta on 11/11/2024 in #⛅|pods
Error when syncing with Backblaze
I'm getting "Something went wrong!" most of the time when syncing with Backblaze. It sometimes works, so it doesn't seem to be a credentials issue. There is no other info in the error popup.
3 replies
RunPod
Created by feesta on 10/15/2024 in #⛅|pods
Volume with no files is registered as having 23GB
Started a new pod and /workspace showed an unusual amount of data already used. Deleting everything from the volume still shows substantial usage. Seems like a bug in calculating storage.

root@5c5eadaefa32:/workspace# df -h
Filesystem                         Size  Used Avail Use% Mounted on
overlay                             10G   64M   10G   1% /
tmpfs                               64M     0   64M   0% /dev
/dev/sdb                            26G   23G  4.0G  85% /workspace
shm                                 17G     0   17G   0% /dev/shm
/dev/mapper/ubuntu--vg-ubuntu--lv   38G   21G   16G  58% /usr/bin/nvidia-smi
tmpfs                               95G     0   95G   0% /sys/fs/cgroup
tmpfs                               95G   12K   95G   1% /proc/driver/nvidia
tmpfs                               95G  4.0K   95G   1% /etc/nvidia/nvidia-application-profiles-rc.d
tmpfs                               95G     0   95G   0% /proc/asound
tmpfs                               95G     0   95G   0% /proc/acpi
tmpfs                               95G     0   95G   0% /proc/scsi
tmpfs                               95G     0   95G   0% /sys/firmware
root@5c5eadaefa32:/workspace# ls -al
total 0
drwxr-xr-x 2 root root  6 Oct 15 17:48 .
drwxr-xr-x 1 root root 90 Oct 15 17:41 ..
root@5c5eadaefa32:/workspace# du -sh .
0	.
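For completeness, this is the quick check I used to compare what the filesystem reports against what is actually reachable under /workspace. It's only a sketch; the scan for deleted-but-still-open files is a guess at where the space might be hiding, not a confirmed cause.

# disk_check.py - sketch: compare statvfs "used" against a walk of /workspace,
# then look for deleted-but-still-open files that could hold space invisibly.
# The /proc scan reflects my assumption about the cause, nothing confirmed.
import os

MOUNT = "/workspace"

st = os.statvfs(MOUNT)
fs_used = (st.f_blocks - st.f_bfree) * st.f_frsize

walk_used = 0
for root, _dirs, files in os.walk(MOUNT):
    for name in files:
        try:
            walk_used += os.lstat(os.path.join(root, name)).st_size
        except OSError:
            pass

print(f"filesystem reports used: {fs_used / 2**30:.1f} GiB")
print(f"files reachable by walk: {walk_used / 2**30:.1f} GiB")

# Deleted files kept open by a process still consume blocks until it exits.
for pid in filter(str.isdigit, os.listdir("/proc")):
    fd_dir = f"/proc/{pid}/fd"
    try:
        for fd in os.listdir(fd_dir):
            target = os.readlink(os.path.join(fd_dir, fd))
            if target.startswith(MOUNT) and "(deleted)" in target:
                print(f"pid {pid} holds deleted file: {target}")
    except OSError:
        pass

In my case the walk and the /proc scan both came back empty, which is why this looks like an accounting problem on the volume rather than anything inside the container.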
6 replies
RunPod
Created by feesta on 6/19/2024 in #⛅|pods
Cloud Sync False "Something went wrong" and secrets fail
When using Cloud Sync with Backblaze, I'm having 2 problems. First: if I use secrets, there is no feedback at all when I click "Copy from Backblaze B2". I have tried this repeatedly on different pods and with re-created secrets. I'm referencing the secrets like: {{ RUNPOD_SECRET_BB_app_id }}. I would expect at least an error that the request was rejected, or something, so I can fix the problem. The second problem is more minor: if I fill in the Backblaze fields directly and click Copy From, it pops up the error "Something went wrong". Canceling returns to the submission box, but if I escape that, I see the copy was actually started successfully. It seems the UI is falsely reporting an error even when the operation succeeded.
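As a sanity check on the secrets side, this is roughly what I run inside the pod to confirm the secret values resolve at all. It's a sketch built on my understanding that secrets are exposed in the pod as RUNPOD_SECRET_* environment variables; Cloud Sync may resolve them through a different path, and RUNPOD_SECRET_BB_app_key is just my guess at the companion secret's name.

# secret_check.py - sketch: confirm the Backblaze secrets resolve inside the pod.
# Assumes secrets appear as RUNPOD_SECRET_* environment variables (my assumption);
# the Cloud Sync UI might not read them this way at all.
import os

for name in ("RUNPOD_SECRET_BB_app_id", "RUNPOD_SECRET_BB_app_key"):
    value = os.environ.get(name)
    if value is None:
        print(f"{name}: NOT SET")
    else:
        # Never print the secret itself; the length is enough to prove it resolved.
        print(f"{name}: set ({len(value)} chars)")

Both show as set for me, which is why the silent failure when clicking "Copy from Backblaze B2" looks like a UI/feedback problem rather than a missing secret.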
7 replies