DreamGen
RunPod
Created by DreamGen on 5/24/2024 in #⛅|pods
Network issue ETA?
Several of my pods got hit with this notice: "This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime." This includes e.g. 82mr3meakiiytt. Do you have an ETA for the fix? They are still not back up.
5 replies
RunPod
Created by DreamGen on 5/19/2024 in #⛅|pods
Feature Request: `runpodctl send` TO specific machine & folder (ala SCP)
This can be achieved today by running:
runpodctl send foo
ssh machine 'cd /workspace && runpodctl receive ...'
But it would be great to just be able to do:
runpodctl send foo --dest machine:/workspace
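Until something like `--dest` exists, the second step of the workaround can be wrapped in a small helper. This is only a sketch: `build_remote_receive` is a hypothetical name, and it assumes the one-time code printed by `runpodctl send` is copied in by hand and that `runpodctl receive` accepts it as its argument.

```python
import shlex


def build_remote_receive(host: str, dest_dir: str, code: str) -> list[str]:
    """Build an ssh argv that runs `runpodctl receive <code>` inside dest_dir.

    Hypothetical helper: `code` is assumed to be the one-time code that
    `runpodctl send` prints on the sending machine.
    """
    # Quote both values so spaces or shell metacharacters survive the remote shell.
    remote = f"cd {shlex.quote(dest_dir)} && runpodctl receive {shlex.quote(code)}"
    return ["ssh", host, remote]


cmd = build_remote_receive("machine", "/workspace", "example-code")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to actually execute it
```

The helper only builds the command, so it can be inspected before anything runs on the pod.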
17 replies
RunPod
Created by DreamGen on 4/20/2024 in #⛅|pods
4xH100 pod is stuck -- can't restart or stop
No description
6 replies
RunPod
Created by DreamGen on 4/17/2024 in #⛅|pods
A6000 price change based on # GPUS?
Steps to reproduce:
1. Go to community cloud
2. Select A6000 (price 0.69/hr)
3. Change count to 2 (price 1.58/hr -- which is 0.79/hr per GPU!)
4. Change count back to 1 (price stays 0.79/hr)

So two questions:
1. Since when do you increase price when you rent 2 GPUs?
2. Why does the price stay 0.79/hr after reducing count from 2 to 1?
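For reference, the arithmetic behind the numbers above, as a quick sketch. The variable names are mine; the prices are the ones quoted in the steps.

```python
# Prices quoted in the steps above, in USD per hour.
single_gpu_rate = 0.69   # A6000 with count = 1, before changing the count
two_gpu_total = 1.58     # total shown with count = 2

per_gpu_rate = two_gpu_total / 2                  # effective per-GPU rate at count = 2
increase = per_gpu_rate - single_gpu_rate         # extra cost per GPU per hour
increase_pct = increase / single_gpu_rate * 100   # relative increase in percent

print(f"per-GPU rate at count 2: {per_gpu_rate:.2f}/hr")
print(f"increase per GPU: {increase:.2f}/hr ({increase_pct:.1f}%)")
```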
3 replies
RunPod
Created by DreamGen on 3/16/2024 in #⛅|pods
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda
This is a recurring problem on RunPod. This time with a 3090 -- tried 3 different pods in the CA region (can't use the US region because it has maintenance soon...). ID: wmwxn9onlckqus
root@fd08183704a5:~# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
root@fd08183704a5:~# nvidia-smi
Sat Mar 16 07:26:26 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.1 |
|-------------------------------+----------------------+----------------------+
root@fd08183704a5:~# python
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.1.1+cu121'
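The checks above can be gathered into one script so the same report can be pasted for each affected pod. This is a sketch, not an official diagnostic: `cuda_report` is a hypothetical helper name, and it assumes `nvidia-smi` is on `PATH` and PyTorch is installed (both are skipped if missing).

```python
import shutil
import subprocess


def cuda_report() -> dict:
    """Collect driver and PyTorch CUDA facts without raising on a broken pod."""
    report = {"nvidia_smi": None, "torch": None,
              "cuda_available": None, "device_count": None}

    # Ask the driver which GPUs it sees, if nvidia-smi is installed.
    smi = shutil.which("nvidia-smi")
    if smi:
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=30,
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi"] = f"failed: {exc}"

    # Ask PyTorch whether it can initialize CUDA, if torch is importable.
    try:
        import torch
    except ImportError:
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi` lists the GPU but `cuda_available` is `False`, the driver sees the card while PyTorch cannot initialize it, which is the situation described in this post.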
5 replies
RunPod
Created by DreamGen on 2/25/2024 in #⛅|pods
Broken CUDA / PyTorch on H100
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
I tried reinstalling PyTorch, but it did not help.
26 replies
RunPod
Created by DreamGen on 2/15/2024 in #⛅|pods
Reserving pods on different machines
Hey there, 4 of my long-running pods have scheduled maintenance at the same time. I would like to spin up new pods before then to cover for that, but how can I make sure, before starting them, that the new pods won't land on the same machine and undergo the same maintenance?
3 replies
RunPod
Created by DreamGen on 2/4/2024 in #⛅|pods
Any recent firewall changes?
Were there any recent firewall changes in the last few days? Seeing `urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))` when interacting with HF hub. Replicated by other people as well, on different machines.
4 replies