Cannot Install JAX
Hello, I am currently unable to properly install JAX on both the A100 SXM 80GB and the H100 80GB SXM5 in the Secure Cloud. When I run the command
pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
I get the following error (partially shown) that there are dependency conflicts with torch:...
Stable Diffusion Stopped Working After a Restart
Stable Diffusion was running fine until it stopped (unfortunately I get no notification to put funds back on the account). But after restarting it there was an odd error. I made no changes that I know of. I am a noob and I didn't even set this up. How do I find help in getting this back up and running?
Losing all important data in /workspace folder while pod is running :(
I am running a pod with an A100 SXM 80G and an attached workspace. Suddenly I lost all content in the /workspace folder while the connection stayed up. I have very important data in the workspace that I am working on. Could any staff member kindly look into how this happened?
Installing Bittensor?
Hello. I am trying to install Bittensor 6.9.3, but it seems like it is not working at the moment. Is anyone else having the same issue?
Connectivity issue on 4090 pod
Hello Runpod,
I've been unable to access a stopped 4090 pod for quite some time now (approx 10-12 hrs). The pod ID is 24kw7y5uu2yuil, in the IS datacenter. During this time, the attached notice about a network outage has been displayed for that pod, and the process to launch the pod gets stuck at "Waiting for logs", as in the second attached image. This happens when trying to launch with any number of the pod's GPUs (0-8 inclusive). I do not need to use this pod's GPUs but I do have some important data I need to transfer from it. I've been waiting to post something about this since I've been assuming the network issue is transient, but as it's been happening since before I went to bed last night, I figured I would reach out to see if there's any way I can get the data off of this pod.
Thanks!...
P2P is disabled between NVLINK connected GPUs 1 and 0
Hey team! Could you fix the NVLink issue for H100 SXM Community pods? I encounter this error frequently. Corrupted pod ID: 4a5acwxj2kene6
P2P is disabled between NVLINK connected GPUs 1 and 0. This should not be the case given their connectivity, and is probably due to a hardware issue. If you still want to proceed, you can set NCCL_IGNORE_DISABLED_P2P=1.
I can proceed with the NCCL_IGNORE_DISABLED_P2P flag, but this drops performance by ~10%...
Solution:
@storuky2306 so, got a response, and apparently GPU 5 does not support P2P.
What we can advise for now is to pick a different machine...
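For anyone who stays on the affected machine and takes the workaround quoted in the error message, here is a minimal Python sketch of setting that variable from inside a training script. Only the variable name NCCL_IGNORE_DISABLED_P2P comes from the thread; the helper name `allow_disabled_p2p` is made up for illustration, and the sketch assumes the variable has to be in the environment before NCCL is first initialized.

```python
import os


def allow_disabled_p2p(env=None):
    """Set the override named in the NCCL error message.

    Returns the previous value (or None) so the caller can restore it.
    `env` defaults to the real process environment; a plain dict can be
    passed instead when experimenting.
    """
    if env is None:
        env = os.environ
    previous = env.get("NCCL_IGNORE_DISABLED_P2P")
    env["NCCL_IGNORE_DISABLED_P2P"] = "1"
    return previous


# Assumption: call this before the first NCCL communicator is created,
# for example before torch.distributed.init_process_group(...).
allow_disabled_p2p()
```

As the original poster notes, this trades the error for a performance drop of roughly 10%, so it is a stopgap rather than a fix.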
Pod with different IPS?
Hi team, how do I create 2 pods with different IP addresses? I have 2 right now, but they have the same IP address, which does not meet my use case.
Solution:
You can't get different IPs, just different ports.
No GPU Available
Hello, I have been using the 'runpodctl' workflow with the dev and deploy commands, and I have one network volume attached in the EU-RO-1 zone. Currently, whenever I run 'runpodctl project dev', the CLI says that there are no GPUs available. Is that really the case, with no GPUs left at all? Here is the relevant part of my 'runpod.toml' file:
base_image = "runpod/base:0.6.1-cuda12.2.0"
gpu_types = [
"NVIDIA GeForce RTX 4080", # 16GB...
Find Config of Deleted Pod
Is there any way to view the config of a pod that was already deleted?
I see the pod creation listed in audit logs, but I don't see any way to view the configuration that I used when creating the pod.
Of course the pod volume, data, logs, etc. will be permanently deleted, but it would be nice if the initial config was logged in the audit logs too, so a similar pod could easily be created in future....
Solution:
You can't get the config for a deleted pod. Best is to use templates.
torch.cuda.is_available() is False
Spinning up several H100s (burning money 😅) and no matter which official docker image I use, torch.cuda.is_available() is always False, which prevents me from actually using these GPUs.
I've tried the following docker images:
pytorch:2.2.1-py3.10-cuda12.1.1-devel-ubuntu22.04...
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda
This is a recurring problem on RunPod.
This time with 3090 -- tried 3 different pods in CA region (can't use US region because it has maintenance soon...).
ID: wmwxn9onlckqus
...
Solution:
You need to use the CUDA filter to select a machine whose CUDA version matches your Docker image. The machine can have a higher CUDA version than your Docker image, but not a lower one: CUDA is backwards compatible but not forwards compatible.
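The rule in this answer can be sketched as a small version check. This is only an illustration of the comparison, not a RunPod or NVIDIA tool: the function names are hypothetical, the image version is read off the image tag (for example cuda12.1.1), and the host version is assumed to be the "CUDA Version" that nvidia-smi prints in its header.

```python
def parse_cuda_version(text):
    """Turn '12.1.1' or '12.2' into a comparable (major, minor) tuple."""
    parts = text.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return (major, minor)


def image_can_run_on_host(image_cuda, host_cuda):
    """True when the host's CUDA version is at least the image's version."""
    return parse_cuda_version(host_cuda) >= parse_cuda_version(image_cuda)


# Image built for CUDA 12.1.1, as in the tag quoted in the thread.
print(image_can_run_on_host("12.1.1", "12.2"))  # True: host is newer
print(image_can_run_on_host("12.1.1", "11.8"))  # False: host is older
```

If the check comes back False for a given machine, that is the situation where torch.cuda.is_available() would be expected to return False, and filtering for a higher CUDA version when picking the pod is the fix described above.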
Latest version of Automatic1111 in 'RunPod Automatic1111 Stable Diffusion Template'
Is there a way to update to the latest version of Automatic1111 in 'RunPod Automatic1111 Stable Diffusion Template'?
Solution:
There is, but it takes too much effort; it's easier to just use a community template.
How to stop a Network Disk
I set up a pod as a network disk so I could pause it and connect to a different server if needed (as opposed to the standard pods that can only connect to the same server and usually can't), but I'm unsure how to stop it. I don't see the stop button in the UI. I've discovered I cannot connect to the Web Terminal due to a bug, so uploading models and input files has broken my workflow and I want to save the process I have so far before my $ runs out.
Solution:
You can attach the network storage to a new pod.
Pod Downsized, with Pictures
I have been having issues with the A6000s: my pods keep getting downsized. I wasn't able to catch a snip the first couple of times, but I did here. As you can see, I am renting 2 A6000s, but when I try to start the pod I am only renting 1. What is happening here? This keeps on happening, so I have to terminate and start new pods repeatedly; my entire audit log is me starting new A6000s, and I keep losing money trying to get the service to give me what I bought. (ID:wp0ofb7xs94uf0)
Solution:
Hi @MushyPotato - when you rent a pod on spot and it gets stopped (really, when a pod gets stopped for any reason) the GPUs in that machine are made available for other customers, and by the time you can start it again there may be fewer GPUs available.
Setting up a network volume will allow you to deploy to any GPU within that data center, so you will not be limited by whatever the GPU rental status in your specific machine is....
I'm pretty sure I've been getting pods where "/" lives on a network disk
Which makes data reading impossible and the pod useless.
I have a little 'see how fast you can read the database' script: It does 6000 iterations/second in my home machine, reading from an SSD. It gets something like 600 iterations/second in "normal" pods, reading my data from /....
Question about graphql API
In https://doc.runpod.io/recipes/view-gpu-types-info
If securePrice is zero does that mean that the resource is not available?...
Solution:
@ChD That's correct. Likely the A100 SXMs were all in use at the time you ran the query; globally, out of all GPU specs, they're the most likely to go completely unavailable at the moment.
There are some now and it's showing the price for me: "id":"NVIDIA A100-SXM4-80GB","securePrice":2.29}...
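Going by this answer, a zero securePrice can be read as "nothing available in Secure Cloud right now". A minimal sketch of applying that reading to a list of GPU type records follows. The "id" and "securePrice" fields and the A100 entry come from the thread; the second record and the function name are made up for illustration, and the sketch assumes the query result has already been parsed into Python dicts.

```python
def split_by_secure_availability(gpu_types):
    """Split GPU type ids into (available, unavailable) by securePrice.

    A missing or zero securePrice is treated as unavailable, following
    the answer in this thread.
    """
    available, unavailable = [], []
    for gpu in gpu_types:
        if gpu.get("securePrice"):
            available.append(gpu["id"])
        else:
            unavailable.append(gpu["id"])
    return available, unavailable


sample = [
    {"id": "NVIDIA A100-SXM4-80GB", "securePrice": 2.29},  # from the thread
    {"id": "EXAMPLE-GPU", "securePrice": 0},  # hypothetical record
]
available, unavailable = split_by_secure_availability(sample)
print(available)    # ['NVIDIA A100-SXM4-80GB']
print(unavailable)  # ['EXAMPLE-GPU']
```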
Create new pod with runpodctl
I'm trying to create a pod with runpodctl. It appears by reading the --help that I cannot create a pod using network storage for /workspace? I didn't find the correct option to pass. Maybe with --args ?
Bonus point: how can I create a pod with specific requirements? E.g., start a pod with 48 GB of VRAM that costs less than $1/hr. It could start a pod with 2xA5000 or 1xA6000 depending on available resources....
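The selection logic asked about here could be sketched client-side as below. This is not a runpodctl feature: the function name, the record fields, and the prices are placeholders invented for illustration (only the 24 GB and 48 GB VRAM sizes of the A5000 and A6000 are real), and real prices and availability would have to come from the API. The result would then be passed to whatever pod-creation command is used.

```python
def cheapest_config(offers, min_vram_gb, max_price_per_hr, max_gpus=8):
    """Return (name, gpu_count, total_price) for the cheapest match, or None.

    For each GPU type, takes the smallest count whose combined VRAM reaches
    min_vram_gb, and keeps it only if the total costs strictly less than
    max_price_per_hr.
    """
    best = None
    for offer in offers:
        count = -(-min_vram_gb // offer["vram_gb"])  # ceiling division
        if count > max_gpus:
            continue
        total = count * offer["price_per_hr"]
        if total >= max_price_per_hr:
            continue
        if best is None or total < best[2]:
            best = (offer["name"], count, total)
    return best


# Placeholder prices, NOT real RunPod pricing.
example_offers = [
    {"name": "A5000", "vram_gb": 24, "price_per_hr": 0.40},
    {"name": "A6000", "vram_gb": 48, "price_per_hr": 0.75},
]
print(cheapest_config(example_offers, min_vram_gb=48, max_price_per_hr=1.0))
```

With these placeholder numbers the single A6000 wins; if it were unavailable and dropped from the list, the same call would fall back to 2xA5000.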