Pods shutting down
Is it normal behaviour for a GPU cloud pod that is paid to be on 24/7 to require a cold boot every time it hasn't been used for a while? We have been paying for a GPU to be on all the time so that it responds quickly when we demo our software, but it is always slow because the pod has to boot up.
Connections abort unexpectedly
We are running a gRPC server inside RunPod, and 1-2% of requests abort unexpectedly. Our API's logs complain that the downstream disconnected, and I suspect RunPod's NAT aborts connections in certain situations. Is there a connection timeout or other policy for TCP connections, or is this just instability in RunPod's infrastructure?
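One way to test the idle-timeout theory is to enable HTTP/2 keepalive pings on the client channel and see whether the abort rate changes. The sketch below only builds the option list. The option names are standard gRPC channel arguments, but the interval and timeout values are illustrative assumptions, not documented RunPod limits, and the target address in the usage comment is hypothetical.

```python
def keepalive_channel_options(ping_interval_s=30, ping_timeout_s=10):
    """Return gRPC channel options that keep an idle connection active.

    The numeric defaults are assumptions chosen for illustration only.
    """
    return [
        # Send a keepalive ping after this much time without activity.
        ("grpc.keepalive_time_ms", ping_interval_s * 1000),
        # Consider the connection dead if the ping is not acknowledged in time.
        ("grpc.keepalive_timeout_ms", ping_timeout_s * 1000),
        # Allow pings even when no RPC is in flight.
        ("grpc.keepalive_permit_without_calls", 1),
        # Do not cap the number of pings sent without data frames.
        ("grpc.http2.max_pings_without_data", 0),
    ]


if __name__ == "__main__":
    # Usage (needs the grpcio package; "pod-host:50051" is a hypothetical address):
    #   import grpc
    #   channel = grpc.insecure_channel(
    #       "pod-host:50051", options=keepalive_channel_options())
    for name, value in keepalive_channel_options():
        print(name, value)
```

Note that the server's own keepalive enforcement has to permit this ping rate, otherwise it may close the connection itself. If the aborts persist with keepalive enabled, an idle timeout is probably not the cause.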
Downloading file/directory from remote to local using SCP
Hi, when trying to download from the remote I get a password request. Is there a workaround?
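A password prompt often means that key authentication was not offered or not accepted, so one thing to try is passing the private key and the mapped port explicitly. A minimal sketch that builds such an `scp` command follows; the host, port, paths, and key location are all hypothetical placeholders, and it assumes your public key is already registered with the pod.

```python
import os
import shlex


def build_scp_download(host, port, remote_path, local_path, key_path,
                       user="root", recursive=False):
    """Build an scp command that downloads remote_path using an explicit key."""
    cmd = ["scp", "-P", str(port), "-i", os.path.expanduser(key_path)]
    if recursive:
        cmd.append("-r")  # needed when remote_path is a directory
    cmd.append(f"{user}@{host}:{remote_path}")
    cmd.append(local_path)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders; take the real ones from your pod's
    # connection details.
    command = build_scp_download(
        host="203.0.113.10", port=10022,
        remote_path="/workspace/output", local_path=".",
        key_path="~/.ssh/id_ed25519", recursive=True,
    )
    print(shlex.join(command))
    # To actually run it: subprocess.run(command, check=True)
```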
POD's ERRORS :((((((
This server has recently suffered a network outage and may have spotty network connectivity. We aim to restore connectivity soon, but you may have connection issues until it is resolved. You will not be charged during any network downtime.
MY IDs:
g0htfaz7oe0lht
brr2em0266otas...
Nvidia driver version
Where can I see what driver versions pods use? Is it the same for all GPU types?
I get this error even when selecting CUDA 12.3:
ERROR: This container was built for NVIDIA Driver Release 545.23 or later, but version 535.154.05 was detected and compatibility mode is UNAVAILABLE....
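To check a pod before launching a large image, you could compare the host driver version against the minimum the container asks for. The sketch below uses the two versions from the error message above. The helper names are made up for illustration, and `query_driver_version` assumes `nvidia-smi` is on the PATH inside the pod.

```python
import subprocess


def parse_version(text):
    """Turn '535.154.05' into (535, 154, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_satisfies(detected, required):
    """Return True if the detected driver is at least the required release."""
    return parse_version(detected) >= parse_version(required)


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    return output.splitlines()[0].strip()


if __name__ == "__main__":
    # Values taken from the error message in this thread.
    print(driver_satisfies("535.154.05", "545.23"))  # False: 535 < 545
```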
Profiling CUDA kernels in RunPod
Hi! I'm trying to profile my kernel with nsight-compute and I'm getting this error: "==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0."
It is explained on this page: https://developer.nvidia.com/nvidia-development-tools-solutions-err_nvgpuctrperm-permission-issue-performance-counters
and has to be fixed on the host side. Has anybody found a workaround for this issue or a way to solve it? Thanks!...
Inconsistency with volumes
We have an issue where, when we start up a container/pod, we run a script that should exist inside a volume (the container volume). The inconsistency is that sometimes the volume does not seem to be attached yet, so the build fails. Why might this be? Also, I'd really appreciate it if the messages in my purple circle could be attended to, as there's a bunch of context in there.
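If the volume really is attached slightly after the container starts, a defensive workaround is to poll for the script before running it. This does not explain the underlying cause. It is a minimal sketch, and the script path and timeout values are hypothetical.

```python
import os
import time


def wait_for_path(path, timeout_s=60.0, poll_s=1.0):
    """Poll until path exists; return True if found, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    script = "/workspace/startup.sh"  # hypothetical path inside the volume
    if wait_for_path(script, timeout_s=5.0, poll_s=0.5):
        print("volume is ready, safe to run", script)
    else:
        print("gave up waiting for", script)
```

Logging how long the wait actually takes would also give support something concrete to look at.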
No availability issue
When renting some instances, the main screen says 'High availability' or similar, yet there is none once you actually set a good storage amount (e.g. 100GB). Why does a 2xH100 not have more than 34GB of storage available? There should definitely be a requirement for this if you host on community cloud, because that's nuts.
L40 and shared storage
For my workloads I want to use an L40, but I also need shared storage. Do I understand correctly that this is not possible right now, because storage is only available in data centers that do not offer the L40? Is there a roadmap for if and when this may change?
Run container only once
Hi everyone,
I want to run a container for a single life-cycle only (i.e. my container is designed to terminate after it is finished, it doesn't run forever).
However, the container will constantly restart on termination, and there doesn't seem to be a config option to stop this (I have to manually terminate/delete the instance).
Is there a way to specify this desired behavior?...
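One possible workaround is to have the container stop its own pod once the job is done. The sketch below assumes that `runpodctl` is available inside the pod and that the pod's ID is exposed in the `RUNPOD_POD_ID` environment variable. Both are assumptions to verify in your own pod, and the function names are made up for illustration.

```python
import os
import subprocess


def build_stop_command(pod_id, terminate=False):
    """Build the runpodctl call: 'stop' keeps the pod, 'remove' deletes it."""
    action = "remove" if terminate else "stop"
    return ["runpodctl", action, "pod", pod_id]


def run_once(job, runner=subprocess.run, env=os.environ):
    """Run job() a single time, then ask the platform to stop this pod."""
    try:
        job()
    finally:
        pod_id = env.get("RUNPOD_POD_ID")  # assumed to be set by the platform
        if pod_id:
            runner(build_stop_command(pod_id), check=False)


if __name__ == "__main__":
    # Outside a pod the variable is unset, so this only prints the command.
    print(build_stop_command("example-pod-id"))
```

Wrapping the real workload in `run_once` means the stop request is sent even if the job raises, so the pod does not keep restarting the container.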
Clone a RunPod Network Volume
Hi! Is there some way to clone a Network Volume in the RunPod interface, or is this something I have to do via some scp magic? Thanks in advance!
Insufficient Permissions for Nvidia Multi-Instance GPU (MIG)
I was planning to test some new Nvidia GPU features using a pod with Nvidia A100 80G.
I tried
nvidia-smi -mig 1
as a root user but I got the output below
```...
Automatic1111 - Thread creation failed: Resource temporarily unavailable
Hello, we have started getting this error more often. Previously we got it from time to time, and after restarting the SD web UI everything worked fine, but now it throws this error every 10 minutes and kills the web UI server.
libgomp: Thread creation failed: Resource temporarily unavailable
Do we know what this error means exactly? I would really appreciate it if someone could help.
...
How can I view logs remotely?
Hi! I am trying to view the logs of a training build I am running, but the output seems to stop here. The container is still up and running and there is activity on the GPU. How can I view the logs outside of the web interface?
Solution:
You can connect to the container over SSH and view the full log files (they are in /var/log).
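Once you are connected over SSH, a small helper like the one below prints only the last lines of a large log file. It is a sketch: the helper name is made up, and which file under /var/log holds your training output is something to check with `ls -lt /var/log` first.

```python
import sys
from collections import deque


def tail_lines(stream, count=50):
    """Return the last `count` lines of a text stream, without newlines."""
    # deque with maxlen keeps only the newest lines while reading once.
    return [line.rstrip("\n") for line in deque(stream, maxlen=count)]


if __name__ == "__main__":
    # Usage: python tail_log.py /var/log/<your-log-file>
    if len(sys.argv) > 1:
        with open(sys.argv[1], "r", encoding="utf-8", errors="replace") as handle:
            for line in tail_lines(handle, count=100):
                print(line)
```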
Change the GPU pod type without recreating
Is there an option available whereby, if the previous GPU becomes unavailable, I can select another GPU type without recreating it? I'm confused because I need to ensure that my pod ID remains fixed.
L40S "no resources available"
Hello, every time I try to choose an L40S I keep getting a "no resources available" message. There are many L40S GPUs available in community cloud, but I can't choose them, no matter how little disk space I assign.
Hi RunPod team, is the AttributeError Gradio issue resolved?
Is that issue resolved? I have not been able to run TheBloke's text gen WebUI for the past day, not to mention that I am still getting charged even though I cannot access it. Please let me know if this issue is resolved or when it will be resolved.
Solution:
I figured it out. You run "pip install --upgrade gradio" in the web terminal to fix it.
Permission problems with ooba and text webui containers
I've been having a few permission issues on new installs of both ooba and text webui.
TCP port external mapping keeps changing every time pod restarts.
I’m setting up my remote development environment using SSH and SFTP. Everything works fine. However, every time the pod restarts, the external mapping for port 22 changes, which leaves my local IDE unable to connect because the remote is now on a different port. Is there a way to fix this? Or is there a workaround if it’s by design? Thanks in advance.
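If the mapping cannot be pinned, one workaround is to point the IDE at a fixed `Host` alias in `~/.ssh/config` and rewrite only that alias's `Port` line after each restart. A minimal sketch follows. The alias name, address, and port numbers are hypothetical, and it assumes the config uses the usual whitespace-separated `Keyword value` form.

```python
def set_ssh_port(config_text, host_alias, new_port):
    """Return config_text with the Port of `Host host_alias` set to new_port."""
    out = []
    in_block = False
    for line in config_text.splitlines():
        words = line.split()
        keyword = words[0].lower() if words else ""
        if keyword == "host":
            # A new Host block starts; check whether it is the one we manage.
            in_block = host_alias in words[1:]
            out.append(line)
            if in_block:
                out.append(f"    Port {new_port}")
            continue
        if in_block and keyword == "port":
            continue  # drop the stale port; the new one was written above
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = (
        "Host runpod-dev\n"
        "    HostName 203.0.113.10\n"
        "    User root\n"
        "    Port 10022\n"
    )
    # 43210 stands in for the new external port shown after a restart.
    print(set_ssh_port(example, "runpod-dev", 43210), end="")
```

The IDE keeps connecting to `runpod-dev`, so only the config file changes between restarts.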