jherrm
RunPod
Created by jherrm on 4/10/2024 in #⛅|pods
"We have detected a critical error on this machine which may affect some pods." Can't backup data
Update: I was able to get the files! Here's what I did:
1. Kept trying to start the pod, but with the base CPU Docker image and 0 GPUs so that I wouldn't be burning money.
2. Eventually the pod got past the "create pod network" stage and started up successfully (this took at least 20 min, but I only noticed an hour later).
3. Since SSH was still not working, I used the web terminal and was able to get in.
4. BUT the pod was still in a super weird state, which didn't let me install any packages like gcloud to transfer the files to a bucket.
5. So I started a Python web server on port 8888 (python3 -m http.server 8888).
6. Zipped up my files and downloaded them via a web browser.
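For anyone in the same spot, steps 5 and 6 can be done with nothing but the Python standard library, which helps when package installs are failing. This is a rough sketch rather than exactly what I ran: the helper names `make_archive` and `make_server` are made up for illustration, and the `/workspace` path in the note below is an assumption about where your files live.

```python
import shutil
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


def make_archive(src_dir, out_dir, name="backup"):
    """Zip everything under src_dir into out_dir/<name>.zip and return its path.

    out_dir should be outside src_dir so the archive does not try to
    include itself while it is being written.
    """
    base = Path(out_dir) / name
    return Path(shutil.make_archive(str(base), "zip", root_dir=src_dir))


def make_server(serve_dir, port=8888):
    """Build an HTTP server that serves files from serve_dir on the given port."""
    handler = partial(SimpleHTTPRequestHandler, directory=str(serve_dir))
    # Bind to all interfaces so the pod's exposed HTTP port can reach it.
    return ThreadingHTTPServer(("0.0.0.0", port), handler)
```

To use it, something like `archive = make_archive("/workspace", tempfile.mkdtemp())` followed by `make_server(archive.parent).serve_forever()` should work, after which `backup.zip` can be downloaded in a browser through the pod's exposed port 8888. Keep in mind the server has no authentication, so stop it as soon as the download finishes.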
3 replies
RunPod
Created by jherrm on 3/16/2024 in #⛅|pods
torch.cuda.is_available() is False
Thanks @Papa Madiator for the quick response. I just spun up the instance again and ran your tool (no other runtime changes were made and no other programs were run). Here are the contents of the gpu_diagnostics.json file
8 replies
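When torch.cuda.is_available() comes back False, a few quick checks can help narrow down whether the problem is the PyTorch build or the machine. The sketch below is not the diagnostic tool mentioned in the thread and does not reproduce gpu_diagnostics.json; the function name `cuda_report` and the dictionary keys are made up for illustration.

```python
import shutil


def cuda_report():
    """Collect basic facts that help explain why CUDA is unavailable."""
    # If nvidia-smi is not on PATH, the driver tools are likely missing
    # or not exposed inside the container.
    report = {"nvidia_smi_found": shutil.which("nvidia-smi") is not None}
    try:
        import torch
    except ImportError:
        report["torch_installed"] = False
        return report
    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report
```

If `torch_cuda_build` is `None`, the installed PyTorch is a CPU-only build. If it is set but `device_count` is 0, the GPU is probably not visible to the container, which points at the host or driver rather than at PyTorch.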