"We have detected a critical error on this machine which may affect some pods." Can't backup data
During a training run on 8xH100, I started seeing strange "Directory not found" errors in my Jupyter notebook that could not be dismissed (they kept popping up). Although my training run continued and completed, I wasn't able to copy the data off the volume disk because the modals blocked every operation.
I looked into the deployment and saw the error "We have detected a critical error on this machine which may affect some pods. We are looking into the root cause and apologize for any inconvenience. We would recommend backing up your data and creating a new pod in the meantime."
Unfortunately, nothing I've tried to get my data works: reconnecting to the notebook, Web Terminal, SSH (both options), and even stopping and starting the pod all fail.
When trying to start the pod again, it stalls on "create pod network".
How do I get my data!?
@jherrm send me pod id
update - I was able to get the files! Here's what I did:
1. kept on trying to start up the pod, but with the base CPU docker image and 0 GPUs so that I wouldn't be burning money
2. eventually the pod got past the "create pod network" stage and started up successfully (this took at least 20 min but I only noticed an hour later)
3. since SSH was still not working, I used the web terminal and was able to get in
4. BUT, the pod was still in a super weird state which didn't let me install any packages like gcloud to transfer the files to a bucket.
5. So I started a Python web server on port 8888 (python3 -m http.server 8888)
6. zipped up my files and downloaded via a web browser
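For anyone stuck in the same spot, steps 5 and 6 can be sketched with only the Python standard library, which helps when the pod won't let you install anything. This is a rough sketch, not exactly what I ran: the function names zip_directory and serve_directory are made up for illustration, and it assumes Python 3.7+ is available on the pod and that the port you pick is one the pod actually exposes.

```python
import functools
import http.server
import shutil
from pathlib import Path


def zip_directory(src_dir: str, archive_base: str) -> Path:
    """Zip the contents of src_dir into <archive_base>.zip (stdlib only)."""
    # shutil.make_archive appends ".zip" itself and returns the archive path.
    # Keep archive_base outside src_dir so the archive does not include itself.
    return Path(shutil.make_archive(archive_base, "zip", root_dir=src_dir))


def serve_directory(directory: str, port: int = 8888) -> None:
    """Serve `directory` over HTTP, much like `python3 -m http.server <port>`."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Binds to all interfaces; blocks until interrupted with Ctrl+C.
    with http.server.ThreadingHTTPServer(("", port), handler) as httpd:
        httpd.serve_forever()
```

Usage would be something like zip_directory("/workspace/outputs", "/workspace/backup") followed by serve_directory("/workspace", 8888), where both paths are hypothetical and depend on where your volume is mounted. The server has no authentication, so anyone who can reach the port can download the files; stop it as soon as the download finishes.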