Network Volume Integrity
Since last night, on every pod I deploy on my network volume fpomddpaq0, there are certain files that I cannot open (I believe they have been corrupted). I get a 'launcher error 524' (timeout) when I try to open these specific files (.ipynb). I have tried switching to the latest pytorch image, but that did not help. I have cross-checked with a fresh volume in the same region and the error does not occur there. I have now confirmed the issue from the web terminal: running the file command on those files times out, but it works on all the others.

I am writing this post because those files had a lot of code that I will now have to rewrite from bits and pieces, which is a big waste of time. I am quite annoyed at this and am reporting it to help prevent future incidents.

For some additional context, I was running a CPU-intensive training job and all of a sudden I was getting no response from the pod (there was a yellow exclamation warning on it on the pod deployments page). After waiting about an hour I terminated the pod, and when I tried to redeploy I couldn't (it was stuck on "waiting for logs"). So I slept on it, and when I woke up the corruption was there.
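For anyone who hits the same symptom, here is a minimal sketch of one way to list which notebooks hang on read, so the affected files can be named in a support ticket. It is an illustration, not an official tool: the helper names probe_file and scan, the 10-second timeout, and the *.ipynb pattern are all assumptions. Each read runs in a child process so that a stuck read cannot freeze the scan itself.

```python
import subprocess
import sys
from pathlib import Path

# Child-process snippet: read the whole file in 1 MiB chunks and discard the data.
_READ_SNIPPET = (
    "import sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    while f.read(1 << 20):\n"
    "        pass\n"
)


def probe_file(path, timeout_s=10.0):
    """Return 'ok', 'timeout', or 'error' for one file (hypothetical helper)."""
    child = subprocess.Popen(
        [sys.executable, "-c", _READ_SNIPPET, str(path)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = child.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process stuck in a blocked read may not exit
        # right away, so we deliberately do not wait for it here.
        child.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"


def scan(root, pattern="*.ipynb", timeout_s=10.0):
    """Probe every file under root matching pattern; map path -> status."""
    return {
        str(p): probe_file(p, timeout_s)
        for p in sorted(Path(root).rglob(pattern))
    }
```

To use it, call scan() with the directory where the volume is mounted (the mount path depends on your setup and is not given here) and print the entries whose status is not 'ok'. A 'timeout' result only shows that the read did not finish in time; it does not prove the file contents are damaged.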
@Nafi
Escalated To Zendesk
The thread has been escalated to Zendesk!