Daniel T.
RunPod
Created by Daniel T. on 2/29/2024 in #⛅|pods
We have detected a critical error on this machine which may affect some pods.
runpodctl send times out or fails when transferring large files. This is a relatively common problem for transfer tools that lack sufficient retries or that run over an unstable network. rsync and wget work because they retry robustly. Since I can't obtain a pod in the required geographical region, I'm giving up on transferring data between two connected pods and will pay the cloud provider fees instead.
27 replies
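The retry behaviour mentioned above can be sketched roughly as follows. This is a minimal illustration, not how runpodctl, rsync, or wget work internally: `run_with_retries` and `build_rsync_cmd` are hypothetical helper names, the attempt count and delays are arbitrary assumptions, and the paths in the usage comment are placeholders. The only real interfaces used are Python's `subprocess.run` and the standard rsync flags `-a`, `--partial`, and `--timeout`.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=5, base_delay=2.0,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with code 0 or the attempts run out.

    Returns the number of attempts used. The delay doubles after each
    failed attempt (exponential backoff). All defaults are assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {max_attempts} attempts: {cmd}")


def build_rsync_cmd(src, dest):
    """Build an rsync command that keeps partial files so a retry can resume."""
    return ["rsync", "-a", "--partial", "--timeout=60", src, dest]


# Hypothetical usage (placeholder paths, not taken from the thread):
# run_with_retries(build_rsync_cmd("/workspace/data/", "user@host:/workspace/data/"))
```

Keeping partially transferred files with `--partial` means a retry does not have to start a large file from scratch, which matters for multi-terabyte transfers.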
Trying to rent another machine to transfer the data over, but running into spending limits
Could you increase my spending limit?
Could you increase my spending limit when you have a chance?
Transferring between pods should work - ty
~3 TB. We have > 16 TB that we want to transfer once we fully start training.
Thank you for the update and for crediting the account. Right now it seems like two GPUs are in the error state. Do you have any clue regarding the timing for the tech to fix the issue? Would you recommend spinning up a new instance, or waiting for the issue to be fixed by the tech? For reference, it takes ~24 hours to transfer data and egress costs are substantial.
Lmk when you have an update
Got it. I've seen this happen with cables not being plugged in fully or a GPU getting overheated significantly, hopefully not RMA 🙏
Yes -- as mentioned above we have terabytes of data we paid to transfer to the machine on a persistent volume disk
We did not use network storage as we were unable to find any availability for H100s. Pod ID: brpigbe2fzkzrh