RunPod, 6d ago
const

H100 pod not connecting to network drive of the same region

I have a dual H100 pod that's supposed to be connected to a network drive (both on CA-MTL-1), but when I try to move data, run git status on a repo, or even start a Python script residing on the network drive, the terminal hangs. Seems like a network issue? I've tried spawning dual H100 pods multiple times, but I keep getting the same IP (probably the same hardware?), so nothing changes. Trying this from a machine with an RTX A5000 works fine! Is there something I can do?
7 Replies
riverfog7, 6d ago
Same problem on CA region A40 pods
const (OP), 6d ago
Dang, I've prepaid for that machine for a week, and it's currently in an unusable state (since I can't get anything off the network drive).
Dj, 6d ago
@const @riverfog7 Are you still seeing this issue? This should've been resolved in the time this thread has been open.
const (OP), 5d ago
i still am experiencing the issue
riverfog7, 5d ago
It works now
const (OP), 12h ago
It worked over the weekend (with some sporadic freezes), to the point where I had 3 machines (4 H100s total). Now they all seem to be stuck.
Advice #1 from RunPod CS:
I'm sorry to hear that you're experiencing issues with your dual H100 pods on CA-MTL-1. It's indeed unusual that the issue persists with A40 pods but not with A5000 GPUs.
From what you've described, it seems like the issue might be related to the network volume in the CA-MTL-1 region. Network volumes are generally slower for read/write operations compared to direct volumes. However, the extent of the slowdown you're experiencing is not normal.

One possible solution is to copy the data from the network volume into the container volume and then read/write to the model from the container volume. This workaround has helped other customers with similar issues, and I believe it could be effective here as well. Also, if you happen to have any timestamps of when you noticed the slow down, that would be greatly appreciated as well.
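For readers who want to try the suggested workaround, here is a minimal Python sketch, not anything RunPod provided. It copies a directory from the network volume to the container disk file by file and logs a UTC timestamp and duration for each file, which also yields the timestamps support asked for. The function name `copy_tree_with_log` and the example paths are hypothetical, and the assumption that the network volume is mounted at `/workspace` should be checked against your own pod. If the mount is fully hung, this script will block as well; the last line it printed then shows when the stall began.

```python
import shutil
import time
from datetime import datetime, timezone
from pathlib import Path


def copy_tree_with_log(src: Path, dst: Path) -> list:
    """Copy every file under src into dst, one file at a time.

    Returns a list of (utc_start, relative_path, size_bytes, seconds)
    tuples and prints each one as soon as the file has been copied.
    """
    records = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)

        started = datetime.now(timezone.utc).isoformat(timespec="seconds")
        t0 = time.monotonic()
        shutil.copy2(path, target)  # copies contents and metadata
        elapsed = time.monotonic() - t0

        size = target.stat().st_size
        records.append((started, rel.as_posix(), size, elapsed))
        # flush so the log survives even if the next file hangs
        print(f"{started}  {rel.as_posix()}  {size} B  {elapsed:.2f} s", flush=True)
    return records


# Hypothetical usage (both paths are assumptions, adjust to your pod):
# copy_tree_with_log(Path("/workspace/my-repo"), Path("/root/my-repo"))
```

Copying one file at a time is slower than a bulk copy, but it makes the point of a stall visible in the log.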
Advice #2 from RunPod CS:
Thank you for your patience as we work to get this issue resolved for you on your end. Currently, the pod and machine logs on our end are showing that they are in good standing.

At this time, have you been able to connect to a new pod? We believe that maybe clearing your cache or attempting a new browser might benefit you to start a pod up successfully.
I'm not entirely certain this advice is relevant to datacenter-level networking issues.
const (OP), 11h ago
Even when I try to scp a file directly, the download stalls.
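To check from inside a pod whether a path on the network drive responds at all, without freezing the terminal the way the commands described in this thread do, one option is to do the read in a child process and give up after a timeout. This is a rough sketch under stated assumptions: `probe_read` is a hypothetical helper, the 10-second default is arbitrary, and the test file path is whatever file you know exists on the volume.

```python
import subprocess
import sys


def probe_read(path: str, timeout_s: float = 10.0) -> str:
    """Try to read the first MiB of path in a child process.

    Returns "ok" if the read finished, "error" if the child failed
    (for example, the file does not exist), or "timeout" if it did
    not finish within timeout_s seconds.
    """
    child_code = "import sys; open(sys.argv[1], 'rb').read(1 << 20)"
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Ask the OS to kill the child but do not wait for it: a process
        # stuck in uninterruptible I/O may not exit until the mount recovers.
        proc.kill()
        return "timeout"
    return "ok" if returncode == 0 else "error"


# Hypothetical usage (the path is an assumption):
# print(probe_read("/workspace/some-file-on-the-network-volume"))
```

A "timeout" result on the network volume, combined with "ok" on the container disk, would support the suspicion that the storage path rather than the pod itself is at fault.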