RunPod•6d ago

H100 pod not connecting to network drive of the same region

I have a dual H100 pod that's supposed to be connected to a network drive (both on CA-MTL-1), but when I try to move data, run git status in a repo, or even start a python script residing on the network drive, the terminal hangs. Seems like a network issue? I've tried spawning dual H100 pods multiple times, but I'm getting the same IP (probably the same hardware?), so nothing changes. Trying this out from a machine with an RTX A5000 works fine! Is there something I can do?

7 Replies

riverfog7•6d ago

Same problem on CA region A40 pods

constOP•6d ago

Dang, I've prepaid for that machine for a week, and it's currently in an unusable state (since I can't get anything off of the network drive)

Dj•6d ago

@const @riverfog7 Are you still seeing this issue? This should've been resolved in the time this thread has been open.

constOP•5d ago

i still am experiencing the issue

riverfog7•5d ago

It works now

constOP•12h ago

worked over the weekend (with some sporadic freezes) to the point where I had 3 machines (4 H100 total), but now they all seem to be stuck

advice #1 from runpod cs

I'm sorry to hear that you're experiencing issues with your dual H100 pods on CA-MTL-1. It's indeed unusual that the issue persists with A40 pods but not with A5000 GPUs.
From what you've described, it seems like the issue might be related to the network volume in the CA-MTL-1 region. Network volumes are generally slower for read/write operations compared to direct volumes. However, the extent of the slowdown you're experiencing is not normal.
 
One possible solution is to copy the data from the network volume into the container volume and then read/write to the model from the container volume. This workaround has helped other customers with similar issues, and I believe it could be effective here as well. Also, if you happen to have any timestamps of when you noticed the slowdown, that would be greatly appreciated as well.
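
For anyone trying the copy workaround suggested above, here is a minimal Python sketch that copies a directory off the network volume onto the container disk and records a UTC timestamp and duration per file, which is the kind of timing info support asked for. The paths `/workspace/dataset` and `/root/dataset` are assumptions for illustration only; substitute wherever your network volume and container disk are actually mounted. Note that if the mount is hung at the filesystem level, the copy itself will block just like any other read.

```python
import shutil
import time
from datetime import datetime, timezone
from pathlib import Path


def copy_tree_with_log(src: Path, dst: Path) -> list[tuple[str, str, int, float]]:
    """Copy every file under src into dst.

    Returns one (utc_start_time, relative_path, size_bytes, seconds) tuple per file.
    """
    records = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        started = datetime.now(timezone.utc).isoformat(timespec="seconds")
        t0 = time.monotonic()
        shutil.copy2(path, target)  # blocks for as long as the read from src takes
        elapsed = time.monotonic() - t0
        records.append((started, str(rel), target.stat().st_size, elapsed))
    return records


if __name__ == "__main__":
    # Hypothetical paths: adjust to your own network volume and container disk.
    src = Path("/workspace/dataset")
    dst = Path("/root/dataset")
    if src.is_dir():
        for ts, rel, size, secs in copy_tree_with_log(src, dst):
            print(f"{ts}  {rel}  {size} bytes  {secs:.2f}s")
    else:
        print(f"{src} not found; edit the paths first")
```

Files that take far longer than their size would suggest are the ones worth reporting, together with the printed timestamps.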

advice #2 from runpod cs

Thank you for your patience as we work to get this issue resolved for you on your end. Currently, the pod and machine logs on our end are showing that they are in good standing.
 
At this time, have you been able to connect to a new pod? We believe that maybe clearing your cache or attempting a new browser might benefit you to start a pod up successfully.

i'm not entirely certain this information is relevant to datacenter level networking issues

constOP•11h ago

even trying to directly scp a file, download stalls
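
Since plain reads and scp both stall, one way to gather evidence without freezing the terminal is to do the read in a child process with a timeout. The sketch below is an assumption-laden diagnostic, not anything RunPod provides: `probe_read` is a hypothetical helper, and the 10 second timeout and 1 MiB read size are arbitrary defaults. It reports `ok`, `error`, or `timeout` for a given file, plus how long the attempt took.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone

# Code run in the child: open the file and read up to N bytes.
_CHILD_CODE = (
    "import sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    f.read(int(sys.argv[2]))\n"
)


def probe_read(path: str, timeout_s: float = 10.0, nbytes: int = 1 << 20) -> tuple[str, float]:
    """Try to read up to nbytes from path in a child process.

    Returns (status, seconds) where status is "ok", "error", or "timeout".
    """
    t0 = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-c", _CHILD_CODE, path, str(nbytes)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = proc.wait(timeout=timeout_s)
        status = "ok" if rc == 0 else "error"
    except subprocess.TimeoutExpired:
        # Best-effort kill; we do not wait for the child afterwards, because a
        # process stuck in uninterruptible I/O may never exit.
        proc.kill()
        status = "timeout"
    return status, time.monotonic() - t0


if __name__ == "__main__":
    # Pass one or more files on the network volume as arguments.
    for p in sys.argv[1:]:
        now = datetime.now(timezone.utc).isoformat(timespec="seconds")
        status, secs = probe_read(p)
        print(f"{now}  {p}  {status}  {secs:.2f}s")
```

Running it against the same file from a working pod (for example the RTX A5000 one) and from a stuck H100 pod gives timestamped output that can be pasted into a support ticket.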
