Extremely slow network storage
For a couple of weeks now, loading a 70B model from network storage in CA-MTL has taken more than 15 minutes (sometimes up to 20). That is paid GPU time spent waiting, and it's quite inconvenient too: downloading the model fresh from an internet server every time is actually quicker than loading it from network storage. Is the "network storage" actually on the same network as the GPU server, or is it in some cloud somewhere? Why is it so slow?
5 Replies
Hey, I'm hitting the same problem here at CA-MTL-1. Hope it gets fixed fast!
My pod ID: d4cfdt3tvtinod
Today my model training was delayed by the terrible model loading time, and then my kohya-ss training was killed halfway through by a log writing failure, three times. This never happened before, but today I couldn't complete a single training run.
I'm a loyal user of yours and really need this fixed soon!
Should I keep pod d4cfdt3tvtinod around so you can debug it?
Same problem in EU-SE-1
Also happens with new pods
No, I think you can just stop it.
As long as you can reference the ID, it's fine.
@yhlong00000
No, it's most likely on different servers, but interconnected within the same DC.
It's weird how I can download the same model from a VM in the Netherlands faster than from network storage in the same DC
Yeah, maybe the bandwidth for network storage is lower, or it's something about the architecture. I'm not sure, but staff can debug and give more information.
Guys, try opening a support ticket to report it too.
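If it helps with the ticket, here is a minimal sketch for putting a number on the read speed. It just reads a file sequentially and reports throughput. The function name measure_read_throughput and the 16 MiB chunk size are my own choices, not anything official, and you pass in the path to whatever model file sits on your network volume. Note that a second run on the same file may be served from the OS page cache and look much faster than the storage really is.

```python
import sys
import time


def measure_read_throughput(path, chunk_size=16 * 1024 * 1024):
    """Read a file sequentially; return (bytes_read, seconds, MiB_per_second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mib = total / (1024 * 1024)
    mib_per_s = mib / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mib_per_s


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python read_speed.py <path-to-file-on-network-volume>
    size, secs, rate = measure_read_throughput(sys.argv[1])
    print(f"read {size} bytes in {secs:.1f} s ({rate:.1f} MiB/s)")
```

Running the same script against a file on the pod's local disk gives a baseline to compare against, which should make the report easier for staff to act on.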