Template pytorch-1.13.1 lists cuda 11.7.1 version but is actually cuda 11.8?
I tried running a model that requires pytorch-1.13.1 and CUDA 11.7, but it said the CUDA version doesn't match (the pod is actually on 11.8). The mismatch check happens in the deepspeed package.
I tried starting up a new pod with the same template and did
nvcc --version
and it said the pod was on cuda version 11.8.
Is this normal or an error? I can't seem to run my model because of the CUDA version mismatch. For reference, I'm using an A40.
Most of our templates are built on CUDA 11.8 up to 12.2
Oh I see, thanks. So is the template name wrong then, since it incorrectly says it uses cuda 11.7?
hmm yeah sure seems like a misnaming but use whichever works for you
CUDA version mismatch? What does it look like?
This was the template:
In the template it mentions "cuda11.7.1", and pytorch 1.13.0 is built for cuda 11.6 or 11.7 so this makes sense
But the actual pod uses 11.8
If you check
nvcc --version
Other templates have the correct CUDA version installed (the one listed in the template/image id)
And the mismatch error happens due to a CUDA version check in what I believe is the deepspeed library. I can try to pull up the error if you would like to see it
Basically, I don't think pytorch 1.13.1 is meant to be run on CUDA 11.8
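For anyone who wants to confirm the same thing on their own pod, here is a rough sketch of the comparison in Python. It is not deepspeed's actual check, just an approximation: it parses the release number from `nvcc --version` and compares it with `torch.version.cuda`. The helper names (`parse_nvcc_release`, `versions_match`, and so on) are made up for this example, and the strict major.minor comparison is an assumption; the real check may be more or less lenient.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def versions_match(toolkit: Optional[str], torch_cuda: Optional[str]) -> bool:
    """Strict major.minor comparison (an assumption, not deepspeed's rule)."""
    if toolkit is None or torch_cuda is None:
        return False
    return toolkit.split(".")[:2] == torch_cuda.split(".")[:2]


def installed_toolkit_version() -> Optional[str]:
    """Run nvcc if it is on PATH; return None when it is not installed."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def torch_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built against (None for CPU-only builds)."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    toolkit = installed_toolkit_version()
    built_for = torch_cuda_version()
    print(f"nvcc toolkit: {toolkit}, torch built for: {built_for}")
    print("match" if versions_match(toolkit, built_for) else "MISMATCH")
```

On the pod described here, this would presumably report the toolkit as 11.8 and the PyTorch build as 11.7, which is the mismatch I ran into.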
I think there's a workaround by installing pytorch manually by building the library from scratch, but that doesn't seem worth the hassle, and also defeats the point of the template
But yeah, this was confusing, so I thought I'd bring it up here to see if anyone else had this happen
yep, but templates are reusable, so you wouldn't be "reinstalling" or remaking templates whenever you want to run your pods again
btw @meerkat i'd suggest browsing NGC, NVIDIA's container registry. They have the same pytorch docker image with nvidia drivers too
Ooh that's a great idea, thanks so much