error creating container
error creating container: nvidia-smi: parsing output of line 0: failed to parse (pcie.link.gen.max) into int: strconv.Atoi: parsing "": invalid syntax
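For context, that message comes from a Go component converting nvidia-smi fields with strconv.Atoi. A minimal sketch of the kind of query presumably involved (the exact agent code is an assumption, not confirmed): it reads pcie.link.gen.max via nvidia-smi, and an empty field makes Atoi fail with exactly this error.

```go
// Minimal sketch (assumption: the host agent does something along these lines).
// It queries pcie.link.gen.max via nvidia-smi and converts it with strconv.Atoi;
// if the GPU/driver returns an empty field, Atoi fails just like the error above.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=pcie.link.gen.max", "--format=csv,noheader").Output()
	if err != nil {
		fmt.Println("nvidia-smi failed:", err)
		return
	}
	for i, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		gen, err := strconv.Atoi(strings.TrimSpace(line))
		if err != nil {
			// An empty line reproduces: failed to parse (pcie.link.gen.max) into int
			fmt.Printf("parsing output of line %d: %v\n", i, err)
			continue
		}
		fmt.Printf("GPU %d max PCIe gen: %d\n", i, gen)
	}
}
```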

59 Replies
hi
what gpu r u using
3090
what image..?

i also got same error for 2.4.0
community cloud or secure cloud?
on-demand
mine works fine
i changed Container Disk volume to 200
any changes in CMD?
so it changes from spot to on-demand
no
can you try recreating the thing
can you come on vc and help me out
mine works with a 3090 and that template
i cant speak here
on lounge vc?
im in a library
so
you can just stay muted and watch
and type if anything's wrong
please
okay
why tho
mine starts fine
i tried a different template also
i want to train a cv model
so thats why i am using it
other gpus too?
try A40
maybe that specific 3090 has a problem
yeah
so which is the best gpu
depends on the model
i can use for training my cv model
what's your model
like big or small
if it's small and you wanna be cost effective
A40 / 3090 / 4090
something like this
if it's large and doesn't fit in those gpus
or you wanna do large batch sizes
then go with A100 / H100 / H200 / B100 (this is overkill tho)
@Anmol Sharma did u solve it
@Dj sorry for the ping but I think that
nvidia-smi: failed to parse pcie.link.gen.max
is a HW error. can you check that?
ya sorry something came up, had to go out of my room
ya i will use a40 then
@Anmol Sharma
Escalated To Zendesk
The thread has been escalated to Zendesk!
Ticket ID: #16220
can you open a ticket for that
okay
i have one more problem
so when i upload my dataset it is crashing
lets just talk here
ok
how large is the dataset?
in gigabytes?
yes
i want to use 3 datasets
and what are you uploading with??
one is 32gb, one is 14gb and one is 5gb
hmm..
did you provision enough storage?
where are you uploading to? /workspace?

have a look at this
how many images are in your dataset?
@Anmol Sharma Can you share the pod id of the pod that gave you the error at the top of this thread?
umm i dont have a note of that
you can check at audit logs
give me a sec
sorry for typo
I can grab it too, it's just easier if you already know the id :p
If you can't track it down let me know I'll go find it in your account history.

yeah please help me out
where r u uploading?
in a folder i am creating
and maybe use a network volume, as it is persistent and not subject to data loss
it needs to be in workspace
ya ya
and make volume disk 200gigs
and container disk 20gigs
in that i am creating a subfolder
volume disk -> /workspace
container disk -> everything else
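if you want to double-check that the 200-gig volume actually landed on /workspace (and that the dataset isn't filling the small container disk instead), a quick sketch like this works, assuming a Linux pod:

```go
// Quick sketch (assumes a Linux pod): print total/free space for the container
// root and the /workspace volume so you can confirm the 200 GB volume disk
// is mounted where the dataset will go.
package main

import (
	"fmt"
	"syscall"
)

func report(path string) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		fmt.Printf("%s: %v\n", path, err)
		return
	}
	total := st.Blocks * uint64(st.Bsize)
	free := st.Bavail * uint64(st.Bsize)
	fmt.Printf("%-12s total %6.1f GiB, free %6.1f GiB\n",
		path, float64(total)/(1<<30), float64(free)/(1<<30))
}

func main() {
	report("/")          // container disk
	report("/workspace") // volume disk (persistent)
}
```

plain `df -h /workspace` in the pod terminal tells you the same thing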
let me try and i will get back to you
i am getting such slow speed that a turtle would be faster. the pod has been running for the last 1 hour and the 5 gb dataset is still uploading, only 517.00MB have been uploaded so far

and it exited
its a spot pod
use on demand if you want it to never exit
let me try it
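one thing worth ruling out for the slow upload: if the 5 gb is tens of thousands of small image files, per-file overhead usually dominates the transfer time. a rough sketch (folder and archive names are hypothetical) that packs everything into a single tar.gz before uploading:

```go
// Sketch only: uploading many small image files one by one is usually what makes
// transfers crawl. Packing the dataset into one tar.gz first (paths below are
// hypothetical) and uploading that single archive is typically much faster.
package main

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"log"
	"os"
	"path/filepath"
)

func main() {
	src := "./my_dataset" // hypothetical dataset folder
	out, err := os.Create("my_dataset.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	gz := gzip.NewWriter(out)
	defer gz.Close()
	tw := tar.NewWriter(gz)
	defer tw.Close()

	err = filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		hdr.Name = rel
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```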
I am also seeing this issue.
The image is a self-made image based on nvidia/cuda:12.6.2-cudnn-runtime-ubuntu24.04.
The information I have is
- It occurs in serverless
- RTX3090
- With the same image, this problem sometimes occurs and sometimes does not (if it does occur, it goes into an infinite loop, so I end it with "Terminate")
- It seems to have started occurring recently in EU-CZ-1
- I think it has been occurring frequently in the US for some time now
I have a feeling this might relate to the cuda version, you want to filter for machines with 12.6+
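if the cuda-version theory holds, besides the CUDA filter in the endpoint settings you could also add a guard at container start. a sketch, assuming you control the entrypoint and using the 12.6 threshold mentioned above, that parses the "CUDA Version:" line from the plain nvidia-smi banner:

```go
// Sketch of a start-up guard (assumption: you control the container entrypoint).
// It reads the "CUDA Version: X.Y" line from the plain nvidia-smi banner and
// exits early if the host driver reports anything below 12.6.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strconv"
)

func main() {
	out, err := exec.Command("nvidia-smi").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "nvidia-smi failed:", err)
		os.Exit(1)
	}
	m := regexp.MustCompile(`CUDA Version:\s*(\d+)\.(\d+)`).FindSubmatch(out)
	if m == nil {
		fmt.Fprintln(os.Stderr, "could not find CUDA version in nvidia-smi output")
		os.Exit(1)
	}
	major, _ := strconv.Atoi(string(m[1]))
	minor, _ := strconv.Atoi(string(m[2]))
	if major < 12 || (major == 12 && minor < 6) {
		fmt.Fprintf(os.Stderr, "host CUDA %d.%d is below 12.6, refusing to start\n", major, minor)
		os.Exit(1)
	}
	fmt.Printf("host CUDA %d.%d is OK\n", major, minor)
}
```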