so, i'm wondering what it looks like to modernize ublue-nvidia ...?
i'll thread it 😄
a few things have been brought up
1) we install an selinux policy ... and it provides an selinux domain to use in the container label to use nvidia stuff without full on disabling selinux (at least for the container runtime)
(example:
podman run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi
)
but... @akdev has comments which suggest possibly removing it...
it's a repo which is old (2 years since last update) https://github.com/NVIDIA/dgx-selinux/tree/master/src/nvidia-container-selinux
even then, it was meant to be a reference for RHEL-based systems....not verbatim as used in ublue now...
also... the nvidia-container-toolkit RPM includes oci hooks... but those are deprecated
should we be removing the old hook.json?
and should we instead package a unit to generate /etc/cdi/nvidia.yaml ?
I think so
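(for reference, a rough sketch of what such a unit would end up running, assuming the `nvidia-ctk` CLI from the newer container-toolkit is installed. The `generate_cdi_spec` wrapper name is made up for illustration, and writing under /etc/cdi needs root.)

```shell
# Sketch only: generate a CDI spec if nvidia-ctk is available.
# generate_cdi_spec is a hypothetical helper name, not part of any package.
generate_cdi_spec() {
    out="${1:-/etc/cdi/nvidia.yaml}"
    if ! command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk not found; skipping CDI generation" >&2
        return 1
    fi
    # Make sure the target directory exists, then write the spec.
    mkdir -p "$(dirname "$out")" && nvidia-ctk cdi generate --output="$out"
}
```

a oneshot unit could call something like this at boot so the spec matches the installed driver, but that part is an assumption about how we'd package it.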
i don't really know the consequences of the change... i think with the CDI way, we can use
--device=nvidia.com/gpu=XYX
yeah and we can't use the
-e NVIDIA_VISIBLE_DEVICES
way
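(a small sketch of the CDI-style invocation being discussed, as opposed to the env-var route. `gpu_device_flag` is a hypothetical helper, the `all` default assumes the generated spec defines an `all` device, and the image name is just the one from the earlier example.)

```shell
# Sketch only: build the CDI device flag for a given GPU id (default: all).
# gpu_device_flag is a made-up helper name for illustration.
gpu_device_flag() {
    printf '%s%s\n' '--device=nvidia.com/gpu=' "${1:-all}"
}

# Example (needs podman, a generated CDI spec, and an NVIDIA GPU):
#   podman run --rm --security-opt label=disable "$(gpu_device_flag 0)" \
#       nvcr.io/nvidia/cuda nvidia-smi
```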
other than that not sure
i kinda wonder if all this is too opinionated
wdym
maybe we remove the selinux policy because we can't really verify it's correct in the first place (seems unlikely)
but how much auto-configuring a system for podman to run nvidia is too much?
hmm... would be nice to have the newer container-toolkit ... we are still on 1.13.5 and 1.14 is where CDI becomes "legit"
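(a quick sketch for checking whether an image already has a CDI-capable toolkit, assuming GNU `sort -V` is available. `version_ge` is a hypothetical helper, and the 1.14.0 threshold is just the version mentioned above.)

```shell
# Sketch only: succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort); version_ge is a made-up name.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example on an rpm-based image:
#   installed="$(rpm -q --qf '%{VERSION}' nvidia-container-toolkit)"
#   version_ge "$installed" 1.14.0 && echo "CDI-capable toolkit"
```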
I think in the future the podman team will remove the default hooks-dir, so we'd have to set that instead if that happens
though tbh that seems far in the future so we don't have to move to CDI now
and even then setting the hooks-dir in distrobox assemble or whatever would be easy at that point too
yeah... currently our nvidia-container-toolkit is coming from nvidia's rhel9.0 repo so...
looks like they have a new one:
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
huh, and that actually points to the centos8 repos...
yeah
it's not distro specific anymore seems like
i'm going to test that on the ucore side now since I'm setup there right now
booyah! new container-toolkit
i guess it pays to be hard headed if it leads to progress
and... the oci-hook.json is gone 🙂
i think we have our answer on this part at least
the old SELinux policy still works... so... i guess i'm hesitant to remove it, even if the default way is for people to just
label=disable
it doesn't hurt
fair enough, the main concern is that it may break in the future so if that happens I guess we can remove it
agreed
another interesting side-effect of the upgrade to container-toolkit 1.14
they don't provide a default config.toml anymore
What’s that file?
/etc/nvidia-container-runtime/config.toml
was where we were setting
no-cgroups = true
to enable rootless mode
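(a sketch for checking whether an existing image still carries that rootless override. As far as I know the key is spelled `no-cgroups` in the toolkit's config.toml; `has_no_cgroups` is a hypothetical helper and the default path is the one mentioned above.)

```shell
# Sketch only: succeed if the config file sets no-cgroups = true
# on an uncommented line. has_no_cgroups is a made-up helper name.
has_no_cgroups() {
    cfg="${1:-/etc/nvidia-container-runtime/config.toml}"
    [ -f "$cfg" ] &&
        grep -Eq '^[[:space:]]*no-cgroups[[:space:]]*=[[:space:]]*true' "$cfg"
}

# Example:
#   has_no_cgroups && echo "rootless override still present"
```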
i think they fixed it so both rootful and rootless work now
Oh that’s nice
Would explain why that disappeared
very nice!
Try distrobox I guess maybe it even works
(But probably not)
hah, well, i don't have distrobox on ucore
i can layer i suppose
correction! i do have it
what am i trying?
--nvidia on ubuntu or something to test nvidia-smi?
Use --additional-flags to add the nvidia gpu argument
See if the container will start
Error: OCI runtime error: unable to start container "c528757f2a29cbabc3b997067dedd6cd34060896882a52bbf2b4bb8bb2c2fef5": crun: {"msg":"error executing hook /usr/bin/nvidia-ctk (exit code: 1)","level":"error","time":"2023-10-05T03:45:57.174488Z"}
distrobox create --name fedora --additional-flags "--security-opt label=disable --device=nvidia.com/gpu=0"
Yup, same issue
so, distrobox needs an update
Yeah
It’s probably kind of complex update actually
hmm... so... can we set a default device in containers.conf ? 🙂
doesn't look like it
i'll argue the selinux repo is more active than it seems at first glance
11 months ago: https://github.com/NVIDIA/dgx-selinux/commit/b5ae74df51d9ec013b058d9b472df5ea231fc374
~and a fix a couple months ago https://github.com/NVIDIA/dgx-selinux/pull/8~
not yet merged