Devcontainer Template Ignores GPU Limits (All GPUs Visible)

Hey folks 👋 I'm running into an issue with GPU isolation in two different Kubernetes-based Coder templates:
Template A: Uses the Kubernetes (Deployment) template → GPU isolation works as expected. If I select 1 GPU, the container only sees 1 via nvidia-smi.
Template B: Uses the Kubernetes (Devcontainer) template → Even when I select 1 GPU, the container sees all available GPUs on the host.
Both templates configure GPU resources like this:
resources {
  requests = {
    "nvidia.com/gpu" = data.coder_parameter.gpu.value
  }
  limits = {
    "nvidia.com/gpu" = data.coder_parameter.gpu.value
  }
}
One key difference is that in the Devcontainer template, I had to add the following to the security_context:
security_context {
  run_as_user = 0
  privileged  = true
}
I suspect this might be allowing the container to bypass Kubernetes’ GPU isolation, but I’m not sure how to safely lock it down and still allow the build process to succeed. Has anyone dealt with this before? Is there a way to use envbuilder + GPU isolation without needing to run as root/privileged? Any pointers would be much appreciated 🙏
7 Replies
Codercord (7d ago)
Category: Help needed
Product: Coder (v2)
Platform: Linux
Logs: Please post any relevant logs/error messages.
Phorcys (3d ago)
hey, what made you need to add privileged = true and run as root? this shouldn't be needed at all and is very insecure. i also think that's what's causing your GPU isolation issue. could you provide error messages, if any?
Mikel (OP, 3d ago)
Hey @Phorcys, thanks for the reply! I needed to add privileged = true and run_as_user = 0 because without them, the build process was failing with errors related to file access and GPU initialization. Specifically, I was getting errors like:
error building stage: failed to get filesystem from image: error removing lib to make way for new symlink: unlinkat //lib/firmware/nvidia/560.35.05/gsp_ga10x.bin: device or resource busy
It seems that without privileged mode, the container couldn’t access the necessary GPU resources during the build process. From your experience, do you think there’s a way to configure the GPU access more securely without using privileged mode? I suspect that the privileged setting is indeed causing the GPU isolation issue, but I’m not sure how to bypass the file access issues without it. Full logs attached
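For reference, one direction that sometimes comes up for this kind of error is to tell envbuilder to skip the paths that the NVIDIA runtime mounts into the container, instead of running privileged. The fragment below is only a sketch under assumptions: it assumes envbuilder honors an ENVBUILDER_IGNORE_PATHS environment variable (a comma-separated list of paths to leave alone during the build; check the envbuilder docs for your version), the container name and image tag are placeholders, and the path list contains only the directory from the error above plus /var/run (where Kubernetes mounts the service account secret), so it is probably incomplete for a real driver install.

```hcl
# Hypothetical fragment of the container block in the Devcontainer template.
# Names and values marked "assumption" are illustrative, not taken from the thread.
container {
  name  = "dev"                             # assumption: placeholder name
  image = "ghcr.io/coder/envbuilder:latest" # assumption: placeholder tag

  # Assumption: envbuilder reads this variable as a comma-separated list of
  # paths it should not delete or overwrite while unpacking the image.
  # /lib/firmware/nvidia comes from the "device or resource busy" error;
  # /var/run is listed explicitly in case setting the variable replaces
  # envbuilder's defaults.
  env {
    name  = "ENVBUILDER_IGNORE_PATHS"
    value = "/var/run,/lib/firmware/nvidia"
  }

  # Same GPU request/limit as in both templates.
  resources {
    requests = {
      "nvidia.com/gpu" = data.coder_parameter.gpu.value
    }
    limits = {
      "nvidia.com/gpu" = data.coder_parameter.gpu.value
    }
  }

  # privileged is intentionally left out, so the container only gets the
  # devices that Kubernetes allocates to it.
  security_context {
    run_as_user = 0
  }
}
```

If the build still fails, the path in the new unlinkat error is the next candidate to add to the list. With privileged dropped, the GPU limit should again decide which devices the container can see, as in the Deployment template.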
Mikel (OP, 3d ago)
I saw the idea of using privileged = true here: https://github.com/coder/envbuilder/issues/143#issuecomment-2192405828
GitHub
Investigate GPU support · Issue #143 · coder/envbuilder
Some users will want to mount a GPU to an envbuilder-backed workspace. Can we investigate in which scenarios (if any) this works today and if/how we can patch upstream Kaniko to improve the experie...
Phorcys (3d ago)
cc @Atif. i think there are some nvidia images you can use with the drivers preinstalled, which could fix it, but i'm unsure. i'm on the go right now but should be able to get back to you later today once i settle down.
Mikel (OP, 3d ago)
Great! Any kind of guidance here would be very much appreciated 😉
Hey folks, just to add some context: I'm using the repo https://github.com/BrunoQuaresma/envbuilder-gpu-test with the init script configured as /tmp/vectorAdd (as suggested in the GitHub issue). I'm encountering the following error during the build process:
Failed to build: do build: error building stage: failed to get filesystem from image: error removing var/run to make way for new symlink: unlinkat /var/run/secrets/kubernetes.io/serviceaccount/namespace: read-only file system
Phorcys (3d ago)
fyi, it probably won't be today as i was out late. expect a reply this week, and ping me if i forget (we are at KubeCon EU)
