Coder.com•10mo ago

Coder pod count explodes

When coder runs the pod count explodes and most get evicted immediately, how do I resolve this?

40 Replies

Codercord•10mo ago

Category: Help needed
Product: Coder OSS (v2)
Platform: Linux
Logs: Please post any relevant logs/error messages.

Phorcys•10mo ago

@Nax what template are you using? did you use any of the example ones?

naxOP•10mo ago

I followed the k8s deployment guide w/ helm beat for beat

Phorcys•10mo ago

yeah this behavior is pretty weird, could you send the logs for your Coder daemon over?

naxOP•10mo ago

Sure, how do I pull those?

Phorcys•10mo ago

or in DMs if you prefer
grab the pod's logs via kubectl (e.g. kubectl logs coder), assuming you're also running Coder within k8s
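For anyone following along, here is a minimal sketch of pulling those logs. It assumes the namespace and the Deployment are both named `coder`, which is only a guess based on a default Helm install; pass your own names if they differ. The helper name `coder_logs` is made up for illustration.

```shell
# Hypothetical helper: namespace and Deployment name both default to "coder"
# (an assumption, adjust to match your Helm release).
coder_logs() {
  ns="${1:-coder}"
  deploy="${2:-coder}"
  # Last 200 log lines from a pod that belongs to the Deployment.
  kubectl logs -n "$ns" "deployment/$deploy" --tail=200
  # Recent events in the namespace; evictions show up here with a reason.
  kubectl get events -n "$ns" --sort-by=.lastTimestamp
}
```

Run it as `coder_logs` or `coder_logs my-namespace my-release`. If a container restarted before you could read it, `kubectl logs <pod> --previous` prints the logs of the prior container instance.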

naxOP•10mo ago

Yes

Phorcys•10mo ago

to me this is one of two things: either the template is wrong (but that's probably not the case because that shouldn't cause anything to happen at startup), or there's a bug in provisionerd that causes it to spawn a lot of pods/respawn old pods
i've got to go but feel free to ping me so that I can look in a bit

naxOP•10mo ago

scaling it back up to one replica to try and view logs, the pods explode and get evicted before I'm able to view their logs.

Phorcys•10mo ago

even the Coder instance itself?

naxOP•10mo ago

yeah

naxOP•10mo ago

I wish I were more familiar with k8s internals, it's because of these that I'm not wholly confident this is a Coder issue.

Phorcys•10mo ago

oh yeah your node has disk pressure it seems
can you send the output of kubectl get componentstatus?
how used is your disk? or the partition that hosts the root?

naxOP•10mo ago

how should I check that?

naxOP•10mo ago

Phorcys•10mo ago

oh i sent you the wrong command
df -h
let me find the right one i forgot what it is
try kubectl describe nodes
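As a narrower alternative to reading the full `kubectl describe nodes` output, the sketch below prints only each node's DiskPressure condition. The function name `node_disk_pressure` is made up for illustration; the JSONPath filter is the same pattern commonly used to read the `Ready` condition. For context, `kubectl get componentstatus` has been deprecated since Kubernetes v1.19 and does not report node disk state.

```shell
# Print "<node name><TAB><DiskPressure status>" for every node.
# "True" means the kubelet considers the node short on disk and may evict pods.
node_disk_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}
```

A node that reports `True` here is a likely explanation for pods being evicted right after they start.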

naxOP•10mo ago

this is df -h

naxOP•10mo ago

message.txt

Phorcys•10mo ago

you're going to need to extend the disk
k8s disables itself when the free disk space is too low

naxOP•10mo ago

What I don't understand is why it's filled?

Phorcys•10mo ago

right now the partition on your / (/dev/sda3) is 90% used
that I don't know, are you using a lot of different container images in your templates? images can be large
I'd recommend installing ncdu to track down what's taking up the most space
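The 90% figure can be turned into a quick check. This is a sketch only: the function names are made up, and the default limit of 85 is an arbitrary example value, not a kubelet setting. Eviction thresholds are configurable and distributions ship different defaults, so check your own before relying on a number.

```shell
# Percentage (integer) of the filesystem holding PATH that is in use.
used_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Warn and return non-zero when usage meets or exceeds LIMIT.
# The default of 85 is an arbitrary illustrative value.
check_disk() {
  path="${1:-/}"
  limit="${2:-85}"
  pct="$(used_pct "$path")"
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $path is ${pct}% used (limit ${limit}%)"
    return 1
  fi
  echo "OK: $path is ${pct}% used"
}
```

Calling `check_disk /` on the node from this thread would produce a warning, since / was at 90%.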

naxOP•10mo ago

nothing here is huge in terms of image size

naxOP•10mo ago

Phorcys•10mo ago

try to narrow down which directories are the biggest in your /

naxOP•10mo ago

var @ 18G var/lib @ 15G var/lib/rancher @ 15G

Phorcys•10mo ago

what's the output of du -hxs /*?

naxOP•10mo ago

root@k3s:/# du -hxs /*
0       /bin
253M    /boot
4.0K    /cdrom
0       /dev
5.1M    /etc
592K    /home
0       /lib
0       /lib32
0       /lib64
0       /libx32
16K     /lost+found
4.0K    /media
4.0K    /mnt
4.0K    /opt
du: cannot access '/proc/1226/task/1347/fdinfo/327': No such file or directory
du: cannot access '/proc/1226/task/1472/fd/327': No such file or directory
du: cannot access '/proc/1226/task/1530/fdinfo/327': No such file or directory
du: cannot access '/proc/1846183/task/1846183/fd/4': No such file or directory
du: cannot access '/proc/1846183/task/1846183/fdinfo/4': No such file or directory
du: cannot access '/proc/1846183/fd/3': No such file or directory
du: cannot access '/proc/1846183/fdinfo/3': No such file or directory
0       /proc
26M     /root
7.7M    /run
0       /sbin
8.0K    /snap
4.0K    /srv
4.1G    /swap.img
0       /sys
60K     /tmp
3.0G    /usr
18G     /var
root@k3s:/#
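The `cannot access '/proc/...'` errors above appear because the `/*` glob hands /proc to du as an explicit argument, so `-x` (stay on one filesystem) does not skip it. A minimal sketch that avoids this and sorts the result follows; the function name `biggest_dirs` is made up for illustration.

```shell
# List the COUNT largest entries directly under DIR, biggest first, in KiB.
# The first line is the total for DIR itself.
# -x keeps du on DIR's filesystem, so mounts such as /proc are not descended into.
biggest_dirs() {
  dir="${1:-/}"
  count="${2:-10}"
  du -xk -d 1 "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Starting with `biggest_dirs /` and then repeating on the largest entry (here `biggest_dirs /var/lib/rancher`) narrows things down one level at a time. On a default k3s install, container images live under /var/lib/rancher/k3s/agent/containerd and local-path volumes under /var/lib/rancher/k3s/storage, so those two are the usual suspects.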

Phorcys•10mo ago

yeah honestly you're going to need to extend that disk
PersistentVolumeClaims have a maximum size; if that maximum isn't available, there's probably some safety mechanism in place
you can try to see which volume is the biggest, but really i'm not surprised by the space it's taking up since you have images and user data

naxOP•9mo ago

But that's the thing, everything I'm running currently isn't using the disk space, only Coder is
mm, seems I upped the disk space but didn't extend the partition. I still think 30 GB is plenty for what I had running, but I'll expand it to 128 GB and see whether the issue subsides
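Enlarging a virtual disk does not enlarge the partition or the filesystem on it, which is the gap described here. The sketch below only prints the usual sequence and runs nothing. It rests on several assumptions that may not hold for this server: ext4 directly on the partition, no LVM, the partition being the last one on the disk, and `growpart` being installed (it ships in cloud-guest-utils on Debian/Ubuntu). The function name `grow_plan` is made up, and a snapshot or backup should come first.

```shell
# Print (do NOT run) the usual commands for growing an ext4 partition in place.
# Assumptions: no LVM, ext4 directly on the partition, partition is last on disk,
# and partitions are named <disk><number> (true for /dev/sdX, not for NVMe).
grow_plan() {
  disk="$1"   # e.g. /dev/sda
  part="$2"   # e.g. 3
  echo "lsblk $disk"                # compare disk size with partition size
  echo "growpart $disk $part"       # extend the partition into the free space
  echo "resize2fs ${disk}${part}"   # grow the ext4 filesystem to fill it
}
```

`grow_plan /dev/sda 3` prints three commands to review before running any of them by hand.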

Phorcys•9mo ago

what do you mean? do your workspaces not have persistent data?

naxOP•9mo ago

Sorry, I meant isn't using that much disk space
Most things are a couple of MB, Coder exploded into using GB

Phorcys•9mo ago

do you have a lot of user data ?

naxOP•9mo ago

Phorcys•9mo ago

could you try to pinpoint what takes the most space in your volumes?

naxOP•9mo ago

uhh no I tried to expand the disk and now the server won't boot 🙃

Phorcys•9mo ago

as seen w/ @Nax, it was indeed a disk space issue

Codercord•9mo ago

@Phorcys closed the thread.
