"Too many open files in system"
I am using many cpu3c-2-4 pods in the RO region, all working off of the same volume, and keep running into the "Too many open files" error. The error only happens in CPU pods, and only when many different pods are working with many different files, such as during large apt-get installs and large tar gzips. I have tried setting ulimit -n [LARGE_NUMBER], but this does not fix the error.
Any ideas?
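(A note on the two limits involved, assuming a standard Linux /proc: "Too many open files in system" is usually the kernel-wide fs.file-max table running out, whereas ulimit -n only changes the per-process limit that produces the plain "Too many open files" message. These commands separate the two:)
ulimit -Sn && ulimit -Hn    # per-process soft/hard limits (what ulimit -n changes)
cat /proc/sys/fs/file-max   # system-wide ceiling on open file handles
cat /proc/sys/fs/file-nr    # handles currently allocated / unused / maximum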
will try to reproduce this; if you have an easy command to reproduce, please share
@flash-singh Hmm yeah, the error is occurring inconsistently. Running fine at the moment...
Something interesting: after the "Too many open files" error occurs, running lsof -u root | wc -l also errors with lsof: can't fopen(/proc/mounts, "r"): No such file or directory. I then try ps -e and get an error telling me to run mount -t proc proc /proc. I run this, then lsof starts working again.
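(If lsof keeps breaking, a rough way to count descriptors straight from /proc, assuming it is mounted, might look like this:)
mount -t proc proc /proc      # restore /proc if it has gone missing
ls /proc/self/fd | wc -l      # descriptors held by the current shell
for p in /proc/[0-9]*; do
  printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "$p"
done | sort -rn | head        # processes holding the most descriptors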
During a large gzip, before the error, lsof -u root | wc -l returns about 344 open files, so nowhere near the limit set by ulimit.
@flash-singh can a host machine's limit influence the Docker container's limit?
If you have more than one pod on a host, then it would make sense
no, cpu pods run in a vm, hence why you can use docker in cpu pods; i would have to debug more to see why that error occurs since the container has root access
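(One way to check where a limit is actually coming from inside the pod, assuming /proc is mounted, is to compare what PID 1 inherited from the runtime with what the shell sees:)
grep 'open files' /proc/1/limits      # limit inherited by the pod's init process
grep 'open files' /proc/self/limits   # limit of the current shell
cat /proc/sys/fs/nr_open              # kernel ceiling for any per-process hard limit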
By the way, why don’t you want to provide GPU pods as a VM too?
we plan to, we initially launched without doing that and that's why they currently don't do the same, also gpus in a vm require more work than cpus
CPU pods have been running fine this morning after switching from runpod/base:0.5.1-cpu to ubuntu:latest
@Merrell do we use ubuntu:latest for runpod/base:0.5.1-cpu?
Looks like it's ubuntu:20.04, not ubuntu:latest
https://github.com/runpod/containers/blob/main/official-templates/base/docker-bake.hcl#L19
ubuntu:20.04, would latest be preferred?
we can run tests, the big issue here is the too many open files error
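(If the VM-wide table is what fills up, a possible interim workaround to test, assuming sysctl is writable inside the CPU pod VM and any nested containers are started with docker run; the numbers here are arbitrary examples:)
sysctl -w fs.file-max=2097152                                           # raise the VM-wide open-file ceiling
docker run --ulimit nofile=65535:65535 ubuntu:20.04 sh -c 'ulimit -n'   # per-container limit for nested containers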
@flash-singh / @Merrell
Just wanted to note that I came across this issue helping my friend with CPU Pods, when she was processing a bunch of images using Keras
Unable to set ulimit either. Can't share the data b/c it's for private research, but just wanted to flag that this basically makes using tensorflow / keras not that helpful on CPU Pods.
We were processing about 9000 images with keras
I can DM for more info if necessary; not sure how to easily share a repro because of the amount of files / time it takes to run the script
We ran it three times on CPU Pods, trying various things to unblock it, and just ended up moving to GPU Pods, where it is not an issue
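(For the Keras runs, one thing that sometimes helps, assuming the hard limit is higher than the soft one, is raising the soft limit in the same shell that launches the job; the script name below is just a placeholder. If the pod's system-wide limit is what's exhausted, this won't help and it has to be raised on the VM side.)
ulimit -n "$(ulimit -Hn)"     # raise the soft limit to the hard limit for this shell
python3 preprocess_images.py  # placeholder for the Keras preprocessing script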
got it, will run some tests; on gpu pods we set higher limits, on cpu pods we don't set any limit, maybe there's a default limit applied