Infinite worker boot/crash loop
Hey, we've been hosting our fastapi backend on railway for a while. Recently, new deploys are getting into an infinite worker boot + crash loop. Trace:
Any idea what's going on? RLIMIT_NPROC doesn't seem like it should be -1
Project ID:
c0204309-5720-4cbf-a480-1314cae460cd
can you guys take a look? this is causing some downtime for our entire site
@Brody
#🛂|readme #5
lol
my bad - is there a way to escalate this on a diff channel?
this server is community support, and things only really get escalated when there is an issue with the platform itself
AFAIK we haven't changed any of our own settings - which is why im jumping into the discord. I think it's a platform thing
could be wrong but drawing blanks on our end after debugging for a bit
uvicorn right?
yeah
whats your start command
export WORKER_CLASS=${WORKER_CLASS:-"uvicorn.workers.UvicornWorker"}
export GUNICORN_CONF=${GUNICORN_CONF:-$DEFAULT_GUNICORN_CONF}
gunicorn --worker-class "$WORKER_CLASS" --config "$GUNICORN_CONF" "$APP_MODULE"
one sec grabbing the config
i need a simple answer here, how many workers does this end up running with?
(on railway)
MAX_WORKERS is set to 3 on the railway config panel
thats cool, but is gunicorn seeing that number?
yes
how can you be sure
web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)
that doesnt really prove anything
workers = web_concurrency
that doesnt really prove anything
log_data = {"loglevel": loglevel, "workers": workers, "bind": bind,
again, not proving much
id like to see logs that say something like "spawning 3 workers"
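For reference, a minimal sketch of a gunicorn config in the same shape as the snippets above — variable names copied from the excerpts, but the MAX_WORKERS fallback and the defaults are assumptions, not the exact file — that would print the resolved worker count at boot:

# sketch of a gunicorn_conf.py that resolves workers from env and logs it at boot
import json
import multiprocessing
import os

web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)
max_workers_str = os.getenv("MAX_WORKERS", None)  # assumed: the value from the Railway panel

if web_concurrency_str:
    web_concurrency = int(web_concurrency_str)
else:
    # gunicorn's suggested default, (2 x num_cores) + 1, capped by MAX_WORKERS if set
    web_concurrency = (2 * multiprocessing.cpu_count()) + 1
    if max_workers_str:
        web_concurrency = min(web_concurrency, int(max_workers_str))

# gunicorn settings
workers = web_concurrency
bind = os.getenv("BIND", "0.0.0.0:8000")
loglevel = os.getenv("LOG_LEVEL", "info")

# print the resolved values at boot so the deploy logs show the actual worker count
log_data = {"loglevel": loglevel, "workers": workers, "bind": bind}
print(json.dumps(log_data))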
Jumping in bc i deployed this stuff and am more familiar
These are the logs from boot
we do 10 workers
none of this has changed in months
issue started this week
lol is it 10 or 3
It is 10.
I can show you the worker boot logs if you really want proof
is this service being run on a developer account?
Wdym
nah you seem trustworthy
lol
gatekeeping
access
it aint like that
whos account runs the service with the issues?
It's a team I created under my account
we're both on it
odd you dont have a team badge
Lol we dont really spend our time on your discord
fair
do you have any priority channels or slack for teams?
would be willing to pay for a tier with SLAs
priority support is done through email afaik
[email protected]
@Brody out of curiosity are you a FTE at railway or is this volunteer discord?
FTE?
full time employee
ah im just a volunteer
ahh gotcha
Oh that makes more sense
and im not even a python dev
so that should make even more sense
true
i try
lmaoo
bless your soul
now since im not a python dev.. have you tried googling your error?
anyways, will reach out to support. but any idea what's going on here? is there a max number of workers supported per container? not really sure what the auto-scaling scheme is here.
im confident that this wouldn't be an error specific to railway, since you could definitely run into such errors elsewhere too, but it is perplexing
what's the max number of threads we can run on a container
"is there a max number of workers supported per container?"
thats not really defined by railway, you have 8 vcpu so thats a max of 17 workers according to gunicorns docs
(2 x $num_cores) + 1
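As a quick check of that formula against the 8 vCPU mentioned above:

# gunicorn's suggested sizing for an 8-core container (core count assumed from above)
num_cores = 8
suggested = (2 * num_cores) + 1
print(suggested)  # -> 17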
"not really sure what the auto-scaling scheme is here."
its vertical scaling per container
any reason why RLIMIT_NPROC is -1? instead of being set to the number of threads allowed (usually set at a system level in linux systems)
RLIMIT_NPROC is not something railway sets
no idea on why it would be -1 though
yeah, vs not being set at all
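One way to sanity-check what the container actually reports — a small stdlib-only sketch, nothing Railway-specific assumed:

import resource

# RLIMIT_NPROC caps how many processes/threads this user can create
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

# "unlimited" comes back as resource.RLIM_INFINITY, which is what prints as -1
print(soft, hard, soft == resource.RLIM_INFINITY)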
can i see a screenshot of metrics from a service that did work with 10 workers?
yeah anything in particular you're looking for
this issue is intermittent, most of the time it boots fine
all of the services in the backend spin up on boot. some of them probably have libs that have their own threading under the hood. probably some race condition depending on the order of them actually executing/connecting
whether it's using too many resources, since 10 workers may be too many if each one uses a lot
yeah, tbh its opaque on the usage side bc libs like OpenBLAS will just look for RLIMIT_NPROC
I'm just gonna manually set that var and see if it resolves the issue
assuming its not going to break anything on the railway side of things
it wont
how do you decide to scale? is it % cpu usage?
does OPENBLAS_NUM_THREADS mean anything to you?
https://stackoverflow.com/a/57549064
theres no auto scaling?
yeah, that's what numpy uses under the hood for threading. setting it to 1 disables multithreading in the c parts of numpy. bad for performance but something that ppl sometimes choose to do to prevent peaky behavior in their autoscaler
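If you do go that route, the variable has to be in the environment before numpy/OpenBLAS is first imported — a minimal sketch, assuming it's done at the top of the app's entry module:

import os

# must run before numpy (or anything that imports it) is loaded,
# otherwise OpenBLAS has already sized its thread pool
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")

import numpy as np  # noqa: E402 - imported after setting the env var on purpose

print("numpy imported with OPENBLAS_NUM_THREADS =", os.environ["OPENBLAS_NUM_THREADS"])

(setting it as a service variable in the Railway panel would have the same effect, since it's already in the environment when the process starts)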
how do you decide when to spin up another container
we dont, its a fixed number
ahhhhh okay
wait so you just vertically scale a fixed number of containers?
by default its 1, otherwise its the number you have specified in the replica amount
vertical in the sense that your app on the team account gets 32 vcpu and 32gb ram and is free to use up to that
nothing is automatic at this time
oh woah, did not realize that
do you have any replicas?
so there is a theoretical max qps we can handle. good to know
im looking for where replicas are in the dashboard
nothing set explicitly atm
in the service settings, but adding replicas will only make things worse, you need your singular service to be stable first
well yeah, that's a separate question
gotcha
try 4 workers?
yeah i mean that should make it less likely to happen since there are fewer web workers. but ideally this should be dynamic so we can saturate the resources available
well your code would be in charge of per container worker scaling
hey Tony! and would be under service settings
lol