Infinite worker boot/crash loop

Hey, we've been hosting our fastapi backend on railway for a while. Recently, new deploys are getting into an infinite worker boot + crash loop. Trace:
OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
Any idea what's going on? RLIMIT_NPROC doesn't seem like it should be -1
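(Side note for anyone reading later: the -1 here is most likely OpenBLAS printing RLIM_INFINITY, i.e. the process limit is "unlimited" rather than unset. A minimal way to check from inside the container, assuming Python is available - this is a sketch, not output from the actual deploy:)

# Sketch: inspect RLIMIT_NPROC from inside the container.
# On Linux, resource.RLIM_INFINITY == -1, which would match the
# "-1 current, -1 max" that OpenBLAS prints (assumption, not verified here).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC soft:", soft, "hard:", hard)
print("unlimited:", soft == resource.RLIM_INFINITY)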
75 Replies
Percy
Percy2y ago
Project ID: c0204309-5720-4cbf-a480-1314cae460cd
justinw
justinw2y ago
c0204309-5720-4cbf-a480-1314cae460cd can you guys take a look? this is causing some downtime for our entire site @Brody
Brody
Brody2y ago
#🛂|readme #5
justinw
justinw2y ago
lol my bad - is there a way to escalate this on a diff channel?
Brody
Brody2y ago
this server is community support, and things only really get escalated when there is an issue with the platform itself
justinw
justinw2y ago
AFAIK we haven't changed any of our own settings - which is why im jumping into the discord. I think it's a platform thing - could be wrong, but we're drawing blanks on our end after debugging for a bit
Brody
Brody2y ago
uvicorn right?
justinw
justinw2y ago
yeah
Brody
Brody2y ago
whats your start command
justinw
justinw2y ago
export WORKER_CLASS=${WORKER_CLASS:-"uvicorn.workers.UvicornWorker"}
export GUNICORN_CONF=${GUNICORN_CONF:-$DEFAULT_GUNICORN_CONF}
gunicorn --worker-class "$WORKER_CLASS" --config "$GUNICORN_CONF" "$APP_MODULE"
one sec grabbing the config
import json
import multiprocessing
import os

workers_per_core_str = os.getenv("WORKERS_PER_CORE", "1")
max_workers_str = os.getenv("MAX_WORKERS")
use_max_workers = None
if max_workers_str:
    use_max_workers = int(max_workers_str)
web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)

host = os.getenv("HOST", "0.0.0.0")
port = os.getenv("PORT", "80")
bind_env = os.getenv("BIND", None)
use_loglevel = os.getenv("LOG_LEVEL", "debug")
if bind_env:
    use_bind = bind_env
else:
    use_bind = f"{host}:{port}"

cores = multiprocessing.cpu_count()
workers_per_core = float(workers_per_core_str)
default_web_concurrency = workers_per_core * cores
if web_concurrency_str:
    web_concurrency = int(web_concurrency_str)
    assert web_concurrency > 0
else:
    web_concurrency = max(int(default_web_concurrency), 2)
    if use_max_workers:
        web_concurrency = min(web_concurrency, use_max_workers)
accesslog_var = os.getenv("ACCESS_LOG", "-")
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" [worker: %(p)s]'
use_accesslog = accesslog_var or None
errorlog_var = os.getenv("ERROR_LOG", "-")
use_errorlog = errorlog_var or None
graceful_timeout_str = os.getenv("GRACEFUL_TIMEOUT", "120")
timeout_str = os.getenv("TIMEOUT", "120")
keepalive_str = os.getenv("KEEP_ALIVE", "5")

# Gunicorn config variables
loglevel = use_loglevel
workers = web_concurrency
bind = use_bind
errorlog = use_errorlog
worker_tmp_dir = "/dev/shm"
accesslog = use_accesslog
graceful_timeout = int(graceful_timeout_str)
timeout = int(timeout_str)
keepalive = int(keepalive_str)


# For debugging and testing
log_data = {
    "loglevel": loglevel,
    "workers": workers,
    "bind": bind,
    "graceful_timeout": graceful_timeout,
    "timeout": timeout,
    "keepalive": keepalive,
    "errorlog": errorlog,
    "accesslog": accesslog,
    # Additional, non-gunicorn variables
    "workers_per_core": workers_per_core,
    "use_max_workers": use_max_workers,
    "host": host,
    "port": port,
    "limit_request_line": 0,
    "limit_request_fields": 0,
    "limit_request_field_size": 0,
    "access_log_format": access_log_format,
}
print(json.dumps(log_data))
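(Worked example of how that config resolves the worker count - a sketch with assumed values, not pulled from the actual service:)

# Sketch: how gunicorn_conf.py above picks "workers" (hypothetical inputs).
cores = 8                     # assumption: an 8 vCPU container
workers_per_core = 1.0        # WORKERS_PER_CORE default
web_concurrency_env = None    # WEB_CONCURRENCY unset in this scenario
max_workers_env = 3           # MAX_WORKERS=3, as set in the Railway panel

default_web_concurrency = workers_per_core * cores  # 8.0
if web_concurrency_env:
    # an explicit WEB_CONCURRENCY wins outright; MAX_WORKERS is never consulted
    workers = int(web_concurrency_env)
else:
    workers = max(int(default_web_concurrency), 2)  # 8
    if max_workers_env:
        workers = min(workers, max_workers_env)     # clamped to 3

print(workers)  # 3 in this scenario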
Brody
Brody2y ago
i need a simple answer here, how many workers does this end up running with? (on railway)
justinw
justinw2y ago
MAX_WORKERS is set to 3 on the railway config panel
Brody
Brody2y ago
thats cool, but is gunicorn seeing that number?
justinw
justinw2y ago
yes
Brody
Brody2y ago
how can you be sure
justinw
justinw2y ago
web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)
Brody
Brody2y ago
that doesnt really prove anything
justinw
justinw2y ago
workers = web_concurrency
Brody
Brody2y ago
that doesnt really prove anything
justinw
justinw2y ago
log_data = {
"loglevel": loglevel, "workers": workers, "bind": bind,
Brody
Brody2y ago
again, not proving much
id like to see logs that say something like "spawning 3 workers"
tonyfrancisv
tonyfrancisv2y ago
Jumping in bc i deployed this stuff and am more familiar
GUNICORN_CONF=/backend/app/gunicorn_conf.py
Checking for script in /backend/prestart.sh
Running script /backend/prestart.sh
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO:__main__:Creating initial data
INFO:__main__:Initial data created
Starting Gunicorn with app.main:app
[2023-06-06 15:17:58 +0000] [1] [DEBUG] Current configuration:
config: /backend/app/gunicorn_conf.py
wsgi_app: None
bind: ['0.0.0.0:5468']
backlog: 2048
workers: 10
worker_class: uvicorn.workers.UvicornWorker
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 120
graceful_timeout: 120
keepalive: 5
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /backend
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: /dev/shm
user: 0
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: -
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" [worker: %(p)s]
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: app.main:app
pythonpath: None
paste: None
These are the logs from boot. We run 10 workers. None of this has changed in months - the issue started this week
Brody
Brody2y ago
lol is it 10 or 3
tonyfrancisv
tonyfrancisv2y ago
It is 10. I can show you the worker boot logs if you really want proof
Brody
Brody2y ago
is this service being run on a developer account?
tonyfrancisv
tonyfrancisv2y ago
Wdym
Brody
Brody2y ago
nah you seem trustworthy
tonyfrancisv
tonyfrancisv2y ago
lol
justinw
justinw2y ago
gatekeeping access
Brody
Brody2y ago
it aint like that
whos account runs the service with the issues?
tonyfrancisv
tonyfrancisv2y ago
It's a team I created under my account, we're both on it
Brody
Brody2y ago
odd you dont have a team badge
tonyfrancisv
tonyfrancisv2y ago
Lol we dont really spend our time on your discord
Brody
Brody2y ago
fair
tonyfrancisv
tonyfrancisv2y ago
do you have any priority channels or slack for teams? would be willing to pay for a tier with SLAs
Brody
Brody2y ago
priority support is done through email afaik [email protected]
justinw
justinw2y ago
@Brody out of curiosity are you a FTE at railway or is this volunteer discord?
Brody
Brody2y ago
FTE?
justinw
justinw2y ago
full time employee
Brody
Brody2y ago
ah im just a volunteer
justinw
justinw2y ago
ahh gotcha
tonyfrancisv
tonyfrancisv2y ago
Oh that makes more sense
Brody
Brody2y ago
and im not even a python dev so that should make even more sense
justinw
justinw2y ago
true
Brody
Brody2y ago
i try
tonyfrancisv
tonyfrancisv2y ago
lmaoo bless your soul
Brody
Brody2y ago
now since im not a python dev.. have you tried googling your error?
tonyfrancisv
tonyfrancisv2y ago
anyways, will reach out to support. but any idea what's going on here? is there a max number of workers supported per container? not really sure what the auto-scaling scheme is here.
Brody
Brody2y ago
im confident that this wouldn't be an error specific to railway, since you definitely could run into such errors elsewhere too, but it is perplexing
tonyfrancisv
tonyfrancisv2y ago
what's the max number of threads we can run on a container
Brody
Brody2y ago
is there a max number of workers supported per container?
thats not really defined by railway, you have 8 vcpu so thats a max of 17 workers according to gunicorns docs: (2 x $num_cores) + 1
not really sure what the auto-scaling scheme is here.
its vertical scaling per container
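(A quick sketch of that rule of thumb, assuming the 8 vCPU figure above:)

# Sketch: gunicorn's suggested worker count, (2 x num_cores) + 1.
import multiprocessing

cores = multiprocessing.cpu_count()  # caveat: inside a container this can report the host's core count
suggested = (2 * cores) + 1          # with the 8 vCPU mentioned above: (2 * 8) + 1 = 17
print(cores, suggested)

That cpu_count caveat may also be relevant to the original error: OpenBLAS trying 64 threads suggests it is seeing the host's cores rather than the container's allocation.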
tonyfrancisv
tonyfrancisv2y ago
any reason why RLIMIT_NPROC is -1? instead of being set to the number of threads allowed, which is usually set at a system level on linux systems
Brody
Brody2y ago
RLIMIT_NPROC is not something railway sets, no idea why it would be -1 though
tonyfrancisv
tonyfrancisv2y ago
yeah, vs not being set at all
Brody
Brody2y ago
can i see a screenshot of metrics from a service that did work with 10 workers?
tonyfrancisv
tonyfrancisv2y ago
yeah, anything in particular you're looking for? this issue is intermittent, most of the time it boots fine. all of the services in the backend spin up on boot. some of them probably have libs that do their own threading under the hood. probably some race condition depending on the order of them actually executing/connecting
Brody
Brody2y ago
using too many resources - 10 workers may be too many if they are all using a lot of resources
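(Rough arithmetic on why this can blow up - a sketch built from the numbers in the trace and boot log, not measured values:)

# Back-of-the-envelope thread count at boot (assumed numbers, not measured).
gunicorn_workers = 10   # from the boot log above
openblas_threads = 64   # "thread 36 of 64" in the trace suggests OpenBLAS wants 64 threads per worker
total_threads = gunicorn_workers * openblas_threads
print(total_threads)    # ~640 threads, before counting the app's own pools
# If the container's thread/pid budget is below that, pthread_create fails with
# "Resource temporarily unavailable", which matches the crash-loop trace.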
tonyfrancisv
tonyfrancisv2y ago
yeah, tbh its opaque on the usage side bc libs like OpenBLAS will just look for RLIMIT_NPROC. I'm just gonna manually set that var and see if it resolves the issue, assuming its not going to break anything on the railway side of things
Brody
Brody2y ago
it wont
tonyfrancisv
tonyfrancisv2y ago
how do you decide to scale? is it % cpu usage?
Brody
Brody2y ago
does OPENBLAS_NUM_THREADS mean anything to you? https://stackoverflow.com/a/57549064
theres no auto scaling?
tonyfrancisv
tonyfrancisv2y ago
yeah, that's what numpy uses under the hood for threading. setting it to 1 disables multithreading in the c parts of numpy. bad for performance, but something that ppl sometimes choose to do to prevent peaky behavior in their autoscaler. how do you decide when to spin up another container?
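(For reference, a minimal sketch of capping those thread pools - the env var names are the standard ones, the value of 1 is just the example discussed above:)

# Sketch: cap native BLAS thread pools; these must be set before numpy is imported.
import os

os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")  # OpenBLAS pool size
os.environ.setdefault("OMP_NUM_THREADS", "1")       # OpenMP-backed BLAS builds
os.environ.setdefault("MKL_NUM_THREADS", "1")       # only matters if MKL is the backend

import numpy as np  # imported after the caps so they take effect

In practice these would probably live in the Railway service variables or the start command so every gunicorn worker inherits them.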
Brody
Brody2y ago
we dont, its a fixed number
tonyfrancisv
tonyfrancisv2y ago
ahhhhh okay wait so you just vertically scale a fixed number of containers?
Brody
Brody2y ago
by default its 1, otherwise its the number you have specified in the replica amount. vertical in the sense that your app on the team account gets 32 vcpu and 32gb ram and is free to use up to that. nothing is automatic at this time
tonyfrancisv
tonyfrancisv2y ago
oh woah, did not realize that
Brody
Brody2y ago
do you have any replicas?
tonyfrancisv
tonyfrancisv2y ago
so there is a theoretical max qps we can handle, good to know. im looking for where replicas are in the dashboard - nothing set explicitly atm
Brody
Brody2y ago
in the service settings, but adding replicas will only make things worse, you need your singular service to be stable first
tonyfrancisv
tonyfrancisv2y ago
well yeah, that's a separate question
Brody
Brody2y ago
gotcha. try 4 workers?
tonyfrancisv
tonyfrancisv2y ago
yeah i mean that should make it less likely to happen since there are fewer web workers. but ideally this should be dynamic so we can saturate the resources available
Brody
Brody2y ago
well your code would be in charge of per container worker scaling
angelo
angelo2y ago
hey Tony! and would be under service settings
Brody
Brody2y ago
lol