Infinite worker boot/crash loop

Hey, we've been hosting our fastapi backend on railway for a while. Recently, new deploys are getting into an infinite worker boot + crash loop. Trace:
OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
OpenBLAS blas_thread_init: pthread_create failed for thread 36 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
Any idea what's going on? RLIMIT_NPROC doesn't seem like it should be -1
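(Side note for anyone reading later: the -1 here is most likely OpenBLAS printing RLIM_INFINITY, i.e. the process limit is "unlimited" rather than unset. A minimal way to check from inside the container, assuming Python is available - this is a sketch, not output from the actual deploy:)

# Sketch: inspect RLIMIT_NPROC from inside the container.
# On Linux, resource.RLIM_INFINITY == -1, which would match the
# "-1 current, -1 max" that OpenBLAS prints (assumption, not verified here).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("RLIMIT_NPROC soft:", soft, "hard:", hard)
print("unlimited:", soft == resource.RLIM_INFINITY)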
75 Replies
Percy
Percy2y ago
Project ID: c0204309-5720-4cbf-a480-1314cae460cd
justinw
justinw2y ago
c0204309-5720-4cbf-a480-1314cae460cd can you guys take a look? this is causing some downtime for our entire site @Brody
Brody
Brody2y ago
#🛂|readme #5
justinw
justinw2y ago
lol my bad - is there a way to escalate this on a diff channel?
Brody
Brody2y ago
this server is community support, and things only really get escalated when there is an issue with the platform itself
justinw
justinw2y ago
AFAIK we haven't changed any of our own settings - which is why im jumping into the discord. I think it's a platform thing - could be wrong, but we're drawing blanks on our end after debugging for a bit
Brody
Brody2y ago
uvicorn right?
justinw
justinw2y ago
yeah
Brody
Brody2y ago
whats your start command
justinw
justinw2y ago
export WORKER_CLASS=${WORKER_CLASS:-"uvicorn.workers.UvicornWorker"}
export GUNICORN_CONF=${GUNICORN_CONF:-$DEFAULT_GUNICORN_CONF}
gunicorn --worker-class "$WORKER_CLASS" --config "$GUNICORN_CONF" "$APP_MODULE"
one sec grabbing the config
import json
import multiprocessing
import os

workers_per_core_str = os.getenv("WORKERS_PER_CORE", "1")
max_workers_str = os.getenv("MAX_WORKERS")
use_max_workers = None
if max_workers_str:
    use_max_workers = int(max_workers_str)
web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)

host = os.getenv("HOST", "0.0.0.0")
port = os.getenv("PORT", "80")
bind_env = os.getenv("BIND", None)
use_loglevel = os.getenv("LOG_LEVEL", "debug")
if bind_env:
    use_bind = bind_env
else:
    use_bind = f"{host}:{port}"

cores = multiprocessing.cpu_count()
workers_per_core = float(workers_per_core_str)
default_web_concurrency = workers_per_core * cores
if web_concurrency_str:
    web_concurrency = int(web_concurrency_str)
    assert web_concurrency > 0
else:
    web_concurrency = max(int(default_web_concurrency), 2)
    if use_max_workers:
        web_concurrency = min(web_concurrency, use_max_workers)
accesslog_var = os.getenv("ACCESS_LOG", "-")
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" [worker: %(p)s]'
use_accesslog = accesslog_var or None
errorlog_var = os.getenv("ERROR_LOG", "-")
use_errorlog = errorlog_var or None
graceful_timeout_str = os.getenv("GRACEFUL_TIMEOUT", "120")
timeout_str = os.getenv("TIMEOUT", "120")
keepalive_str = os.getenv("KEEP_ALIVE", "5")

# Gunicorn config variables
loglevel = use_loglevel
workers = web_concurrency
bind = use_bind
errorlog = use_errorlog
worker_tmp_dir = "/dev/shm"
accesslog = use_accesslog
graceful_timeout = int(graceful_timeout_str)
timeout = int(timeout_str)
keepalive = int(keepalive_str)


# For debugging and testing
log_data = {
    "loglevel": loglevel,
    "workers": workers,
    "bind": bind,
    "graceful_timeout": graceful_timeout,
    "timeout": timeout,
    "keepalive": keepalive,
    "errorlog": errorlog,
    "accesslog": accesslog,
    # Additional, non-gunicorn variables
    "workers_per_core": workers_per_core,
    "use_max_workers": use_max_workers,
    "host": host,
    "port": port,
    "limit_request_line": 0,
    "limit_request_fields": 0,
    "limit_request_field_size": 0,
    "access_log_format": access_log_format,
}
print(json.dumps(log_data))
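(Worked example of how that config resolves the worker count - a sketch with assumed values, not pulled from the actual service:)

# Sketch: how gunicorn_conf.py above picks "workers" (hypothetical inputs).
cores = 8                     # assumption: an 8 vCPU container
workers_per_core = 1.0        # WORKERS_PER_CORE default
web_concurrency_env = None    # WEB_CONCURRENCY unset in this scenario
max_workers_env = 3           # MAX_WORKERS=3, as set in the Railway panel

default_web_concurrency = workers_per_core * cores  # 8.0
if web_concurrency_env:
    # an explicit WEB_CONCURRENCY wins outright; MAX_WORKERS is never consulted
    workers = int(web_concurrency_env)
else:
    workers = max(int(default_web_concurrency), 2)  # 8
    if max_workers_env:
        workers = min(workers, max_workers_env)     # clamped to 3

print(workers)  # 3 in this scenario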
Brody
Brody2y ago
i need a simple answer here, how many workers does this end up running with? (on railway)
justinw
justinw2y ago
MAX_WORKERS is set to 3 on the railway config panel
Brody
Brody2y ago
thats cool, but is gunicorn seeing that number?
justinw
justinw2y ago
yes
Brody
Brody2y ago
how can you be sure
justinw
justinw2y ago
web_concurrency_str = os.getenv("WEB_CONCURRENCY", None)
Brody
Brody2y ago
that doesnt really prove anything
justinw
justinw2y ago
workers = web_concurrency
Brody
Brody2y ago
that doesnt really prove anything
justinw
justinw2y ago
log_data = {
"loglevel": loglevel, "workers": workers, "bind": bind,
Brody
Brody2y ago
again, not proving much
id like to see logs that say something like "spawning 3 workers"
tonyfrancisv
tonyfrancisv2y ago
Jumping in bc i deployed this stuff and am more familiar
GUNICORN_CONF=/backend/app/gunicorn_conf.py
Checking for script in /backend/prestart.sh
Running script /backend/prestart.sh
INFO [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO [alembic.runtime.migration] Will assume transactional DDL.
INFO:__main__:Creating initial data
INFO:__main__:Initial data created
Starting Gunicorn with app.main:app
[2023-06-06 15:17:58 +0000] [1] [DEBUG] Current configuration:
config: /backend/app/gunicorn_conf.py
wsgi_app: None
bind: ['0.0.0.0:5468']
backlog: 2048
workers: 10
worker_class: uvicorn.workers.UvicornWorker
threads: 1
worker_connections: 1000
max_requests: 0
max_requests_jitter: 0
timeout: 120
graceful_timeout: 120
keepalive: 5
limit_request_line: 4094
limit_request_fields: 100
limit_request_field_size: 8190
reload: False
reload_engine: auto
reload_extra_files: []
spew: False
check_config: False
print_config: False
preload_app: False
sendfile: None
reuse_port: False
chdir: /backend
daemon: False
raw_env: []
pidfile: None
worker_tmp_dir: /dev/shm
user: 0
group: 0
umask: 0
initgroups: False
tmp_upload_dir: None
secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
forwarded_allow_ips: ['127.0.0.1']
accesslog: -
disable_redirect_access_to_syslog: False
access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s" [worker: %(p)s]
errorlog: -
loglevel: debug
capture_output: False
logger_class: gunicorn.glogging.Logger
logconfig: None
logconfig_dict: {}
syslog_addr: udp://localhost:514
syslog: False
syslog_prefix: None
syslog_facility: user
enable_stdio_inheritance: False
statsd_host: None
dogstatsd_tags:
statsd_prefix:
proc_name: None
default_proc_name: app.main:app
pythonpath: None
paste: None
These are the logs from boot. We run 10 workers. None of this has changed in months - the issue started this week
Brody
Brody2y ago
lol is it 10 or 3
tonyfrancisv
tonyfrancisv2y ago
It is 10. I can show you the worker boot logs if you really want proof
Brody
Brody2y ago
is this service being run on a developer account?
tonyfrancisv
tonyfrancisv2y ago
Wdym
Brody
Brody2y ago
nah you seem trustworthy
tonyfrancisv
tonyfrancisv2y ago
lol
justinw
justinw2y ago
gatekeeping access
Brody
Brody2y ago
it aint like that
whos account runs the service with the issues?
tonyfrancisv
tonyfrancisv2y ago
It's a team I created under my account, we're both on it
Brody
Brody2y ago
odd you dont have a team badge
tonyfrancisv
tonyfrancisv2y ago
Lol we dont really spend our time on your discord
Brody
Brody2y ago
fair
tonyfrancisv
tonyfrancisv2y ago
do you have any priority channels or slack for teams? would be willing to pay for a tier with SLAs
Brody
Brody2y ago
priority support is done through email afaik [email protected]
justinw
justinw2y ago
@Brody out of curiosity are you a FTE at railway or is this volunteer discord?
Brody
Brody2y ago
FTE?
justinw
justinw2y ago
full time employee
Brody
Brody2y ago
ah im just a volunteer
justinw
justinw2y ago
ahh gotcha
tonyfrancisv
tonyfrancisv2y ago
Oh that makes more sense
Brody
Brody2y ago
and im not even a python dev so that should make even more sense
justinw
justinw2y ago
true
Brody
Brody2y ago
i try
tonyfrancisv
tonyfrancisv2y ago
lmaoo bless your soul
Brody
Brody2y ago
now since im not a python dev.. have you tried googling your error?
tonyfrancisv
tonyfrancisv2y ago
anyways, will reach out to support. but any idea what's going on here? is there a max number of workers supported per container? not really sure what the auto-scaling scheme is here.
Brody
Brody2y ago
im confident that this wouldn't be an error specific to railway, since you definitely could run into such errors elsewhere too, but it is perplexing
tonyfrancisv
tonyfrancisv2y ago
what's the max number of threads we can run on a container
Brody
Brody2y ago
is there a max number of workers supported per container?
thats not really defined by railway, you have 8 vcpu so thats a max of 17 workers according to gunicorns docs: (2 x $num_cores) + 1
not really sure what the auto-scaling scheme is here.
its vertical scaling per container
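(A quick sketch of that rule of thumb, assuming the 8 vCPU figure above:)

# Sketch: gunicorn's suggested worker count, (2 x num_cores) + 1.
import multiprocessing

cores = multiprocessing.cpu_count()  # caveat: inside a container this can report the host's core count
suggested = (2 * cores) + 1          # with the 8 vCPU mentioned above: (2 * 8) + 1 = 17
print(cores, suggested)

That cpu_count caveat may also be relevant to the original error: OpenBLAS trying 64 threads suggests it is seeing the host's cores rather than the container's allocation.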
tonyfrancisv
tonyfrancisv2y ago
any reason why RLIMIT_NPROC is -1? instead of being set to the number of threads allowed, which is usually set at a system level on linux systems
Brody
Brody2y ago
RLIMIT_NPROC is not something railway sets, no idea why it would be -1 though
tonyfrancisv
tonyfrancisv2y ago
yeah, vs not being set at all
Brody
Brody2y ago
can i see a screenshot of metrics from a service that did work with 10 workers?
tonyfrancisv
tonyfrancisv2y ago
yeah, anything in particular you're looking for? this issue is intermittent, most of the time it boots fine. all of the services in the backend spin up on boot. some of them probably have libs that do their own threading under the hood. probably some race condition depending on the order of them actually executing/connecting
Brody
Brody2y ago
using too many resources - 10 workers may be too many if they are all using a lot of resources
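(Rough arithmetic on why this can blow up - a sketch built from the numbers in the trace and boot log, not measured values:)

# Back-of-the-envelope thread count at boot (assumed numbers, not measured).
gunicorn_workers = 10   # from the boot log above
openblas_threads = 64   # "thread 36 of 64" in the trace suggests OpenBLAS wants 64 threads per worker
total_threads = gunicorn_workers * openblas_threads
print(total_threads)    # ~640 threads, before counting the app's own pools
# If the container's thread/pid budget is below that, pthread_create fails with
# "Resource temporarily unavailable", which matches the crash-loop trace.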
tonyfrancisv
tonyfrancisv2y ago
yeah, tbh its opaque on the usage side bc libs like OpenBLAS will just look for RLIMIT_NPROC. I'm just gonna manually set that var and see if it resolves the issue, assuming its not going to break anything on the railway side of things
Brody
Brody2y ago
it wont
tonyfrancisv
tonyfrancisv2y ago
how do you decide to scale? is it % cpu usage?
Brody
Brody2y ago
does OPENBLAS_NUM_THREADS mean anything to you? https://stackoverflow.com/a/57549064
theres no auto scaling?
tonyfrancisv
tonyfrancisv2y ago
yeah, that's what numpy uses under the hood for threading. setting it to 1 disables multithreading in the c parts of numpy. bad for performance, but something that ppl sometimes choose to do to prevent peaky behavior in their autoscaler. how do you decide when to spin up another container?
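(For reference, a minimal sketch of capping those thread pools - the env var names are the standard ones, the value of 1 is just the example discussed above:)

# Sketch: cap native BLAS thread pools; these must be set before numpy is imported.
import os

os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")  # OpenBLAS pool size
os.environ.setdefault("OMP_NUM_THREADS", "1")       # OpenMP-backed BLAS builds
os.environ.setdefault("MKL_NUM_THREADS", "1")       # only matters if MKL is the backend

import numpy as np  # imported after the caps so they take effect

In practice these would probably live in the Railway service variables or the start command so every gunicorn worker inherits them.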
Brody
Brody2y ago
we dont, its a fixed number
tonyfrancisv
tonyfrancisv2y ago
ahhhhh okay wait so you just vertically scale a fixed number of containers?
Brody
Brody2y ago
by default its 1, otherwise its the number you have specified in the replica amount. vertical in the sense that your app on the team account gets 32 vcpu and 32gb ram and is free to use up to that. nothing is automatic at this time
tonyfrancisv
tonyfrancisv2y ago
oh woah, did not realize that
Brody
Brody2y ago
do you have any replicas?
tonyfrancisv
tonyfrancisv2y ago
so there is a theoretical max qps we can handle, good to know. im looking for where replicas are in the dashboard - nothing set explicitly atm
Brody
Brody2y ago
in the service settings, but adding replicas will only make things worse, you need your singular service to be stable first
tonyfrancisv
tonyfrancisv2y ago
well yeah, that's a separate question
Brody
Brody2y ago
gotcha. try 4 workers?
tonyfrancisv
tonyfrancisv2y ago
yeah i mean that should make it less likely to happen since there are fewer web workers. but ideally this should be dynamic so we can saturate the resources available
Brody
Brody2y ago
well your code would be in charge of per container worker scaling
angelo
angelo2y ago
hey Tony! and would be under service settings
Brody
Brody2y ago
lol