Railway•17mo ago

Flask + Gunicorn app repeatedly getting killed and restarting

Service ID: f9e8d800-7f5f-4cdf-a508-830ce6caf939

We have a Flask app deployed using the Gunicorn server; the start command in our Procfile is:

web: gunicorn -w 1 --threads 300 server:app

We recently did a new deploy and started observing our worker getting repeatedly killed and restarted. Here is the error trace that keeps occurring after each restart:

[2023-07-18 01:52:35 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12962)
Exception ignored in: <module 'threading' from '/root/.nix-profile/lib/python3.9/threading.py'>
Traceback (most recent call last):
File "/root/.nix-profile/lib/python3.9/threading.py", line 1447, in _shutdown
atexit_call()
File "/root/.nix-profile/lib/python3.9/concurrent/futures/thread.py", line 31, in _python_exit
t.join()
File "/root/.nix-profile/lib/python3.9/threading.py", line 1060, in join
self._wait_for_tstate_lock()
File "/root/.nix-profile/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
File "/opt/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
sys.exit(1)
SystemExit: 1
[2023-07-18 01:52:36 +0000] [1] [ERROR] Worker (pid:12962) exited with code 255
[2023-07-18 01:52:36 +0000] [1] [ERROR] Worker (pid:12962) exited with code 255.
[2023-07-18 01:52:36 +0000] [13217] [INFO] Booting worker with pid: 13217

At first we thought this was due to our code changes, but we have since rolled back to a previous deploy that was working fine before, and we are still observing the same restart issue. Our metrics show that memory usage has remained roughly the same, but CPU usage has spiked for some reason, even though traffic to our server has not significantly increased.

18 Replies

Percy•17mo ago

Project ID: f9e8d800-7f5f-4cdf-a508-830ce6caf939

Brody•17mo ago

can you show a screenshot of the service metrics?

andrews46OP•17mo ago

The CPU spike at 5:20pm is when we did the new deploy and started noticing the restarts

Brody•17mo ago

I'm assuming you are part of the pro plan?

andrews46OP•17mo ago

we're on the hobby plan, which should be 8 GB memory, right?

Brody•17mo ago

correct. critical worker timeout means that your code took longer than 30 seconds (the default timeout) to respond to a request. how long should your app take to respond to a request?
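For reference, the start command and the timeout being discussed can also be written as a Gunicorn config file. This is only a sketch: the file name `gunicorn.conf.py` and the choice to spell out the default timeout are assumptions, not something taken from the original project. The setting names (`workers`, `threads`, `timeout`, `loglevel`) are standard Gunicorn settings.

```python
# gunicorn.conf.py (assumed file name; pass it with `gunicorn -c gunicorn.conf.py server:app`)

# Same values as the `-w 1 --threads 300` flags in the start command.
workers = 1
threads = 300

# Seconds a worker may stay silent before the master logs WORKER TIMEOUT
# and restarts it. 30 is Gunicorn's default; raising it only hides a hang.
timeout = 30

# Assumption: more verbose logging helps while investigating restarts.
loglevel = "debug"
```

Raising `timeout` stops the restarts but does not make a stuck worker responsive, so it is better treated as a diagnostic knob than as a fix.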

andrews46OP•17mo ago

generally less than a second or two; we do have one endpoint that is longer running. the thing is, I increased the gunicorn timeout to 1000, and then the restarts stopped, but the server was unresponsive: when I tried to hit an endpoint that should take less than a second, it would hang instead. so I think the timeout was indicating a deeper issue

Brody•17mo ago

could this be an unhandled error from within your code, or maybe you're doing an external API call and that's hanging? because this would be an issue with your app, and not railway specifically

andrews46OP•17mo ago

we've also rolled back to a previous deploy that had been working perfectly fine, and in general we've never encountered this issue in several weeks of having this project live on railway. to test, we also deployed the same code that's currently on railway to heroku, and it's working fine there without any issues

Brody•17mo ago

just because it works locally or on another platform doesn't automatically mean it's an issue with railway. but there's also not a whole lot I can do to help you here; you'd need to find out what's causing your app to hang
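One way to find out where a threaded Python app is stuck is to dump the stack of every thread with the standard library's `faulthandler` module. The sketch below is an illustration, not something used in this thread; the helper name `dump_all_thread_stacks` is hypothetical, and where you call it from (a debug endpoint, a signal handler, a timer) is left open.

```python
import faulthandler
import tempfile


def dump_all_thread_stacks() -> str:
    """Return the current stack of every running thread as text."""
    # faulthandler writes straight to a file descriptor, so use a real file.
    with tempfile.TemporaryFile(mode="w+b") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Printing the dump shows which line each thread is currently executing.
    print(dump_all_thread_stacks())
```

If many threads show the same frame (for example a socket read or a lock acquire), that is a likely candidate for the hang.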

andrews46OP•17mo ago

sure, I'm just trying to think why our previous deploy had been working fine for several days, but now when we redeploy it we're seeing this issue

Brody•17mo ago

failed request to an external API? database connection failed? you'll have to do some digging. I'm sorry that there's not more I can help you with here

andrews46OP•17mo ago

no worries, thanks for getting back to me

Brody•17mo ago

no problem, and when you figure out what's happening I'd love to know about it!

mf82•17mo ago

I had a similar issue a couple of days ago; the culprit was the db connection, which only works if you use service variables. Hardcoding the values in the code didn’t work for some reason. HTH

Brody•17mo ago

relevant docs: https://docs.railway.app/develop/variables#reference-variables

andrews46OP•17mo ago

So it turned out to be a new version of gunicorn that was released yesterday and introduced a breaking change to our code. We've since locked the versions of our packages in our requirements file and now it's working fine again. Thanks for your help!
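As a rough illustration of the fix described above, a small script can flag entries in a requirements file that are not pinned to an exact version with `==`. The helper name `find_unpinned` and the package versions in the example are made up for illustration; they are not the versions from this project.

```python
import re

# A requirement counts as pinned if it has the form `name==version`
# (extras such as `name[extra]==version` are allowed).
_PINNED = re.compile(r"^[A-Za-z0-9_.\-\[\],]+==[^=\s]+")


def find_unpinned(requirement_lines):
    """Return the requirement entries that lack an exact `==` pin."""
    unpinned = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line):
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Illustrative contents only; version numbers are placeholders.
    example = ["flask==2.3.2", "gunicorn", "requests>=2.0  # range", "", "# comment"]
    print(find_unpinned(example))
```

An unpinned `gunicorn` entry like the one in the example is what lets a fresh deploy pick up a newly released version even when the application code has not changed.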

Brody•17mo ago

awesome, glad you were able to figure it out, and thanks for coming back and telling us the problem!
