Flask + Gunicorn app repeatedly getting killed and restarting

Service ID: f9e8d800-7f5f-4cdf-a508-830ce6caf939
We have a Flask app deployed using the Gunicorn server; our start command in our Procfile is:
web: gunicorn -w 1 --threads 300 server:app
We recently did a new deploy and started observing our worker getting repeatedly killed and restarted. Here is the error trace that keeps occurring after each restart:
[2023-07-18 01:52:35 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12962)
Exception ignored in: <module 'threading' from '/root/.nix-profile/lib/python3.9/threading.py'>
Traceback (most recent call last):
File "/root/.nix-profile/lib/python3.9/threading.py", line 1447, in _shutdown
atexit_call()
File "/root/.nix-profile/lib/python3.9/concurrent/futures/thread.py", line 31, in _python_exit
t.join()
File "/root/.nix-profile/lib/python3.9/threading.py", line 1060, in join
self._wait_for_tstate_lock()
File "/root/.nix-profile/lib/python3.9/threading.py", line 1080, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
File "/opt/venv/lib/python3.9/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
sys.exit(1)
SystemExit: 1
[2023-07-18 01:52:36 +0000] [1] [ERROR] Worker (pid:12962) exited with code 255
[2023-07-18 01:52:36 +0000] [1] [ERROR] Worker (pid:12962) exited with code 255.
[2023-07-18 01:52:36 +0000] [13217] [INFO] Booting worker with pid: 13217
At first we thought this was due to our code changes, but we have since rolled back to a previous deploy that was working fine before, and we are still observing the same restart issue. Our metrics show that memory usage has remained roughly the same, but CPU usage has spiked for some reason, even though traffic to our server has not significantly increased.
18 Replies
Percy
Percy12mo ago
Project ID: f9e8d800-7f5f-4cdf-a508-830ce6caf939
Brody
Brody12mo ago
can you show a screenshot of the service metrics?
andrews46
andrews4612mo ago
The CPU spike at 5:20pm is when we did the new deploy and started noticing the restarts
Brody
Brody12mo ago
I'm assuming you are part of the pro plan?
andrews46
andrews4612mo ago
we're on the hobby plan
should be 8 GB memory, right?
Brody
Brody12mo ago
correct
critical worker timeout means that your code took longer than 30 seconds (the default timeout) to respond to a request
how long should your app take to respond to a request?
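For reference, the timeout discussed here is an ordinary Gunicorn setting. A minimal sketch of a config file that mirrors the start command from the original post, with the 30-second default written out explicitly; the file name gunicorn.conf.py and the choice to use a config file at all are assumptions, not something the original poster described:

```python
# gunicorn.conf.py: hypothetical config file, not taken from the poster's project.
# Intended to be equivalent to: gunicorn -w 1 --threads 300 server:app

wsgi_app = "server:app"  # module:callable from the original start command
workers = 1              # same as -w 1
threads = 300            # same as --threads 300
timeout = 30             # seconds; Gunicorn's default, stated explicitly here
```

With a file like this, the start command could become `gunicorn -c gunicorn.conf.py`. Raising `timeout` only changes how long the master waits before killing a silent worker; it does not address whatever keeps the worker from responding.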
andrews46
andrews4612mo ago
generally less than a second or two, we do have one endpoint that is longer running
the thing is, I increased the gunicorn timeout to 1000, and then the restarts stopped, but the server was unresponsive
when I tried to hit an endpoint that should take less than a second, it would hang instead
so I think the timeout was indicating a deeper issue
Brody
Brody12mo ago
could this be an unhandled error from within your code, or maybe you're doing an external API call and that's hanging?
because this would be an issue with your app, and not railway specifically
andrews46
andrews4612mo ago
we've also rolled back to a previous deploy that had been working perfectly fine, and in general we've never encountered this issue in several weeks of having this project live on railway
and to test, we also deployed the same code on heroku that's currently on railway, and it's working fine without any issues
Brody
Brody12mo ago
just because it works locally or on another platform doesn't automatically mean it's an issue with railway
but there's also not a whole lot I can do to help you here, you'd need to find out what's causing your app to hang
andrews46
andrews4612mo ago
sure, I'm just trying to think why our previous deploy had been working fine for several days but now when we try to deploy it we're seeing this issue
Brody
Brody12mo ago
failed request to external API? database connection failed?
you'll have to do some digging
I'm sorry that there's not more I could help you with here
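One way to do that kind of digging is to have the process periodically print where every thread is currently stuck. The sketch below uses Python's standard-library faulthandler module; the helper names arm_hang_dump and disarm_hang_dump and the 20-second interval are hypothetical, and nothing in this thread says the original poster used this approach:

```python
import faulthandler
import sys


def arm_hang_dump(seconds=20.0, file=None):
    """Write the stack of every thread to `file` every `seconds` seconds.

    Hypothetical helper: call it once at app startup. The process keeps
    running; the dumps only show where each thread is blocked.
    """
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(seconds, repeat=True, file=target, exit=False)


def disarm_hang_dump():
    """Stop the periodic dumps scheduled by arm_hang_dump."""
    faulthandler.cancel_dump_traceback_later()
```

If the same frame (for example a socket read or a lock acquire) shows up in dump after dump, that is a reasonable place to start looking for the hang.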
andrews46
andrews4612mo ago
no worries, thanks for getting back to me
Brody
Brody12mo ago
no problem, and when you figure out what's happening I'd love to know about it!
mf82
mf8212mo ago
I had a similar issue a couple of days ago, the culprit was the db connection, which works only if you use service variables. Hardcoding variables in the code didn't work for some reason. HTH
andrews46
andrews4612mo ago
So it turned out to be a new version of gunicorn that was released yesterday and introduced a breaking change to our code. We've since locked the versions of our packages in our requirements file and now it's working fine again. Thanks for your help!
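A minimal sketch of how pinned versions could be checked against what is actually installed, assuming a requirements file that uses exact `name==version` pins. The helper names parse_pins and find_drift are hypothetical, and the version number in the usage note is a placeholder rather than the release that caused the problem here:

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Return {package: version} for every exact `name==version` line."""
    pins = {}
    for raw in text.splitlines():
        # Drop comments and environment markers before looking for a pin.
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # unpinned or range-pinned requirement, skipped here
        name, _, pinned = line.partition("==")
        pins[name.strip()] = pinned.strip()
    return pins


def find_drift(pins):
    """Return {package: (pinned, installed)} wherever the two differ.

    `installed` is None when the package is not installed at all.
    """
    drift = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift
```

For example, `find_drift(parse_pins("gunicorn==1.2.3"))` returns an empty dict only when exactly that placeholder version is installed. Running a check like this at startup would flag a dependency that changed between deploys.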
Brody
Brody12mo ago
awesome, glad you were able to figure it out, and thanks for coming back and telling us the problem!