Railway7mo ago
Ryan!

Intermittent WORKER TIMEOUTS -- appear like a client hanging up

(Sorry for the long post -- I did a lot of digging and wanted to show my work)

Attached image shows a large number of [DEBUG] Ignoring EPIPE captured while running e2e tests (filtered for just EPIPE messages)

Observations
- We have a long-running but intermittent issue where our app times out waiting on the client.
- My hunch is that the network connection from the Railway proxy to our app hangs, the app patiently waits until the timeout, then aborts with the error [101] [ERROR] Error handling request (no URI read)
- We can trigger timeouts when running Cypress end-to-end tests. Hypothesis is that issuing a lot of requests quickly makes the problem more visible.
- The errors happen on different requests; the workers don't appear to be spinning doing work. They appear to be waiting.
- This happens in all of our Railway environments. It has never happened in local development (not running gunicorn).
- New error from Gunicorn with debug logging on: Ignoring EPIPE
- Found a gunicorn + nginx setting, proxy_ignore_client_abort, while reading about this error
- https://github.com/benoitc/gunicorn/issues/1695 - discussion of Nginx and Gunicorn

Around the same time, we see an occasional Postgres log too:
2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer

What steps have we taken
- Added Sentry to Gunicorn to capture errors
- Increased timeout to 120s
- Increased workers to 2
- Turned on Gunicorn debug logging
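For reference, the mitigations above map roughly onto a Gunicorn config like the sketch below. It is only a sketch: the bind address and the Sentry comment are assumptions, not our exact start command.

# gunicorn.conf.py -- sketch of the settings described above; the bind address
# is a placeholder, not the real production value.
bind = "0.0.0.0:8000"   # placeholder bind address
workers = 2             # increased workers to 2
timeout = 120           # increased timeout to 120s (default is 30s)
loglevel = "debug"      # turned on Gunicorn debug logging
# Sentry is assumed to be wired in from the app side (sentry_sdk.init(...) at
# startup), so nothing Sentry-specific is needed in this file.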
GitHub
gunicorn socket Ignoring EPIPE · Issue #1695 · benoitc/gunicorn
gunicorn == 0.17.4 Django == 1.5.1 nginx version: openresty/1.9.7.3 If I use "proxy_set_header Host $http_host;", django will raise SuspiciousOperation. So I add "proxy_set_header Ho...
Percy
Percy7mo ago
Project ID: 5d77fb37-1fba-4a55-bbd3-e0fab8243723
Ryan!
Ryan!OP7mo ago
5d77fb37-1fba-4a55-bbd3-e0fab8243723

Reverse chronological posts with related issues

4/9/2024 Worker timeout
- https://discord.com/channels/713503345364697088/1227240036169416765/1227240036169416765
- Person didn't follow up, no resolution

3/12/2024 [CRITICAL] WORKER TIMEOUT (pid:27)
- https://discord.com/channels/713503345364697088/1217311105173426217/1217311105173426217
- Customer complains of infrequent timeouts
- Brody: none of those are error messages, please add error handling and verbose debug logging to your application code so that you can track any issues or anomalies; without such logs there is unfortunately not much I can help with here

2/5/2024 Django with Gunicorn gives frequent [CRITICAL] WORKER TIMEOUTs on simple requests
- https://discord.com/channels/713503345364697088/1204045940286160967/1204045940286160967
- Intermittent issues
- Brody: one customer had bad db credentials
- Problem with Whitenoise static files; one person moved to Google Cloud static file hosting

12/30/2023 [CRITICAL] WORKER TIMEOUT (pid:13) error
- https://discord.com/channels/713503345364697088/1190838282834153632/1190838282834153632
- Increasing timeouts and threads seemed to solve it

10/25/2023 Sudden gunicorn errors. App unusable.
- https://discord.com/channels/713503345364697088/1166657973716783165/1166657973716783165
- Increased workers and threads: --workers=4 --threads=4

5/17/2023 [CRITICAL] WORKER TIMEOUT (pid:155)
- https://discord.com/channels/713503345364697088/1108320712239886357/1108320712239886357
- Longer timeout for api calls to OpenAPI

5/2/2023 Flask application contains deploy error WORKER TIMEOUT
- https://discord.com/channels/713503345364697088/1102971338387378227/1102971338387378227
- Increase timeout for slow-loading app during deploy

4/10/2023 Increasing Worker Timeout
- https://discord.com/channels/713503345364697088/1095208516995792987/1095208516995792987
- Brody suggests setting PYTHONUNBUFFERED=1 in the variables pane

3/23/2023 CRITICAL WORKER TIMEOUT ISSUE
- https://discord.com/channels/713503345364697088/1088693739163234424/1088693739163234424

12/20/2022 worker timeout
- https://discord.com/channels/713503345364697088/1054724293512151090/1054724293512151090
- Increased timeout in gunicorn, didn't help; thought it was a memory kill error

^ These are the messages I read and summarized before starting this chat.

Why this problem is hard
Troubleshooting intermittent problems is always hard, especially when they're hard to reproduce. When running our e2e tests this morning, I was able to get the app to hang several times; when trying again right now, it ran smoothly.

Here's a recent stack trace showing where this is happening in gunicorn.
[2024-05-21 20:57:59 +0000] [101] [INFO] Booting worker with pid: 101
[2024-05-21 21:00:00 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-05-21 21:00:00 +0000] [101] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-05-21 21:00:00 +0000] [101] [INFO] Worker exiting (pid: 101)
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
[2024-05-21 21:00:01 +0000] [103] [INFO] Booting worker with pid: 103
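Reading the trace: the sync worker is blocked in self.sock.recv(), waiting for the first bytes of a request; nothing arrives before the timeout, so the arbiter aborts the worker and handle_abort raises SystemExit. If the hunch about a stalled connection from the proxy is right, the same failure should be reproducible against a local gunicorn sync worker with a short --timeout using a sketch like this (host and port are placeholders, not our setup):

# reproduce_hang.py -- open a connection and never send a request, mimicking a
# client/proxy hop that connects and then stalls. Host/port are placeholders.
import socket
import time

conn = socket.create_connection(("127.0.0.1", 8000))
try:
    # Send nothing; a sync worker blocks in sock.recv() until the gunicorn
    # --timeout expires and the arbiter logs WORKER TIMEOUT / "no URI read".
    time.sleep(180)
finally:
    conn.close()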
Ryan!
Ryan!OP7mo ago
Worker timeout error in Sentry. First seen 7 months ago.
Ryan!
Ryan!OP7mo ago
(also if there's a better venue or way to ask about this problem, please let me know)
Brody
Brody7mo ago
first off, a lot of those threads you linked were just user errors, but they didn't have any logging of any kind so they had a very hard time narrowing down the issue. second, I really do appreciate all the work you've put into this issue so far! something quick I want you to test: turn on the new edge proxy in your service settings, wait 10 minutes, and try your tests again
Ryan!
Ryan!OP7mo ago
Hi Brody. I saw you on a lot of those threads and saw you asking for more logging and more details. I tried to come prepared when asking for help.
Brody
Brody7mo ago
and you sure did!
Ryan!
Ryan!OP7mo ago
For the new edge proxy, would you do all envs or just some? We have about ~5 open PR envs, a pre-production env, and a production env. I'm definitely down to try new networking because this feels like a hangup at the network layer.
Brody
Brody7mo ago
I guess I'd enable it on the pre-prod? but it's your call
Ryan!
Ryan!OP7mo ago
That works for me. I definitely like testing outside of prod before flipping switches.
Brody
Brody7mo ago
it's also much easier to make tweaks to the new proxy! so it's beneficial for you to be on that instead of envoy
Ryan!
Ryan!OP7mo ago
Ok, do you need anything from me? Otherwise I'd say please proceed with moving the non-production environments to the new edge proxy.
Brody
Brody7mo ago
oh you don't need me to do anything, go into your service settings and enable the new proxy
Ryan!
Ryan!OP7mo ago
Not every EPIPE has a corresponding WORKER TIMEOUT, but here's a pretty clear one from our e2e tests in a PR environment
Brody
Brody7mo ago
also, I don't work for Railway haha
Ryan!
Ryan!OP7mo ago
Glad I asked. Wait, how have they not hired you?
Brody
Brody7mo ago
not qualified
Ryan!
Ryan!OP7mo ago
Hero status isn't enough?
Brody
Brody7mo ago
lmao
Ryan!
Ryan!OP7mo ago
Project > Settings > Edge Proxy
Brody
Brody7mo ago
that's it!
Ryan!
Ryan!OP7mo ago
(posting for posterity)
Brody
Brody7mo ago
I'd say you should wait a safe 10 minutes before you are switched over, DNS updates take a bit to propagate. So that you know you are on the new proxy, you can inspect the Server header in the response headers of a request: railway-edge -> edge proxy, railway -> envoy (old proxy)
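For example, a minimal check of the Server header could look like this (the URL is a placeholder):

# check_proxy.py -- print the Server response header; "railway-edge" means the
# new edge proxy, "railway" means envoy (old proxy). The URL is a placeholder.
import urllib.request

req = urllib.request.Request("https://your-app.up.railway.app", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Server"))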
Ryan!
Ryan!OP7mo ago
Gotta love the DNS magic. I'll get a cup of coffee (and send you a 🚅 ☕)
Brody
Brody7mo ago
thank you very much! ❤️
Ryan!
Ryan!OP7mo ago
I'm excited about the edge proxy. While waiting for DNS, if it doesn't make this go away, what are our next options? You mentioned something about it being more configurable?
it's also much easier to make tweaks to the new proxy
Brody
Brody7mo ago
possibly bring mig in, the person who built the new proxy
hey Ryan, did the new proxy solve or change anything?
Ryan!
Ryan!OP7mo ago
So far so good. Cypress has been running faster than ever and not getting hung up like it used to nearly every run. I'm about ready to flip on the new proxy in production.
Brody
Brody7mo ago
that's awesome news
Ryan!
Ryan!OP7mo ago
Yeah, this has been bugging me for a while. I've even been looking at my home network configs, thinking it was something with my complicated IoT setup.
Brody
Brody7mo ago
it's always the worst when you start doubting your own hardware