Railway7mo ago
Ryan!

Intermittent WORKER TIMEOUTS -- appear like a client hanging up

(Sorry for the long post -- I did a lot of digging and wanted to show my work)

Attached image shows a large number of [DEBUG] Ignoring EPIPE captured while running e2e tests (filtered for just EPIPE messages)

Observations
- We have a long-running but intermittent issue where our app times out waiting on the client.
- My hunch is that the network connection from the Railway proxy to our app hangs, the app patiently waits until the timeout, then aborts with the error [101] [ERROR] Error handling request (no URI read)
- We can trigger timeouts when running Cypress end-to-end tests. Hypothesis is that issuing a lot of requests quickly makes the problem more visible.
- The errors happen on different requests; the workers don't appear to be spinning doing work. They appear to be waiting.
- This happens in all of our Railway environments. It has never happened in local development (not running gunicorn).
- New error from Gunicorn with debug logging on: Ignoring EPIPE
- Found a gunicorn + nginx setting, proxy_ignore_client_abort, while reading about this error
- https://github.com/benoitc/gunicorn/issues/1695 - discussion of Nginx and Gunicorn

Around the same time, we see an occasional Postgres log too:
2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer

What steps have we taken
- Added Sentry to Gunicorn to capture errors
- Increased timeout to 120s
- Increased workers to 2
- Turned on Gunicorn debug logging
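For reference, the mitigations above map roughly onto a Gunicorn config like the sketch below. It is only a sketch: the bind address and the Sentry comment are assumptions, not our exact start command.

# gunicorn.conf.py -- sketch of the settings described above; the bind address
# is a placeholder, not the real production value.
bind = "0.0.0.0:8000"   # placeholder bind address
workers = 2             # increased workers to 2
timeout = 120           # increased timeout to 120s (default is 30s)
loglevel = "debug"      # turned on Gunicorn debug logging
# Sentry is assumed to be wired in from the app side (sentry_sdk.init(...) at
# startup), so nothing Sentry-specific is needed in this file.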
GitHub
gunicorn socket Ignoring EPIPE · Issue #1695 · benoitc/gunicorn
gunicorn == 0.17.4 Django == 1.5.1 nginx version: openresty/1.9.7.3 If I use "proxy_set_header Host $http_host;", django will raise SuspiciousOperation. So I add "proxy_set_header Ho...
Percy
Percy7mo ago
Project ID: 5d77fb37-1fba-4a55-bbd3-e0fab8243723
Ryan!
Ryan!OP7mo ago
5d77fb37-1fba-4a55-bbd3-e0fab8243723

Reverse chronological posts with related issues

4/9/2024 Worker timeout
- https://discord.com/channels/713503345364697088/1227240036169416765/1227240036169416765
- Person didn't follow up, no resolution

3/12/2024 [CRITICAL] WORKER TIMEOUT (pid:27)
- https://discord.com/channels/713503345364697088/1217311105173426217/1217311105173426217
- Customer complains of infrequent timeouts
- Brody: none of those are error messages, please add error handling and verbose debug logging to your application code so that you can track any issues or anomalies; without such logs there is unfortunately not much I can help with here

2/5/2024 Django with Gunicorn gives frequent [CRITICAL] WORKER TIMEOUTs on simple requests
- https://discord.com/channels/713503345364697088/1204045940286160967/1204045940286160967
- Intermittent issues
- Brody: one customer had bad db credentials
- Problem with Whitenoise static files; one person moved to Google Cloud static file hosting

12/30/2023 [CRITICAL] WORKER TIMEOUT (pid:13) error
- https://discord.com/channels/713503345364697088/1190838282834153632/1190838282834153632
- Increasing timeouts and threads seemed to solve it

10/25/2023 Sudden gunicorn errors. App unusable.
- https://discord.com/channels/713503345364697088/1166657973716783165/1166657973716783165
- Increased workers and threads: --workers=4 --threads=4

5/17/2023 [CRITICAL] WORKER TIMEOUT (pid:155)
- https://discord.com/channels/713503345364697088/1108320712239886357/1108320712239886357
- Longer timeout for api calls to OpenAPI

5/2/2023 Flask application contains deploy error WORKER TIMEOUT
- https://discord.com/channels/713503345364697088/1102971338387378227/1102971338387378227
- Increase timeout for slow-loading app during deploy

4/10/2023 Increasing Worker Timeout
- https://discord.com/channels/713503345364697088/1095208516995792987/1095208516995792987
- Brody suggests setting PYTHONUNBUFFERED=1 in the variables pane

3/23/2023 CRITICAL WORKER TIMEOUT ISSUE
- https://discord.com/channels/713503345364697088/1088693739163234424/1088693739163234424

12/20/2022 worker timeout
- https://discord.com/channels/713503345364697088/1054724293512151090/1054724293512151090
- Increased timeout in gunicorn, didn't help; thought it was a memory kill error

^ These are the messages I read and summarized before starting this chat.

Why this problem is hard
Troubleshooting intermittent problems is always hard, especially when they're hard to reproduce. When running our e2e tests this morning, I was able to get the app to hang several times; when trying again right now, it ran smoothly.

Here's a recent stack trace showing where this is happening in gunicorn.
[2024-05-21 20:57:59 +0000] [101] [INFO] Booting worker with pid: 101
[2024-05-21 21:00:00 +0000] [22] [CRITICAL] WORKER TIMEOUT (pid:101)
[2024-05-21 21:00:00 +0000] [101] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-05-21 21:00:00 +0000] [101] [INFO] Worker exiting (pid: 101)
Sentry is attempting to send 2 pending events
Waiting up to 2 seconds
Press Ctrl-C to quit
[2024-05-21 21:00:01 +0000] [103] [INFO] Booting worker with pid: 103
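Reading the trace: the sync worker is blocked in self.sock.recv(), waiting for the first bytes of a request; nothing arrives before the timeout, so the arbiter aborts the worker and handle_abort raises SystemExit. If the hunch about a stalled connection from the proxy is right, the same failure should be reproducible against a local gunicorn sync worker with a short --timeout using a sketch like this (host and port are placeholders, not our setup):

# reproduce_hang.py -- open a connection and never send a request, mimicking a
# client/proxy hop that connects and then stalls. Host/port are placeholders.
import socket
import time

conn = socket.create_connection(("127.0.0.1", 8000))
try:
    # Send nothing; a sync worker blocks in sock.recv() until the gunicorn
    # --timeout expires and the arbiter logs WORKER TIMEOUT / "no URI read".
    time.sleep(180)
finally:
    conn.close()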
Ryan!
Ryan!OP7mo ago
Worker timeout error in Sentry. First seen 7 months ago.
Ryan!
Ryan!OP7mo ago
(also if there's a better venue or way to ask about this problem, please let me know)
Brody
Brody7mo ago
first off, a lot of those threads you linked were just user errors, but they didn't have any logging of any kind so they had a very hard time narrowing down the issue. second, I really do appreciate all the work you've put into this issue so far! something quick I want you to test: turn on the new edge proxy in your service settings, wait 10 minutes, and try your tests again
Ryan!
Ryan!OP7mo ago
Hi Brody. I saw you on a lot of those threads and saw you asking for more logging and more details. I tried to come prepared when asking for help.
Brody
Brody7mo ago
and you sure did!
Ryan!
Ryan!OP7mo ago
For the new edge proxy, would you do all envs or just some? We have about ~5 open PR envs, a pre-production env, and a production env. I'm definitely down to try new networking because this feels like a hangup at the network layer.
Brody
Brody7mo ago
I guess I'd enable it on the pre-prod? but it's your call
Ryan!
Ryan!OP7mo ago
That works for me. I definitely like testing outside of prod before flipping switches.
Brody
Brody7mo ago
it's also much easier to make tweaks to the new proxy! so it's beneficial for you to be on that instead of envoy
Ryan!
Ryan!OP7mo ago
Ok, do you need anything from me? Otherwise I'd say please proceed with moving the non-production environments to the new edge proxy.
Brody
Brody7mo ago
oh you don't need me to do anything, go into your service settings and enable the new proxy
Ryan!
Ryan!OP7mo ago
Not every EPIPE has a corresponding WORKER TIMEOUT, but here's a pretty clear one from our e2e tests in a PR environment
Brody
Brody7mo ago
also, I don't work for Railway haha
Ryan!
Ryan!OP7mo ago
Glad I asked. Wait, how have they not hired you?
Brody
Brody7mo ago
not qualified
Ryan!
Ryan!OP7mo ago
Hero status isn't enough?
Brody
Brody7mo ago
lmao
Ryan!
Ryan!OP7mo ago
Project > Settings > Edge Proxy
Brody
Brody7mo ago
that's it!
Ryan!
Ryan!OP7mo ago
(posting for posterity)
Brody
Brody7mo ago
I'd say you should wait a safe 10 minutes before you are switched over, DNS updates take a bit to propagate. So that you know you are on the new proxy, you can inspect the Server header in the response headers of a request: railway-edge -> edge proxy, railway -> envoy (old proxy)
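For example, a minimal check of the Server header could look like this (the URL is a placeholder):

# check_proxy.py -- print the Server response header; "railway-edge" means the
# new edge proxy, "railway" means envoy (old proxy). The URL is a placeholder.
import urllib.request

req = urllib.request.Request("https://your-app.up.railway.app", method="HEAD")
with urllib.request.urlopen(req) as resp:
    print(resp.headers.get("Server"))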
Ryan!
Ryan!OP7mo ago
Gotta love the DNS magic. I'll get a cup of coffee (and send you a 🚅 ☕)
Brody
Brody7mo ago
thank you very much! ❤️
Ryan!
Ryan!OP7mo ago
I'm excited about the edge proxy. While waiting for DNS, if it doesn't make this go away, what are our next options? You mentioned something about it being more configurable?
it's also much easier to make tweaks to the new proxy
Brody
Brody7mo ago
possibly bring mig in, the person who built the new proxy
hey Ryan, did the new proxy solve or change anything?
Ryan!
Ryan!OP7mo ago
So far so good. Cypress has been running faster than ever and not getting hung up like it used to nearly every run. I'm about ready to flip on the new proxy in production.
Brody
Brody7mo ago
that's awesome news
Ryan!
Ryan!OP7mo ago
Yeah, this has been bugging me for a while. I've even been looking at my home network configs, thinking it was something with my complicated IoT setup.
Brody
Brody7mo ago
it's always the worst when you start doubting your own hardware