(Sorry for the long post -- I did a lot of digging and wanted to show my work)
The attached image shows a large number of [DEBUG] Ignoring EPIPE messages captured while running e2e tests (filtered for just EPIPE messages)
Observations
- We have a long-running but intermittent issue where our app times out waiting on the client.
- My hunch is that the network connection from the Railway proxy to our app hangs, the app patiently waits until the timeout, then aborts with the error
[101] [ERROR] Error handling request (no URI read)
- We can trigger timeouts when running Cypress end-to-end tests. The hypothesis is that issuing a lot of requests quickly makes the problem more visible.
- The errors happen on different requests. The requests don’t appear to be spinning doing work; they appear to be waiting.
- This happens in all of our Railway environments. It has never happened in local development (where we are not running Gunicorn).
- New error from Gunicorn with debug logging on
Ignoring EPIPE
- Found a gunicorn + nginx setting while reading about this error
proxy_ignore_client_abort
- https://github.com/benoitc/gunicorn/issues/1695 - discussion of Nginx and Gunicorn
Around the same time, we also see an occasional Postgres log:
2024-05-22 18:55:27.971 UTC [402] LOG: could not receive data from client: Connection reset by peer
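For anyone unfamiliar with EPIPE: as far as I understand it, it is the errno a process gets when it writes to a socket whose peer has already closed. Below is a minimal sketch of that general mechanism using only the Python standard library. It does not involve Gunicorn or the Railway proxy; the function name and the socket pair standing in for "proxy and app" are my own illustration, and the exact errno may differ by platform.

```python
import errno
import socket


def write_after_peer_close() -> int:
    """Return the errno raised when sending to a peer that has already closed.

    Hypothetical helper for illustration only; returns 0 if the send succeeds.
    """
    # A connected pair of stream sockets stands in for client <-> app.
    app_side, client_side = socket.socketpair()
    try:
        # The "client" goes away before the response is written.
        client_side.close()
        try:
            app_side.sendall(b"HTTP/1.1 200 OK\r\n\r\n")
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        app_side.close()


if __name__ == "__main__":
    code = write_after_peer_close()
    # Print the symbolic errno name if known, otherwise the raw number.
    print(errno.errorcode.get(code, code))
```

If that reading is right, the debug messages would mean something upstream is closing connections before the app has finished responding, which would fit the hunch about the proxy connection.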
What steps have we taken
- Added Sentry to Gunicorn to capture errors
- Increased timeout to 120s
- Increased workers to 2
- Turned on Gunicorn debug logging
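For reference, the Gunicorn changes above would look roughly like this as a config file. This is a sketch, not our exact setup: the file name is an assumption, and the Sentry initialisation is left out because it depends on how the SDK is wired into the app.

```python
# gunicorn.conf.py (assumed file name; Gunicorn config files are plain Python)

# Number of worker processes (increased to 2).
workers = 2

# Seconds a worker may stay silent before it is killed and restarted
# (increased to 120).
timeout = 120

# Verbose logging, which is what surfaced the "Ignoring EPIPE" messages.
loglevel = "debug"
```

It would be loaded with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the real module and application object.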