Server becomes unresponsive
Can someone please help me figure out what happened on Saturday? Our production site was inaccessible during a live event. We could reach the front end, but the data wasn't loading. We are using Express.js on Node with a MySQL database.
Both the back-end and front-end servers are hosted on Railway.app. The problem was fixed by redeploying the backend server. It has happened again several times over the past few days, and every time, redeploying the backend server fixes it.
Clues:
During each outage, we have been able to run queries directly against the production database with MySQL Workbench, and they work normally.
The SQL command:
SHOW VARIABLES LIKE "max_connections";
returned 151.
The SQL command:
SHOW VARIABLES LIKE "max_used_connections";
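returned nothing. (Max_used_connections is a server status variable, not a system variable, so it is reported by SHOW GLOBAL STATUS LIKE 'Max_used_connections'; rather than by SHOW VARIABLES.)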
I do not think it is related to the database.
We have a cron job (node-cron) that runs every minute; it stops running during the outages. It has console.log calls at the start and end, so we know when each run begins and ends, and every run finishes.
I do not think it is related to the cron job.
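node-cron runs inside the same Node process as Express, so a blocked event loop would also stop it from firing. One way to check whether the loop itself stalls during an outage is to sample event-loop delay with Node's built-in perf_hooks module. A minimal sketch; startEventLoopMonitor and its thresholds are hypothetical choices, not part of the original setup:

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Hypothetical helper: samples event-loop delay and warns when the loop was
// blocked for longer than `warnAboveMs` during the last reporting window.
function startEventLoopMonitor({ resolution = 20, reportEveryMs = 10000, warnAboveMs = 200, log = console.warn } = {}) {
  // `resolution` is the sampling interval in milliseconds.
  const histogram = monitorEventLoopDelay({ resolution });
  histogram.enable();

  const timer = setInterval(() => {
    // Histogram values are in nanoseconds; convert to milliseconds.
    const maxMs = histogram.max / 1e6;
    const p99Ms = histogram.percentile(99) / 1e6;
    if (maxMs > warnAboveMs) {
      log(`event loop delay high: max=${maxMs.toFixed(1)}ms p99=${p99Ms.toFixed(1)}ms`);
    }
    histogram.reset(); // start a fresh window
  }, reportEveryMs);
  timer.unref(); // monitoring alone should not keep the process alive

  return function stop() {
    clearInterval(timer);
    histogram.disable();
  };
}
```

Calling startEventLoopMonitor() once at server startup would log a warning for any window in which the loop was blocked. If outages show no such warnings, the requests are more likely waiting on something (a connection, a lock, an external call) than starving the loop.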
We use the Morgan logger, which logs each route that gets hit. Going through the logs, we can see when an outage begins because the status and response time stop appearing.
Our Morgan format is:
:method :url :status :response-time ms - :res[content-length]
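To see a backlog directly rather than inferring it from missing status codes, the app could count requests that are still open and warn about any that run long. A minimal sketch, assuming standard Express middleware; inFlightWatchdog and its slowMs threshold are hypothetical names, not part of Morgan:

```javascript
// Hypothetical middleware: tracks how many requests are open and warns about
// any request that has not finished within `slowMs`.
function inFlightWatchdog({ slowMs = 5000, log = console.warn } = {}) {
  let inFlight = 0;

  function middleware(req, res, next) {
    inFlight += 1;
    const started = Date.now();
    const timer = setTimeout(() => {
      log(`still waiting after ${Date.now() - started}ms: ${req.method} ${req.originalUrl || req.url} (in flight: ${inFlight})`);
    }, slowMs);

    let done = false;
    const finish = () => {
      if (done) return; // 'finish' and 'close' can both fire for one response
      done = true;
      clearTimeout(timer);
      inFlight -= 1;
    };
    res.on('finish', finish); // response fully sent
    res.on('close', finish);  // connection closed, possibly before any response
    next();
  }

  middleware.inFlight = () => inFlight;
  return middleware;
}
```

Registered with app.use(inFlightWatchdog()), it would log every request still waiting after five seconds together with the number of open requests, so a steadily rising count would confirm the backlog.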
Sample responses before an outage begins:
GET /api/bouts/allBoutsForAnEvent/361 200 2.567 ms - 2
GET /api/registrations/getAllRegisteredCompetitorsForSingleEvent/361 200 6.167 ms - 34436
OPTIONS /api/users/getUserRoleFromToken 200 0.129 ms - 0
OPTIONS /api/wrestlers/getWrestlerRankings/62877 200 0.162 ms - 0
GET /api/wrestlers/getWrestlerProfile/62877 200 4.315 ms - 323
GET /api/wrestlers/getWrestlerWeighInForProfile/62877 200 4.828 ms - 155
GET /api/WARZone/getWrestlingStyles 200 5.608 ms - 241
Sample responses during an outage:
GET /api/events/getEventForCustomerView/367 - - ms - -
GET /api/events/getEventForCustomerView/361 - - ms - -
PUT /api/events/allEventsForCustomerHomePage - - ms - -
GET /api/events/getEventForCustomerView/366 - - ms - -
OPTIONS /api/users/getSingleUser 200 1.465 ms - 0
OPTIONS /api/events/allEventsForCustomerHomePage 200 0.087 ms - 0
GET /api/production/backgroundColorForDevelopmentOrTesting 200 0.192 ms - -
POST /api/users/getSingleUser 500 0.534 ms - 148
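Morgan logs a request only when its response finishes or its connection closes, and it prints - for :status and :response-time when no response headers were ever sent, so lines like the ones above typically mark requests that were abandoned while still waiting. A small script can pull those out of a saved log. A minimal sketch, assuming the log is plain text in the format above; parseMorganLine and findUnansweredRequests are hypothetical helpers:

```javascript
// Matches the Morgan format used above:
// :method :url :status :response-time ms - :res[content-length]
const LINE = /^(\S+) (\S+) (\S+) (\S+) ms - (\S+)$/;

function parseMorganLine(line) {
  const m = LINE.exec(line.trim());
  if (!m) return null; // not a request line (e.g. a console.log from the cron job)
  const [, method, url, status, responseTime, contentLength] = m;
  // Morgan writes "-" for tokens that had no value; map those to null.
  return {
    method,
    url,
    status: status === '-' ? null : Number(status),
    responseTimeMs: responseTime === '-' ? null : Number(responseTime),
    contentLength: contentLength === '-' ? null : Number(contentLength),
  };
}

// Requests with neither a status nor a response time typically never got a
// response before their connection closed.
function findUnansweredRequests(lines) {
  return lines
    .map(parseMorganLine)
    .filter((entry) => entry && entry.status === null && entry.responseTimeMs === null);
}
```

Because it relies only on the line format, the first match in a saved log marks roughly where an outage starts.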
During an outage I hit the API with Postman, and the request fails with “stream timeout” after about 5 minutes. There is a long delay between sending the Postman request and Morgan logging it, because many other requests keep coming in. I shut down the front-end server (so that no additional requests could come from anywhere other than Postman), but requests kept being logged for several minutes. This leads me to believe a backlog of requests is building up.
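A request that only fails after about 5 minutes suggests nothing in the app gives up on a hung request, so every stuck request stays open and adds to the backlog. A minimal sketch of a per-request deadline, assuming plain Express middleware and Node's standard response API; requestTimeout and the 15-second default are hypothetical choices:

```javascript
// Hypothetical middleware: if a handler has not started responding within `ms`,
// answer 503 so the client is not left waiting for minutes.
function requestTimeout(ms = 15000) {
  return function (req, res, next) {
    const timer = setTimeout(() => {
      if (res.headersSent) return; // a response is already on its way
      res.statusCode = 503;
      res.setHeader('Content-Type', 'application/json');
      res.end(JSON.stringify({ error: 'Request timed out' }));
    }, ms);
    // Stop the timer once the response is sent or the connection goes away.
    const clear = () => clearTimeout(timer);
    res.on('finish', clear);
    res.on('close', clear);
    next();
  };
}
```

This does not cancel whatever the handler is still waiting on; if that handler later tries to respond, the attempt fails with ERR_HTTP_HEADERS_SENT unless it checks res.headersSent first. What it does is stop clients from waiting minutes and make the stuck routes show up as 503s in the Morgan log.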
My project ID is: 1fc9ac5f-29f8-4b23-ab28-1eebdaeb8c06
At this point I believe we are being DDoSed: we're getting an unusually large number of requests to /auth/login.
I don't know what to do. Does Railway provide any DDoS protection?
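Whatever the answer on the platform side, the app can at least refuse to do expensive work for a single client that hammers /auth/login. A minimal sketch of a per-IP fixed-window limiter, assuming Express; loginRateLimit and its limits (10 attempts per minute) are hypothetical, and an in-process limiter is no substitute for filtering traffic before it reaches the app:

```javascript
// Hypothetical fixed-window rate limiter for the login route, keyed by client IP.
// In-memory and per-process: counts reset on every redeploy, so this is a stopgap.
function loginRateLimit({ windowMs = 60000, max = 10, maxTracked = 10000, now = Date.now } = {}) {
  const hits = new Map(); // ip -> { count, windowStart }

  return function (req, res, next) {
    const t = now();
    // req.ip is set by Express; fall back to the raw socket address otherwise.
    const ip = req.ip || (req.socket && req.socket.remoteAddress) || 'unknown';

    // Once the map gets large, drop entries whose window has expired
    // so stale IPs do not accumulate forever.
    if (hits.size > maxTracked) {
      for (const [key, entry] of hits) {
        if (t - entry.windowStart >= windowMs) hits.delete(key);
      }
    }

    let entry = hits.get(ip);
    if (!entry || t - entry.windowStart >= windowMs) {
      entry = { count: 0, windowStart: t }; // start a new window for this IP
      hits.set(ip, entry);
    }
    entry.count += 1;

    if (entry.count > max) {
      const retryAfterSec = Math.ceil((entry.windowStart + windowMs - t) / 1000);
      res.statusCode = 429; // Too Many Requests
      res.setHeader('Retry-After', String(retryAfterSec));
      res.end('Too many login attempts, try again later');
      return;
    }
    next();
  };
}
```

It would be mounted only on the login route, e.g. app.post('/auth/login', loginRateLimit(), loginHandler), where loginHandler stands in for the existing handler. Behind Railway's proxy, req.ip is likely the proxy's address unless Express's trust proxy setting is configured, in which case every client would share a single counter.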
Railway has not had any reported networking or host outages, so I'm sorry to say it, but at this point it looks a lot like your app is soft locking. That would be a problem with your code, not the platform. If you are being DDoSed in some capacity and your app can't handle the requests, you will want to use Cloudflare, maybe with some stricter blocking rules in place, since Railway does not provide any DDoS protection.
OK, thank you, Brody.
Sorry we can't be of much more help than that.