Railway 15mo ago
SP

Services weren't reachable when the deployment failed

(id: 4b49be5c-8a2a-44f2-90e8-28a9de6c457f) Yesterday, we noticed that some of our services weren't reachable when the deployment failed. Doesn't Railway make sure that the old container is killed only after the new one is live? Also, the service was returning gateway timeouts for a few minutes after the deployment succeeded. We are using Railway in production and this really affects our users. Please look into this issue.
22 Replies
Percy 15mo ago
Project ID: 4b49be5c-8a2a-44f2-90e8-28a9de6c457f
Brody 15mo ago
for zero downtime deployments you would want to implement a healthcheck in your app that returns a 200 status code once it has determined that it is healthy https://docs.railway.app/deploy/healthchecks
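As a rough illustration of the healthcheck idea above, here is a minimal Python sketch using only the standard library. The `/health` path matches the one in this thread's healthcheck log; the `ready` flag, `health_response`, `make_server`, and the default host and port are hypothetical names and values chosen for the example, not anything Railway prescribes.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Set this once startup work (migrations, cache warm-up, ...) has finished.
ready = threading.Event()

def health_response() -> tuple[int, bytes]:
    """Return 200 only after the app has marked itself as ready, else 503."""
    if ready.is_set():
        return 200, b"ok"
    return 503, b"starting"

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status, body = health_response()
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep healthcheck polling out of the request log

def make_server(host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    # Host and port are placeholders; a deployed app would bind to whatever
    # address and port its platform expects.
    return ThreadingHTTPServer((host, port), HealthHandler)
```

The app would call `ready.set()` at the end of its startup sequence and run `make_server(...).serve_forever()`; until then the endpoint answers 503, so a broken startup script never reports healthy.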
SP (OP) 15mo ago
Thanks for the response, Brody. We implemented that a long time ago, and the healthcheck is in place.
Brody 15mo ago
do you have a volume on your service?
SP (OP) 15mo ago
We use a volume in only one service. I understand that a volume currently results in a small downtime. This is OK for us for now. But none of the other services has a volume.
Brody 15mo ago
so how long after the deployment did the deployment fail?
SP (OP) 15mo ago
The deployment failed because one of the scripts was broken. Our expectation was that the old container would stay up if the deployment failed. Later, after 10-10:30 CET, we found the gateway timeout errors. I hope you have access to some logs to see what went wrong.
Brody 15mo ago
railway will do the healthcheck, once that's successful it will switch over to the new deployment and remove the old deployment. if your application crashes after that then you will see downtime. crashes happen, so the best thing you can do is make sure your code fully exits as fast as possible with an error code so that railway can restart it. there is also the option of running replicas so that in the event your app locks up railway will route traffic to the other replica
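The "exit as fast as possible with an error code" advice could look roughly like the Python sketch below. `run_service` and `start` are hypothetical names, and the sketch assumes, as described in this thread, that the platform restarts a process that exits with a non-zero code.

```python
import logging
from typing import Callable

def run_service(start: Callable[[], None]) -> int:
    """Run `start` and map the outcome to a process exit code."""
    try:
        start()
    except Exception:
        # Log the traceback, then report failure instead of limping on
        # in a half-broken state.
        logging.exception("fatal error, exiting")
        return 1
    return 0
```

An entry point would then end with `sys.exit(run_service(main))`, where `main` stands for the app's own startup function.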
SP (OP) 15mo ago
Please understand that we have healthchecks and the deployment failed. This means the new service didn't start. The healthchecks failed with the message: 1/1 replicas never became healthy! Healthcheck failed!
Brody 15mo ago
did the build fail, or did the build succeed and the healthcheck fail? what region do you have your app deployed to?
SP (OP) 15mo ago
The UI was not able to access this service. My expectation was that the old (healthy) deployment would still be available, but that was not the case.
The US-West region (I think that is the default one).
Brody 15mo ago
^
SP (OP) 15mo ago
"did the build fail, or did the build succeed and the healthcheck fail?"
Build was successful, healthcheck failed.
Path: /health
Retry window: 5m0s
Attempt #1 failed with service unavailable. Continuing to retry for 4m59s
Attempt #2 failed with service unavailable. Continuing to retry for 4m58s
Attempt #3 failed with service unavailable. Continuing to retry for 4m56s
Attempt #4 failed with service unavailable. Continuing to retry for 4m52s
Attempt #5 failed with service unavailable. Continuing to retry for 4m44s
Attempt #6 failed with service unavailable. Continuing to retry for 4m28s
Attempt #7 failed with service unavailable. Continuing to retry for 3m58s
Attempt #8 failed with service unavailable. Continuing to retry for 3m28s
Attempt #9 failed with service unavailable. Continuing to retry for 2m58s
Attempt #10 failed with service unavailable. Continuing to retry for 2m28s
Attempt #11 failed with service unavailable. Continuing to retry for 1m58s
Attempt #12 failed with service unavailable. Continuing to retry for 1m28s
Attempt #13 failed with service unavailable. Continuing to retry for 58s
Attempt #14 failed with service unavailable. Continuing to retry for 28s
1/1 replicas never became healthy!
Healthcheck failed!
This is the complete healthcheck log.
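The remaining-time values in this log are consistent with a retry delay that starts at 1s, doubles after each attempt, and is capped at 30s inside the 5m window. That is an inference from the numbers in the log, not documented Railway behaviour. The sketch below only reproduces the arithmetic; the function and parameter names are made up for illustration.

```python
def remaining_after_attempts(
    window_s: int = 300,      # "Retry window: 5m0s"
    first_elapsed_s: int = 1,  # assumed: attempt #1 reports 4m59s remaining
    first_delay_s: int = 1,    # assumed initial delay between attempts
    max_delay_s: int = 30,     # assumed cap on the delay
) -> list[int]:
    """Seconds left in the window after each failed attempt."""
    remaining = window_s - first_elapsed_s
    delay = first_delay_s
    out = []
    while remaining > 0:
        out.append(remaining)
        remaining -= delay
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap
    return out
```

With the defaults this yields 14 values, from 299 (4m59s) down to 28, matching the 14 attempts in the log; a fifteenth attempt would fall outside the window.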
Brody 15mo ago
then railway would have never swapped it in, your old running deployment would not have been affected
if a deployment passes a health check but then later fails there is no fallback deployment in that scenario
(talking about the deployment that was running before this deploy failed its health check)
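The two scenarios described here can be captured in a toy Python model. This is purely illustrative and says nothing about how Railway is implemented; `ToyService`, `deploy`, and `crash` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToyService:
    active: Optional[str] = None  # deployment currently receiving traffic

    def deploy(self, new_id: str, healthcheck_passed: bool) -> Optional[str]:
        # Swap only after a passing healthcheck; the old deployment is
        # removed at that point. A failed healthcheck changes nothing.
        if healthcheck_passed:
            self.active = new_id
        return self.active

    def crash(self) -> Optional[str]:
        # The active deployment dies. No older deployment is kept around,
        # so nothing serves traffic until it is restarted.
        self.active = None
        return self.active
```

In this model a failed healthcheck leaves the old deployment serving, while a crash after a successful swap leaves nothing to fall back to.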
SP (OP) 15mo ago
Sorry. Let's also not forget that there can be issues in the platform. I posted this message because we faced 503 errors. The error was gone after redeployment. I still got 503 errors for a few more minutes after the successful deployment (healthcheck passed).
Brody 15mo ago
I am definitely taking platform issues into consideration, that's why I asked what region, but there have been no issues with the us-west1 region
SP (OP) 15mo ago
Explaining this again: the old service never crashed. We catch all errors to make sure that a running service won't stop.
Brody 15mo ago
I'm sorry but there were no reported issues with the routing layer for us-west1 during the time of your app's outage, if this happens again please report back
SP (OP) 15mo ago
Thanks for your help
Brody 15mo ago
the past issues have been
- routing layer in us-east1 failing
- builds failing all regions
- dashboard 404
throughout all of this already deployed apps in us-west1 went unaffected
SP (OP) 15mo ago
OK. We did face this issue, but I cannot share more details about my services here. Is there anything I can do to get more information (would emailing the support address help)?
Brody 15mo ago
at this time it looks to me like this is an issue with your app itself, as there were no issues reported with the routing layer for us-west1, I'm sorry I can't be of more help here