Railway•15mo ago

Services weren't reachable when the deployment failed

(id: 4b49be5c-8a2a-44f2-90e8-28a9de6c457f) Yesterday, we noticed that some of our services weren't reachable when a deployment failed. Doesn't Railway make sure that the old container is killed only after the new one is live? Also, the service was returning gateway timeouts for a few minutes once the deployment succeeded. We are using Railway in production and this really affects our users. Please look into this issue.

22 Replies

Percy•15mo ago

Project ID: 4b49be5c-8a2a-44f2-90e8-28a9de6c457f

Brody•15mo ago

for zero-downtime deployments you would want to implement a healthcheck in your app that returns a 200 status code once the app has determined that it is healthy https://docs.railway.app/deploy/healthchecks
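For illustration, a healthcheck endpoint of the kind described here could look roughly like the sketch below. It uses only Python's standard library. The `/health` path, the `ready` flag, and the `make_server` helper are assumptions made for this example, not details of SPOP's services or of Railway's API. The endpoint answers 503 until the app marks itself ready, and 200 afterwards.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: the app sets it once its own startup work is done.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health" and ready.is_set():
            status, body = 200, b"ok"          # healthy: the platform may route traffic
        elif self.path == "/health":
            status, body = 503, b"starting"    # not ready yet: the check should keep retrying
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port=0, host="0.0.0.0"):
    """Build (but do not start) an HTTP server exposing /health. Port 0 picks a free port."""
    return HTTPServer((host, port), HealthHandler)
```

In a real service, `ready.set()` would be called only after startup work has finished, and `make_server(...).serve_forever()` would run on the port the platform expects.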

SPOP•15mo ago

Thanks for the response, Brody. We implemented that long ago, and the healthcheck is in place.

Brody•15mo ago

do you have a volume on your service?

SPOP•15mo ago

We use a volume in only one service. I understand that a volume currently results in a small downtime. This is OK for us for now. But none of the other services has a volume.

Brody•15mo ago

so how long after the deployment did the deployment fail?

SPOP•15mo ago

The deployment failed because one of the scripts was broken. Our expectation was that the old container would stay up in case the deployment fails. Later, after 10-10:30 CET, we found the gateway timeout errors. I hope you have access to some logs to see what went wrong.

Brody•15mo ago

railway will do the healthcheck; once that's successful it will switch over to the new deployment and remove the old deployment. if your application crashes after that, then you will see downtime. crashes happen, so the best thing you can do is make sure your code fully exits as fast as possible with an error code so that railway can restart it. there is also the option of running replicas, so that in the event your app locks up railway will route traffic to the other replica
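The "exit as fast as possible with an error code" advice can be sketched as follows. This is a minimal Python example under stated assumptions: `run_or_exit_code` and `start_app` are hypothetical names, and whether the platform then restarts the process depends on how the service's restart behaviour is configured.

```python
import sys
import traceback


def run_or_exit_code(main):
    """Run main() and return a process exit code: 0 on success, 1 on any unhandled error."""
    try:
        main()
        return 0
    except Exception:
        # Log the failure, then let the process end with a non-zero code
        # instead of swallowing the error and lingering in a broken state.
        traceback.print_exc(file=sys.stderr)
        return 1


# Usage at the entrypoint (start_app is a hypothetical function that runs the service):
#     sys.exit(run_or_exit_code(start_app))
```

The design choice here is to surface fatal errors to the supervisor through the exit code rather than keeping a half-working process alive.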

SPOP•15mo ago

Please understand that we have healthchecks and the deployment failed. This means the new service didn't start. The healthchecks failed with the message: "1/1 replicas never became healthy! Healthcheck failed!"

Brody•15mo ago

did the build fail, or did the build succeed and the healthcheck fail? what region do you have your app deployed to?

SPOP•15mo ago

The UI was not able to access this service. My expectation was that the old (healthy) service deployment would still be available, but that was not the case. The region is US-West (I think that is the default one).

SPOP•15mo ago

"did the build fail, or did the build succeed and the healthcheck fail?"

Build was successful, healthcheck failed.

Path: /health
Retry window: 5m0s
Attempt #1 failed with service unavailable. Continuing to retry for 4m59s
Attempt #2 failed with service unavailable. Continuing to retry for 4m58s
Attempt #3 failed with service unavailable. Continuing to retry for 4m56s
Attempt #4 failed with service unavailable. Continuing to retry for 4m52s
Attempt #5 failed with service unavailable. Continuing to retry for 4m44s
Attempt #6 failed with service unavailable. Continuing to retry for 4m28s
Attempt #7 failed with service unavailable. Continuing to retry for 3m58s
Attempt #8 failed with service unavailable. Continuing to retry for 3m28s
Attempt #9 failed with service unavailable. Continuing to retry for 2m58s
Attempt #10 failed with service unavailable. Continuing to retry for 2m28s
Attempt #11 failed with service unavailable. Continuing to retry for 1m58s
Attempt #12 failed with service unavailable. Continuing to retry for 1m28s
Attempt #13 failed with service unavailable. Continuing to retry for 58s
Attempt #14 failed with service unavailable. Continuing to retry for 28s
1/1 replicas never became healthy! Healthcheck failed!

This is the complete healthcheck log.
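The remaining-time values in this log are consistent with a retry delay that starts at 1s, doubles after each attempt, and is capped at 30s inside the 5-minute window. That schedule is inferred from the log alone and is not taken from Railway's documentation. The sketch below only reproduces the arithmetic, and `remaining_after_each_attempt` is a hypothetical helper name.

```python
def remaining_after_each_attempt(window=300, first_delay=1, cap=30):
    """Return the seconds left in the retry window at each attempt.

    Assumed schedule (inferred from the log, not documented behaviour):
    the first attempt happens after `first_delay` seconds, then the wait
    between attempts doubles each time, capped at `cap` seconds.
    """
    elapsed = first_delay
    delay = first_delay
    remaining = []
    while elapsed < window:
        remaining.append(window - elapsed)
        elapsed += delay
        delay = min(delay * 2, cap)
    return remaining
```

With the defaults this yields 14 attempts, starting at 299 seconds remaining (4m59s) and ending at 28 seconds remaining, which matches the 14 attempts in the log. Under this assumed schedule a 15th attempt would fall outside the window.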

Brody•15mo ago

then railway would have never swapped it in, your old running deployment would not have been affected (talking about the deployment that was running before this deployment failed its health check). if a deployment passes a health check but then later fails, there is no fallback deployment in that scenario

SPOP•15mo ago

Sorry. Let's also not forget that there can be issues in the platform. I posted this message because we faced "503" errors. The error was gone after redeployment. I still got 503 errors for a few more minutes after successfully deploying (healthcheck passed).

Brody•15mo ago

I am definitely taking platform issues into consideration, that's why I asked what region, but there have been no issues with the us-west1 region

SPOP•15mo ago

Explaining this again: the old service never crashed. We catch all the errors to make sure that a running service won't stop.

Brody•15mo ago

I'm sorry but there were no reported issues with the routing layer for us-west1 during the time of your app's outage, if this happens again please report back

SPOP•15mo ago

Thanks for your help

Brody•15mo ago

the past issues have been
- routing layer in us-east1 failing
- builds failing all regions
- dashboard 404
throughout all of this, already deployed apps in us-west1 went unaffected

SPOP•15mo ago

OK. We did face this issue, but I cannot share more details about my services here. Is there anything I can do to get more information (would sending an email to the support address help)?

Brody•15mo ago

at this time it looks to me like this is an issue with your app itself, as there were no issues reported with the routing layer for us-west1, I'm sorry I can't be of more help here
