Railway•10mo ago

Zero Downtime deployments?

Hey there! I'm experiencing constant downtime of ~5-20 seconds for the service every time a new version is deployed. I don't see any logs in observability during the downtime, so it looks like requests are not reaching the containers. I have a health-check endpoint and it feels like it's being ignored. If I deliberately start returning 500 from the health check, Railway shuts down the old deployment and all requests to the app fail, even though the new deployment's health checks never succeeded. Hence the app is just down. Am I missing something in my setup?

9 Replies

Percy•10mo ago

Project ID: 6fa25f10-f91d-41de-83e0-a8a88e5c76b9

sergeyOP•10mo ago

6fa25f10-f91d-41de-83e0-a8a88e5c76b9

Not sure how relevant it is, but between the old app shutting down and the new one starting I see this in the logs:

ELIFECYCLE  Command failed.

> @ start /app

> node .output/server/index.mjs

Listening on http://[::]:7419

ELIFECYCLE  Command failed.

> @ start /app

> node .output/server/index.mjs

Listening on http://[::]:7419

Couldn't figure it out after a few more attempts, but it's a show-stopper for us, will have to migrate off Railway due to that 😦

sergeyOP•10mo ago

I have logs in my health check endpoint, and during the deployment I made requests every second by constantly refreshing the main page. In the logs here you can see a 30s window where the application was down and I saw Railway's "application didn't respond" screen.
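For anyone who wants to measure the same window without refreshing by hand, here is a rough sketch of a probe that requests a URL once per second and summarises the downtime. It is only an illustration: the `PROBE_URL` environment variable, the 2 second timeout, and the function names are assumptions made up for this example, and it needs Node 18+ for the built-in `fetch`.

```javascript
// Hypothetical probe script (not part of the original thread).
// Usage: PROBE_URL=https://your-app.example/ node probe.js   (Ctrl+C to print a summary)

// Collapse samples of the form { at: epochMs, ok: boolean } into downtime windows.
function downtimeWindows(samples) {
  const windows = [];
  let start = null; // timestamp of the first failed sample in the current window
  for (const { at, ok } of samples) {
    if (!ok && start === null) start = at;
    if (ok && start !== null) {
      windows.push({ from: start, to: at, seconds: (at - start) / 1000 });
      start = null;
    }
  }
  // Still down when sampling stopped: report the window as open.
  if (start !== null) {
    const last = samples[samples.length - 1].at;
    windows.push({ from: start, to: last, seconds: (last - start) / 1000, open: true });
  }
  return windows;
}

// One request; any network error, timeout, or non-2xx status counts as "down".
async function probeOnce(url, timeoutMs = 2000) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
    return res.ok;
  } catch {
    return false;
  }
}

async function main(url) {
  const samples = [];
  process.on("SIGINT", () => {
    console.table(downtimeWindows(samples));
    process.exit(0);
  });
  for (;;) {
    const ok = await probeOnce(url);
    samples.push({ at: Date.now(), ok });
    console.log(new Date().toISOString(), ok ? "up" : "DOWN");
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
}

// Only start polling when a target URL is provided.
if (process.env.PROBE_URL) main(process.env.PROBE_URL);
```

Start it right before triggering a deploy and stop it once the new version is serving; the table then shows how long each gap lasted, measured from the first failed request to the next successful one.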

Brody•10mo ago

do you have a volume on your service?

sergeyOP•10mo ago

yep, I have one volume mounted

Brody•10mo ago

https://docs.railway.app/reference/volumes#caveats

To prevent data corruption, we prevent multiple deployments from being active and mounted to the same service. This means that there will be a small amount of downtime when re-deploying a service that has a volume attached

sergeyOP•10mo ago

but it also happened prior to attaching the volume, so it shouldn't be related. I also tried on a separate instance with no volume, same behaviour

This means that there will be a small amount

but nevertheless, small amount is not meant to be 20+ seconds, right?

Brody•10mo ago

20 seconds seems a little much, 10 seconds sounds more like what I've personally experienced, but of course, I'll do some tests with a basic HTTP server + health check and get back to you both

here's my report on a switch-over test without a volume, no issues from my tests - https://discord.com/channels/713503345364697088/1202753029531766864/1203069079364042843

however, running the exact same tests with a volume on the service, I am able to see a Railway error page for ~30-40 seconds, so yeah, there's something going on here, will ask the team!
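For readers who want to repeat this kind of switch-over test, a "basic HTTP server + health check" could look roughly like the sketch below. This is not the server used in the tests above; the `/health` path, the `PORT` environment variable, and the helper name `respond` are assumptions for illustration. The health check logs every hit, so the deploy logs show exactly when the platform probed the new container.

```javascript
// Hypothetical minimal test server (not the one used in the thread).
const http = require("node:http");

// Decide the response for a request path; kept separate so it is easy to check.
function respond(path) {
  if (path === "/health") return { status: 200, body: "ok" };
  return { status: 200, body: "hello from " + process.pid };
}

const server = http.createServer((req, res) => {
  const { status, body } = respond(req.url);
  if (req.url === "/health") {
    // Log every health-check hit with a timestamp for later comparison.
    console.log(new Date().toISOString(), "health check ->", status);
  }
  res.writeHead(status, { "Content-Type": "text/plain" });
  res.end(body);
});

// Assumption: the platform passes the port to listen on via PORT.
// The server only starts when PORT is set.
if (process.env.PORT) {
  server.listen(Number(process.env.PORT), () => {
    console.log("listening on", server.address().port);
  });
}
```

Deploying this once with and once without a volume attached, while polling the public URL, should make it possible to compare the two cases described above.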

sergeyOP•10mo ago

thank you!
