Railway8mo ago
sergey

Zero Downtime deployments?

Hey there! I'm experiencing constant downtime of ~5-20 seconds for the service every time a new version is deployed. I don't see any logs in observability during the downtime, so it looks like requests aren't reaching the containers. I have a health-check endpoint, and it feels like it's being ignored. If I deliberately start returning 500 from the health check, Railway shuts down the old deployment anyway, and all requests to the app fail even though the new deployment's health checks haven't succeeded. Hence the app is just down. Am I missing something in my setup?
9 Replies
Percy8mo ago
Project ID: 6fa25f10-f91d-41de-83e0-a8a88e5c76b9
sergey8mo ago
6fa25f10-f91d-41de-83e0-a8a88e5c76b9
Not sure how relevant it is, but between the old app shutting down and the new one starting I see this in the logs:
ELIFECYCLE  Command failed.
> @ start /app
> node .output/server/index.mjs
Listening on http://[::]:7419
ELIFECYCLE  Command failed.
> @ start /app
> node .output/server/index.mjs
Listening on http://[::]:7419
Couldn't figure it out after a few more attempts, but it's a show-stopper for us; we'll have to migrate off Railway because of it 😦
sergey8mo ago
I have logs in my health-check endpoint, and during the deployment I made requests every second by constantly refreshing the main page. In the logs here you can see a 30s window where the application was down and I saw Railway's "application didn't respond" screen.
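The manual refresh-every-second check can be automated; a quick sketch (assumes Node 18+ for built-in `fetch`; `TARGET` is a placeholder URL, not my real service) that probes the service once per second so a downtime window shows up as a run of errors:

```javascript
// Probe a deployed service once per second and log status + latency.
// During a redeploy, a downtime window appears as consecutive DOWN lines.
// TARGET is a placeholder; point it at your own service URL.
const TARGET = process.env.TARGET || "https://example.up.railway.app/";

async function probe() {
  const started = Date.now();
  try {
    // Abort the request after 5s so a hung connection also counts as DOWN.
    const res = await fetch(TARGET, { signal: AbortSignal.timeout(5000) });
    console.log(`${new Date().toISOString()} ${res.status} ${Date.now() - started}ms`);
  } catch (err) {
    console.log(`${new Date().toISOString()} DOWN (${err.name}) ${Date.now() - started}ms`);
  }
}

const timer = setInterval(probe, 1000);
```

Leaving this running across a deploy gives a timestamped record of exactly how long requests were failing.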
[screenshot attached]
Brody8mo ago
do you have a volume on your service?
sergey8mo ago
yep, I have one volume mounted
Brody8mo ago
https://docs.railway.app/reference/volumes#caveats
To prevent data corruption, we prevent multiple deployments from being active and mounted to the same service. This means that there will be a small amount of downtime when re-deploying a service that has a volume attached
sergey8mo ago
But it also happened prior to attaching the volume, so it shouldn't be related. I also tried on a separate instance with no volume; same behaviour.
This means that there will be a small amount
But nevertheless, a "small amount" isn't meant to be 20+ seconds, right?
Brody8mo ago
20 seconds seems a little much; 10 seconds sounds more like what I've personally experienced. But of course, I'll do some tests with a basic HTTP server + health check and get back to you both.
Here's my report on a switch-over test without a volume: no issues from my tests - https://discord.com/channels/713503345364697088/1202753029531766864/1203069079364042843
However, running the same exact tests but with a volume on the service, I am able to see a Railway error page for ~30-40 seconds. So yeah, there's something going on here; will ask the team!
sergey8mo ago
thank you!