Zero Downtime deployments?
Hey there!
I'm experiencing constant downtime of ~5-20 seconds for the service every time a new version is deployed. I don't see any logs in observability during the downtime, so it looks like requests are not reaching the containers.
I have a health-check endpoint, and it feels like it's being ignored. If I deliberately return 500 from the health check, Railway shuts down the old deployment and all requests to the app fail, even though the new deployment's health checks never succeeded. Hence the app is just down.
Am I missing something in my setup?
9 Replies
Project ID:
6fa25f10-f91d-41de-83e0-a8a88e5c76b9
Not sure how relevant it is, but between the old app shutting down and the new one starting I see this in the logs
Couldn't figure it out after a few more attempts, but it's a show-stopper for us, will have to migrate off Railway due to that 😦
I have logging in my health check endpoint, and during the deployment I made requests roughly every second by constantly refreshing the main page. In the logs here you can see a 30s window where the application was down and I saw Railway's "application didn't respond" screen
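For anyone who wants to reproduce this measurement without refreshing by hand, here is a rough sketch of a probe that polls a URL once per second and reports the longest outage it saw. The function names (`check_once`, `longest_outage`, `probe`) and the default timings are made up for illustration and are not part of any Railway tooling.

```python
import http.client
import time
import urllib.request


def check_once(url, timeout=2.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError (4xx, 5xx), timeouts and refused or reset
        # connections all count as "down" for this probe.
        return False


def longest_outage(samples):
    """Given (timestamp, ok) pairs in time order, return the longest stretch
    in seconds from a first failed sample to the next successful one."""
    longest = 0.0
    down_since = None
    for ts, ok in samples:
        if not ok and down_since is None:
            down_since = ts
        elif ok and down_since is not None:
            longest = max(longest, ts - down_since)
            down_since = None
    if down_since is not None:
        # Still down when sampling stopped: count up to the last sample.
        longest = max(longest, samples[-1][0] - down_since)
    return longest


def probe(url, duration=120.0, interval=1.0):
    """Poll the URL every `interval` seconds for `duration` seconds."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), check_once(url)))
        time.sleep(interval)
    return samples
```

Start something like `print(longest_outage(probe("https://your-app.example/")))` just before triggering a deploy (the URL is a placeholder). The number is only accurate to about one polling interval plus the request timeout.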
do you have a volume on your service?
yep, I have one volume mounted
https://docs.railway.app/reference/volumes#caveats
To prevent data corruption, we prevent multiple deployments from being active and mounted to the same service. This means that there will be a small amount of downtime when re-deploying a service that has a volume attached
but it also happened prior to attaching the volume, shouldn't be related
I also tried on the separate instance with no volume, same behaviour
"This means that there will be a small amount"
but nevertheless, a small amount is not meant to be 20+ seconds, right?
20 seconds seems a little much, 10 seconds sounds more like what I've personally experienced
but of course, I'll do some tests with a basic HTTP server + health check, and get back to you both
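A test service along those lines could look roughly like the sketch below. This is not the actual test app used in this thread. It assumes the port arrives in a `PORT` environment variable and that the health check path is `/health`, and `FORCE_UNHEALTHY` is a made-up switch for reproducing the "deliberately return 500" case from the original post.

```python
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # FORCE_UNHEALTHY is a hypothetical toggle, read on every request.
            if os.environ.get("FORCE_UNHEALTHY") == "1":
                self._reply(500, b"unhealthy")
            else:
                self._reply(200, b"ok")
        else:
            self._reply(200, b"hello")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        # Print every request so health check hits show up in deploy logs.
        print("%s %s" % (self.address_string(), fmt % args), flush=True)


def make_server(port=None):
    """Build the server; the port falls back to $PORT, then 8080 (assumed)."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)
```

Run it with `make_server().serve_forever()` and point the service's health check path at `/health`. Logging every request makes it easy to see whether the platform's health check actually reached the new deployment before the old one was removed.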
here's my report on a switchover test without a volume, no issues from my tests - https://discord.com/channels/713503345364697088/1202753029531766864/1203069079364042843
however, with the exact same tests but with a volume on the service, I'm able to see a Railway error page for ~30-40 seconds, so yeah, there's something going on here, will ask the team!
thank you!