Railway•10mo ago

Render Error on new deployment

On a new deployment, there's always a brief period of time (5-25 seconds) where the website intermittently returns an error with Railway theme and some generic error text (I forget what it is). Is that behavior expected?

129 Replies

Percy•10mo ago

Project ID: 87f6d50b-7bab-488e-b802-02f9edc442e3

Brody•10mo ago

Render's theme?

SHxKMOP•10mo ago

87f6d50b-7bab-488e-b802-02f9edc442e3. Sorry! Fixed the text. Technically it's some sort of "render error", so I hope the title can stay.

Brody•10mo ago

can you get a screenshot of this?

SHxKMOP•10mo ago

The issue is on Railway, though. Sure, next time it happens.

Zazh•10mo ago

maybe the like not found error?

Brody•10mo ago

does it have a railway logo?

SHxKMOP•10mo ago

yes

Brody•10mo ago

application not responding?

SHxKMOP•10mo ago

This is what I meant by "theme". I doubt that; Railway promotes the new deployment long after it has succeeded, I'd assume.

Brody•10mo ago

do you have a volume on your service?

SHxKMOP•10mo ago

There it is

SHxKMOP•10mo ago

Well on two of them: Postgres and Redis

Brody•10mo ago

but do you have a volume on that service though

SHxKMOP•10mo ago

on my web service? no

Brody•10mo ago

are you using a health check?

SHxKMOP•10mo ago

Brody•10mo ago

then there's your problem: without a health check, Railway doesn't know exactly when your app is ready to handle requests. what kind of web service?

SHxKMOP•10mo ago

it's a Django app. Well I thought since the deployment is only turning green after the Gunicorn worker is up, then everything should be properly set up. Sometimes it happens for over 10 seconds, which is really making me think this isn't a healthcheck/no healthcheck issue.

Brody•10mo ago

it turns green when the container is run, not necessarily when your app can handle a request. go ahead and add a health check just to get that out of the way

SHxKMOP•10mo ago

will do sir

Brody•10mo ago

and you are sure the django service doesnt have a volume?

SHxKMOP•10mo ago

Is this proof?

Brody•10mo ago

yes haha

SHxKMOP•10mo ago

Is the health check path relative?

Brody•10mo ago

never heard someone use that term for a url lol

SHxKMOP•10mo ago

Point taken. Should I use my domain then, or Railway's? Or either?

Brody•10mo ago

it only accepts a path, like /api/v1/healthz

SHxKMOP•10mo ago

huh, failing for 5 straight minutes for some reason...Even though all it does is return an empty 200 response (not even checking databases or anything)

Brody•10mo ago

guess you got the path wrong or something?

SHxKMOP•10mo ago

I don't understand. My healthcheck path is up; I tried /up/, up and up/. None worked. If I access mydomain.com/up/, it returns 200. On the other hand, I do see Gunicorn returning 400 for the healthcheck tests. So weird...

Brody•10mo ago

you likely have some middleware interfering, like checking allowed hosts on it

SHxKMOP•10mo ago

just added ...railway.internal there

Brody•10mo ago

the health checks are done from local ipv4 addresses

SHxKMOP•10mo ago

Aha I guess the solution isn't to whitelist every host right? how do I know what to whitelist?

Brody•10mo ago

dont have any middleware run for the health check path

SHxKMOP•10mo ago

huh, so basically for that path, have ALLOWED_HOSTS = ["*"]? This looks like it should be simpler than this
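
One way to keep host validation away from the health check path, sketched here with only the standard library: wrap the WSGI application so the health check is answered before Django sees the request. This is an assumption-laden illustration, not what the poster ended up doing (they later mention returning 200 from a Django middleware). `HealthCheckMiddleware` and `HEALTH_PATH` are hypothetical names, and `/up/` is taken from the path mentioned above.

```python
from typing import Callable, Iterable

# Assumed health check path, taken from the "/up/" mentioned in the thread.
HEALTH_PATH = "/up/"


class HealthCheckMiddleware:
    """WSGI wrapper that answers the health check path before the wrapped app runs."""

    def __init__(self, app: Callable, path: str = HEALTH_PATH) -> None:
        self.app = app
        self.path = path

    def __call__(self, environ: dict, start_response: Callable) -> Iterable[bytes]:
        if environ.get("PATH_INFO") == self.path:
            # Reply directly; the wrapped app (and its Host header validation)
            # is never reached for this path.
            start_response(
                "200 OK",
                [("Content-Type", "text/plain"), ("Content-Length", "2")],
            )
            return [b"ok"]
        # Every other path goes through the normal application stack.
        return self.app(environ, start_response)
```

In a Django project this would typically wrap the object returned by `get_wsgi_application()` in `wsgi.py`, for example `application = HealthCheckMiddleware(get_wsgi_application())`, so ALLOWED_HOSTS stays strict for all other paths.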

Brody•10mo ago

thats django for you

SHxKMOP•10mo ago

Hah that's a bit cheeky Thanks a lot for your help!

Brody•10mo ago

you got a health check working?

SHxKMOP•10mo ago

Nope. My website gets like 20 visits a day right now, 20 of which are by me, so I'm just gonna put that on the TODO list. Have to find a proper way to do this.

Brody•10mo ago

sounds good

SHxKMOP•10mo ago

@Brody this is after setting up the healthcheck successfully:

SHxKMOP•10mo ago

Yeah there's definitely something happening once there are two green deployments, and the older one is removed. I can reproduce it quite consistently.

sergey•10mo ago

Btw, that's the 3rd report of the same issue in the last couple days. Same thing I posted in the other thread https://discord.com/channels/713503345364697088/1202585121677643836 Might be some global problem?

SHxKMOP•10mo ago

Exactly the symptoms I’m experiencing (after adding a proper health check)

Brody•10mo ago

you definitely could get into a scenario where your app responds to a health check but not to actual traffic

SHxKMOP•10mo ago

@Brody can you please elaborate? If I temporarily move my health-check to be the very same path that I receive "actual" traffic on, will that be enough to investigate on your side?

Brody•10mo ago

haha I don't have any other side than the community side, I don't work for Railway

SHxKMOP•10mo ago

I thought you did honestly. But by side I meant “end”. It’s not about sides it’s about whether issues are investigated properly. Where can I raise an issue regarding this? As @sergey said, this is not an isolated incident.

Brody•10mo ago

I'll try to reproduce

SHxKMOP•10mo ago

Total guess but what I think is happening is Railway right after the switch sends some of the requests to the terminated/to be terminated deployment. As Sergey said, a period of 5-20 seconds where we see this error. This is after the new deployment has responded successfully to the health check.

Brody•10mo ago

my guess is that django is doing some unwanted behaviour, same with the app Sergey is running, I'll try to reproduce with a simple http server with no middleware stacks or anything of the sort

SHxKMOP•10mo ago

So two tried frameworks are doing unwanted behavior? Maybe it’s the third 😉

Brody•10mo ago

but for transparency, if you have a volume (you don't but Sergey might) there will be downtime, as two services can't connect to the same volume at the same time

sergey•10mo ago

I have a Node server, btw, not Django

Brody•10mo ago

just did a few back to back tests for a basic http server with health check, no volume. during the period of switch over, at a refresh rate of 250ms and cache disabled, i only saw a singular flash of the railway page
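
For anyone wanting to repeat this kind of test without a browser, a rough standard-library sketch follows. It requests a URL every 250 ms and records the status codes so that non-200 responses during the switch-over stand out. The function names are hypothetical, the `Cache-Control` header only approximates a browser's "disable cache" setting, and this is not the tool used in the test above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request failed outright."""
    request = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, ...


def poll(
    url: str,
    duration: float = 60.0,
    interval: float = 0.25,
    fetch: Callable[[str], int] = fetch_status,
) -> List[Tuple[float, int]]:
    """Request url every `interval` seconds for `duration` seconds.

    Returns a list of (seconds_since_start, status_code) samples.
    """
    samples: List[Tuple[float, int]] = []
    start = time.monotonic()
    while (elapsed := time.monotonic() - start) < duration:
        samples.append((round(elapsed, 2), fetch(url)))
        time.sleep(interval)
    return samples


def non_200(samples: List[Tuple[float, int]]) -> List[Tuple[float, int]]:
    """Keep only the samples whose status code was not 200."""
    return [sample for sample in samples if sample[1] != 200]
```

Starting `poll("https://your-app.example/")` (a placeholder URL) just before triggering a deployment and then printing `non_200(...)` gives a timestamped record of how long the error window lasts.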

SHxKMOP•10mo ago

Well that is definitely not my experience. Which web server were you using? If this is something common to Node.js and Gunicorn/Django, then it must be an obvious config step

Brody•10mo ago

i am running a golang stdlib http server, with the chi router. but here's the update on sergey's issue: https://discord.com/channels/713503345364697088/1202585121677643836/1203077349432758289 keep in mind, they are using a volume, so their issue does not apply to you, since you are not using a volume

SHxKMOP•10mo ago

Very interesting. Believe it or not I’m relieved to know it’s probably a misconfig on my part.

Brody•10mo ago

are you using a readiness type health check? aka a health check that confirms your app is talking to your database. of course i'm not, but my test app doesn't talk to a database

SHxKMOP•10mo ago

No. I’m just returning a 200 from the Django middleware since Railway is using a random IP each time, but I’ll add those checks in a bit

Brody•10mo ago

your health check should be made not to care who or what is making the health check

SHxKMOP•10mo ago

I don't necessarily agree with that. The ALLOWED_HOSTS setting is an important one, and "exempting" the health-check endpoint from enforcement is a workaround to make it work with Railway.

Brody•10mo ago

then it would be a work around to work with any similar hosting service, railway isnt going to make the request with a masked host header, the health check should be relatively dumb

SHxKMOP•10mo ago

At least on the good news front, the latest deployment didn't show this kind of behavior when I ensured Redis + DB connections before returning 200:

if request.path in self.EXEMPT_PATHS:
    redis.ping()
    connection.ensure_connection()

I'll keep an eye on this for the next few days for sure
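
The idea in the snippet above (only report healthy once Redis and the database answer) can be generalized as a small helper. This is a minimal sketch under assumptions: `readiness_status` is a hypothetical name, and returning 503 on failure is a common convention rather than something the thread prescribes.

```python
from typing import Callable, Dict, Tuple


def readiness_status(checks: Dict[str, Callable[[], object]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check; return 200 only if none of them raises.

    `checks` maps a label to a zero-argument callable that raises on failure.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # any failure means "not ready yet"
            results[name] = f"failed: {exc.__class__.__name__}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results
```

With the objects from the snippet above, the call would look like `readiness_status({"redis": redis.ping, "db": connection.ensure_connection})`, and the health check view or middleware would return the resulting status code.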

Brody•10mo ago

sounds good!

SHxKMOP•10mo ago

Thanks again Mr. @Brody

Brody•10mo ago

always happy to help

SHxKMOP•10mo ago

Yeah, spoke too soon:

Brody•10mo ago

is your previous deployment getting shut down before the new deployment is live? check its state during the transition

SHxKMOP•10mo ago

But now at least I see this:

192.168.0.2 - - [02/Feb/2024:21:16:07 +0000] 'GET / HTTP/1.1' 500 145 'https://xxxxxxxx.com/'; 'Mozilla/5.0 (XXXX) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36' in 296216µs

So it's definitely on me. It's only getting shut down after the new deployment is green and shows active, so I assume something is going wrong on my side.

Brody•10mo ago

but whats its state after the failed health check timeout limit? active, complete?

SHxKMOP•10mo ago

The health check is succeeding.

Brody•10mo ago

oh my bad, that log line is not an access log for the health check request

SHxKMOP•10mo ago

And I actually think we still have an issue here, I just happened to deploy a bug now at the same time 🙂 No, that's Gunicorn.

Brody•10mo ago

bad wording, fixed

SHxKMOP•10mo ago

Yeah, something is definitely up. I think I'll just record a video to prove it, because at times (this isn't the first time it happened), the response goes like this: Railway "app failed" screen, Railway "app failed" screen, 200, Railway "app failed" screen, Railway "app failed" screen, 200. And from there it sorts itself out.

Brody•10mo ago

i saw that, but only during my tests with a volume

SHxKMOP•10mo ago

We've already established I'm not using those for the deployed service. May I ask if you performed the test more than once? I see this happening for 50-70% of deployments, it's not 100% of the time.

Brody•10mo ago

yes I've done it multiple times

SHxKMOP•10mo ago

Well, maybe the same issue in https://discord.com/channels/713503345364697088/1202585121677643836/1203077349432758289 is also affecting me here.

Brody•10mo ago

i just cant reproduce it

SHxKMOP•10mo ago

How can I raise a support ticket? When Gunicorn/Django return 500, a “classic” 500 Server Error is displayed: black text on a white background. This isn’t what’s happening here. Railway is routing traffic to a deployment it shouldn’t route to. I deployed a bug earlier that caused the app to consistently return 500. When that happens, a regular server error is returned, without Railway’s theme.

Brody•10mo ago

as a hobby user you get community support. why is django returning 500 anyway?

SHxKMOP•10mo ago

I intentionally (OK let’s pretend that it was intentional) inserted a bug. So Django throws an exception, and Gunicorn returns 500. It’s different from the error page displayed by Railway, which tells me it’s a routing issue.

Brody•10mo ago

if you see the railway page during normal operation that means your app didnt answer railway's proxy request, therefore the error lies with your app, heres proof of that https://utilities.up.railway.app/status-code/500

SHxKMOP•10mo ago

I don’t see it during normal operation, that’s the point. I see it during Railway’s deployment. This is what I’m trying to say.

Brody•10mo ago

during the transition period?

SHxKMOP•10mo ago

Yes

Brody•10mo ago

do you see a build log that the health check succeeded?

SHxKMOP•10mo ago

Yes. And 200 from Gunicorn logs for the health check path.

Brody•10mo ago

how many tries until first success?

SHxKMOP•10mo ago

2-3 times with 503, then succeeds. I don’t think it’s the new instance that’s not “answering” the proxy request. I think some traffic for a short period of time is directed to the old instance.

Brody•10mo ago

during these health checks of the new deployment, what is the status of the previous deployment

SHxKMOP•10mo ago

Active

Brody•10mo ago

and it stays active until the new deployment is switched in? can you triple check this for me

SHxKMOP•10mo ago

I will right now. But what does “switched in” mean here: becomes “green” colored? Turns itself to “Active”?

Brody•10mo ago

correct. i shall try my tests again but with an artificial health delay where i return 503 for the first 5 seconds. sound like a more appropriate test?

SHxKMOP•10mo ago

Let me document what’s happening: So once the new deployment kicks in (building), the other is green but it doesn’t say Active

SHxKMOP•10mo ago

Here's this state

SHxKMOP•10mo ago

Let me capture the state when they're both green

SHxKMOP•10mo ago

In this picture, the upper one is the new deployment, which just succeeded its health-check.

SHxKMOP•10mo ago

Build logs for the new deployment:

SHxKMOP•10mo ago

Deploy logs for the new deployment:

SHxKMOP•10mo ago

Deploy logs for the old deployment:

SHxKMOP•10mo ago

The issues definitely start AFTER the old deployment is removed. While both deployments are green, everything works fine. But for a (not so) short period of time, once the old deployment is moved to "HISTORY", the Railway screen of death appears.

SHxKMOP•10mo ago

Browser console, don't know what that :1 means...hope it's not the port:

Brody•10mo ago

it means the first line of that file lol

SHxKMOP•10mo ago

yeah, so this is what I have. I'll upgrade to Pro temporarily to get this looked at if that's needed. I don't see how a tried and tested server like Gunicorn returns 503 or doesn't respond after it has booted up and returned 200 already.

Brody•10mo ago

going to test an artificial health check delay, will get back to you

SHxKMOP•10mo ago

I'm not optimistic about this, as you can see it took Gunicorn 2 seconds. I'd test two things here: either Gunicorn itself, with its default graceful shutdown configuration, or a server that sleeps for 30 seconds on SIGTERM. I'm close to convinced the terminating instance is still receiving traffic, but it has already hung up. Or something along those lines.
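
The second experiment proposed above (a server that keeps going for 30 seconds after SIGTERM) could look roughly like this with the standard library alone. It is a sketch for running that test, not a statement about how Railway or Gunicorn behave; `OkHandler`, `shutdown_after`, `main`, the port and `GRACE_SECONDS` are all assumptions made for illustration.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

GRACE_SECONDS = 30.0  # assumed value, from the "30 seconds on SIGTERM" idea


class OkHandler(BaseHTTPRequestHandler):
    """Answers every GET with a plain 200 so routing can be observed."""

    def do_GET(self) -> None:
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        pass  # keep the sketch quiet; remove to see access logs


def shutdown_after(server: ThreadingHTTPServer, grace: float) -> threading.Timer:
    """Keep serving for `grace` seconds, then stop the serve_forever() loop."""
    # shutdown() must run in a different thread than serve_forever(),
    # which is why a Timer thread is used here.
    timer = threading.Timer(grace, server.shutdown)
    timer.daemon = True
    timer.start()
    return timer


def main(port: int = 8000) -> None:
    server = ThreadingHTTPServer(("0.0.0.0", port), OkHandler)
    # On SIGTERM, do not exit immediately: keep answering for GRACE_SECONDS.
    signal.signal(
        signal.SIGTERM,
        lambda signum, frame: shutdown_after(server, GRACE_SECONDS),
    )
    server.serve_forever()
    server.server_close()
```

Calling `main()` starts the server. If requests still reach this process after it has received SIGTERM, that would support the theory that the terminating instance keeps getting traffic for a while.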

Brody•10mo ago

you may be able to get railway to wait 30 seconds after sending sigterm before force killing the old container, but gunicorn isnt going to answer requests after sigterm anyway

SHxKMOP•10mo ago

I don't really care, I can configure it to go dead immediately. My question is: is this behavior tripping up Railway? One more thing to note: I am pretty sure that while both deployments are green (so just as the new deployment becomes healthy), Railway is still directing traffic exclusively to the previous (still green) deployment. I guess this is desired behavior, but thought I'd mention that because I don't know what's right and wrong anymore.

Brody•10mo ago

that's correct from my understanding. railway routes traffic to the previous deployment for a default of 20 seconds, then kills the old deployment and switches over after 20 seconds or when the new deployment's health check succeeds, whichever comes last

SHxKMOP•10mo ago

Got it. Well, I’m still hopeful @sergey’s thread investigation by support will bring results for services without volume mounts as well. He also reported the issue occurs without a volume attached.

Brody•10mo ago

i can reproduce your issue when the health check does not succeed right away, doing some more tests

SHxKMOP•10mo ago

Oh this is interesting

Brody•10mo ago

i set RAILWAY_DEPLOYMENT_OVERLAP_SECONDS to 35 and did a bunch more test runs, with that set to 35, i didnt even get a single flicker of the railway page

SHxKMOP•10mo ago

I’ll try that early tomorrow. What does this env var mean exactly? Why do you think it solves the issue?

Brody•10mo ago

it's explained a bit in railways docs, but dinner time now so I can't link

SHxKMOP•10mo ago

Bon appétit! I wrote and deployed much less today, but it seems that 31 seconds makes this go away as well. @Brody Should this be raised for support anyway in your opinion? I mean, the health check returned 200, why do I need to change an ENV VAR to make zero downtime deployments zero downtime?

Brody•10mo ago

yeah i set to 35 since i like jumping by 5 😆 yes i will bring this to the team's attention this week. i have some theories on why this is happening, but i would need the team to confirm. there's also the likelihood that even though this is brought to the team's attention, the fix would still be you setting that variable to 31, on account of them replacing their current proxy with an entirely new proxy built in house that would very likely fix this issue anyway. no point in patching their current proxy (when there is a workaround) if it's going to be ripped out anyway

SHxKMOP•10mo ago

I would just like confirmation that something is amiss, and some way to track it, so I know when to remove the patch.

Brody•10mo ago

if I can reproduce it, something is amiss 😆 but if I hear anything I'll be sure to tell you

SHxKMOP•10mo ago

@Brody weren’t you able to repro without the ENV variable, when the first (few) health check(s) fail?

Brody•10mo ago

yeah, why? ^

SHxKMOP•10mo ago

I shouldn't read anything after 23:00. I totally misinterpreted this message.

Brody•10mo ago

haha no worries
