Railway•6mo ago

We need serious help to continue using railway. Server is halting often and we are not sure why.

Every so often we see "Application Failed to Respond". There is no log about a crash or failure in the Railway logs, so we are not sure if the app actually failed. We do not see any logs explaining why this is failing, nor are there logs of the server restarting. We log "Server running on PORT ${PORT}" on server start, but this is also missing when the app comes back up. RAM and CPU usage seem normal just before the app goes down.
PLEASE help us. We have tried everything: sprinkled logs everywhere, added try/catch everywhere. At this point the only thing left for us to try is moving to a different host. Just in the last few hours the app went down 2 times, and it takes a while to come back up.
Project id: 6d0f799e-be59-4388-899d-f00456f30667
I have a post for this already: https://discord.com/channels/713503345364697088/1246339314829492294/1246339314829492294

Percy•6mo ago

Project ID: 6d0f799e-be59-4388-899d-f00456f30667

Brody•6mo ago

are you using the legacy or the v2 runtime?

KiBenderOP•6mo ago

How do I check? Maybe. This is an old app

Brody•6mo ago

in the service settings

KiBenderOP•6mo ago

Yes

Brody•6mo ago

switch to v2

KiBenderOP•6mo ago

Will this fix the issue?

Brody•6mo ago

I make absolutely no promises

KiBenderOP•6mo ago

If the server is restarting I should see "Server running on PORT ${PORT}" again, right? That is not happening. The logs just disappear, then after a while the app comes back up.

Brody•6mo ago

yes that is correct, that's the behaviour you should see if the app is restarting, but with the information I have this seems like your app is locking up. try the v2 runtime and report back

KiBenderOP•6mo ago

Can you explain what is "locking up"?

Brody•6mo ago

soft locking, Google could explain the term better than I could

KiBenderOP•6mo ago

Okay, got it. Is there something we should do at the app level to avoid this?

Brody•6mo ago

I wouldn't be able to tell you that, there's a million different things that can cause code to lock up, but definitely try the v2 runtime!
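One possible way for a server process to "lock up" without crashing is a blocked event loop: the process stays alive, so nothing restarts and no crash is logged, but requests stop being answered. A minimal sketch of a lag monitor that would at least leave a log line when that happens, assuming the app runs on Node.js. The names `computeLagMs` and `startEventLoopLagMonitor` and the thresholds are hypothetical, not part of Railway or any library:

```typescript
// Pure helper: how much later than planned did a timer fire?
// scheduledAt/firedAt are millisecond timestamps, intervalMs is the planned delay.
function computeLagMs(scheduledAt: number, firedAt: number, intervalMs: number): number {
  return Math.max(0, firedAt - scheduledAt - intervalMs);
}

// Starts a repeating timer and logs whenever it fires much later than expected,
// which suggests something blocked the event loop in between.
// Returns a function that stops the monitor.
function startEventLoopLagMonitor(
  intervalMs: number = 500,    // assumption: check twice per second
  warnAboveMs: number = 200,   // assumption: 200ms of lag is worth a log line
  log: (message: string) => void = console.warn
): () => void {
  let scheduledAt = Date.now();
  const timer = setInterval(() => {
    const firedAt = Date.now();
    const lag = computeLagMs(scheduledAt, firedAt, intervalMs);
    if (lag > warnAboveMs) {
      log(`event loop lag: ${lag}ms`);
    }
    scheduledAt = firedAt;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `startEventLoopLagMonitor()` once at startup would be enough. If the logs stay quiet during an outage, that points away from a blocked event loop and toward something outside the process.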

angelo•6mo ago

Hey there @KiBender - I know you’ve raised support questions in the past and I am sorry to continually make you rehash information.
Are you perchance using Prisma or managing a large amount of DB connections?
There are a few things in flight to fix stability on the platform (such as V2 Runtime and V2 Proxy). I don’t wanna rule out Railway, but I wanna make sure I am able to gather the properties of your app.

KiBenderOP•6mo ago

Yes, we are using Prisma. But we switched to a new database using Kysely, on your suggestion, because of heavy memory usage in Prisma.
Prisma for app settings, Kysely for data, both on separate DBs.

angelo•6mo ago

Fair and noted, do you see the restarts on load or just randomly?

KiBenderOP•6mo ago

Randomly. We've been trying to correlate the last logs and work backwards from there, but honestly every time it's different, and the errors are ones we already handle.

angelo•6mo ago

Also, for the DB connections, are you using the internal network? (I think you are.)
If random, then I have a strong suspicion that the new runtime would help.

KiBenderOP•6mo ago

"monorail.proxy.rlwy.net"
I think sometime in the past, to debug this, we moved to the non-internal network, and to viaduct.proxy.rlwy.net.
Both are external URLs, right?

angelo•6mo ago

I would suggest you move to the internal network so you don’t get hit with egress charges. Also, we control that network, not GCP, and we have been continually mitigating public connection issues. (Knock on wood, none more yet, but you never know.)

KiBenderOP•6mo ago

I will do that, will try to switch to internal urls

Brody•6mo ago

(and the v2 runtime)

angelo•6mo ago

If your application depends on Postgres connections and you get random disconnects, your connection pool will very likely be exhausted.

KiBenderOP•6mo ago

Already redeployed with v2 as well

angelo•6mo ago

It’s 2 AM my time, but if you can, keep us updated. We have the whole logistics team on standby addressing customer issues this whole week. We know about the flakiness and we are addressing every issue as quickly as we can dispatch them.
!t

Duchess•6mo ago

New reply sent from Help Station thread:

This thread has been escalated to the Railway team.

You're seeing this because this thread has been automatically linked to the Help Station thread.

KiBenderOP•6mo ago

Thank you both of you

angelo•6mo ago

Melissa is our on-call this week, so she will very likely be the person from the Railway team who responds. Thank you for your patience despite your frustration.

KiBenderOP•6mo ago

- We know the app is only failing during the daytime for us. Our app users are all from my country, so it's when people are using it
- We have a fairly large DB which is updated often. We run a Shopify app and we have to sync every order update and delete with webhooks. We use the Kysely query builder for this
- We have 2 DBs, for data and settings
- We have zero logs about why it's failing

angelo•6mo ago

Doesn’t seem that random then, if there is load. However, my gut is telling me those connection drops are exhausting the mem pool.

KiBenderOP•6mo ago

But ram usage doesn't look like there is much load

angelo•6mo ago

However, the connection drops might be. Nothing concrete, but I think we should also try the new proxy as well.

KiBenderOP•6mo ago

Random in the sense that I cannot identify anything that is causing this based on the logs.

angelo•6mo ago

Understood, that's frustrating.

KiBenderOP•6mo ago

how do I try the new proxy?

angelo•6mo ago

Edge Proxy Beta in service settings

KiBenderOP•6mo ago

Should I turn it on or see if it fails with the new runtime first?

angelo•6mo ago

(I for one am not a big fan of telling people to enable beta features to fix prod problems)
Let’s wait for the new runtime to do its thing.
Then, if it still fails, we can add the proxy.

KiBenderOP•6mo ago

Cool, thank you. I will report back.

Brody•6mo ago

fwiw I have seen the v2 runtime fix flakiness

KiBenderOP•6mo ago

that would be ideal

Brody•6mo ago

same with the edge proxy, and same with the v2 builder, railway is cooking

angelo•6mo ago

To set expectations, I likely won’t be doing the follow-up as my shift is ending (well, it did, but I take it personally when anyone has a bad time on here), but I jumped in to make sure you know that the team is as on top of this as we can be. Not fast enough, but it’s all hands on deck here.
Checking in @KiBender - no issues over your day?

Ray•6mo ago

cc @Mig

Mig•6mo ago

hey @KiBender, I've built the new edge proxy we're asking you to use. We've had some customers mention intermittent 503 responses, and switching to the new proxy has resolved the issue.
On the topic of beta software: this proxy has been in production for several months handling our own internet properties (nixpacks.com, help.railway.app, blog.railway.app), and other users have been opting in to the new proxy over the past month with great success. Every customer we've suggested the switch to has reported no more 503 responses (application failed to respond).
It also offers faster deploy times and a request id mechanism so we can determine what went wrong with your request. We plan on surfacing this information in the dashboard directly soon.
If you run into any issues, you can revert to the old proxy, and you will be back on the proxy you're used to after 1 minute (DNS cache). You can also @ me directly on Discord and I'll help right away (I'm in the ET timezone).

KiBenderOP•6mo ago

Hello @Angelo No issues yesterday.
Thanks @Mig Will try to switch to the new proxy today and revert back

angelo•6mo ago

Beautiful news, we will keep our eyes out.

Brody•6mo ago

love to hear this, v2 everything is a massive improvement

KiBenderOP•6mo ago

Hello @Mig @angelo This morning the app crashed. Railway was saying it crashed, but it did not restart, even though the Restart policy in the service settings is Always.
I clicked restart and it said it restarted, but the app is still down.

Mig•6mo ago

it's in this state right now, right?

KiBenderOP•6mo ago

Yes
No logs in Railway and no logs in GlitchTip (we use it for errors)

angelo•6mo ago

This service? https://railway.app/project/6d0f799e-be59-4388-899d-f00456f30667/service/68f526be-f01d-4a18-9978-cb549b4e5d7a?id=f1724d68-6d54-4b12-a84e-255b501e9584

KiBenderOP•6mo ago

https://gst-next.storetools.io/
The app is just loading. It won't work as a direct URL, but it should show unauthorized.
Yes
The prod environment

angelo•6mo ago

Gotcha- mind if I trigger a rebuild?

KiBenderOP•6mo ago

No worries. But I wanted to show you this before I click

angelo•6mo ago

Wanna see where it's related. Do so, your uptime matters more than our debugging.

KiBenderOP•6mo ago

100%
Can I redeploy now?

angelo•6mo ago

Yes

KiBenderOP•6mo ago

Weirdly, this time we did not see the black screen. It was loading, then errored out. Our app runs inside Shopify. I feel this is a little bit less scary than the black Railway error.

Mig•6mo ago

app is loading now

KiBenderOP•6mo ago

App is working now after the redeploy

Mig•6mo ago

I believe I know what the issue was. Will dig into it.

KiBenderOP•6mo ago

I would appreciate it a lot if you can help us solve this. We are actually seeing an increase in uninstalls now.

KiBenderOP•6mo ago

Railway shows both as active

angelo•6mo ago

In this case, I think it would be wise to spin up a few replicas for failover just in case. I think we know the root cause in this case and it isn't a V2 bug, just a possibly extremely unlucky circumstance.

KiBenderOP•6mo ago

Damn. Was this the same issue with v1?

angelo•6mo ago

No, the switch to V2 fixed the Railway error issue that you saw.
The (edit: likely) reason you ran into this error was that the machine your app landed on was CPU-bound and slow to respond. A case of bad luck in this case.

Mig•6mo ago

If you increase the replica count, your app will be placed on multiple boxes, so if 1 has an issue we'll route to another instance.
This wasn't an issue with the v2 proxy.

KiBenderOP•6mo ago

I can create 3 replicas, but we have cron jobs happening every 24 hours. Would we need to move those out to a separate service?

angelo•6mo ago

Preferably yes
I have also compensated your account for the outages that you've faced, so that you do not have to bear the increased financial cost of the replicas on your account.
Can you confirm that you received credits on your account?

KiBenderOP•6mo ago

Yes I see "$ 200.00" Credits Available

angelo•6mo ago

Gotcha- hopefully this holds you over in the meantime. Can you go ahead and deploy a few replicas?

KiBenderOP•6mo ago

How many do you suggest? I have never done this before

angelo•6mo ago

3 should be good
This is horizontal scaling and how you get redundancy, welcome to Scale™️
Jokes aside, this will make it so that there should always be a healthy instance to take a request.
Essentially works like this
[]  []  []
 |   |   |
  \  |  /
  [Proxy]
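The idea of several replicas behind one proxy can be illustrated with a toy model: walk the replicas round-robin and skip any that are marked unhealthy. This is only a sketch of the general technique, not Railway's proxy implementation. The `Replica` type and `pickReplica` function are hypothetical names made up for the illustration:

```typescript
// Hypothetical shape for one running copy of the app.
interface Replica {
  id: string;
  healthy: boolean;
}

// Round-robin selection that skips unhealthy replicas.
// cursor is the index to try first; the result carries the cursor for the next request.
// Returns undefined when no replica is healthy (or the list is empty).
function pickReplica(
  replicas: Replica[],
  cursor: number
): { replica: Replica; nextCursor: number } | undefined {
  for (let i = 0; i < replicas.length; i++) {
    const index = (cursor + i) % replicas.length;
    const candidate = replicas[index];
    if (candidate && candidate.healthy) {
      return { replica: candidate, nextCursor: (index + 1) % replicas.length };
    }
  }
  return undefined;
}
```

With a single replica, one slow box means every request is affected. With three, a request only fails in this model when all three are unhealthy at once.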

KiBenderOP•6mo ago

Thank you. I think I understand this. But we would still love to understand what is going on in the app

angelo•6mo ago

As in? The error you faced before or what replicas do to your app.

Mig•6mo ago

I want to ship access logs to the dashboard. So in this case you would see that a request was made to your app but the app failed to respond (timeout). It wouldn't tell you that the machine the app was on was having issues, though.

KiBenderOP•6mo ago

I'm still not fully sure this is just a Railway glitch. I have seen "Application not responding" many times before.
- Railway was not auto restarting before. It didn't now as well
- No logs from our app before and today.
Either we are extremely unlucky or I see a pattern :blob_help:

angelo•6mo ago

I will go ahead and answer both for you:
1. You got very unlucky with your workload placement. It didn't die, hence why you didn't see a crash; the box it was on was pinned, which made responses extremely slow. So it wouldn't have restarted, since your application never exited with a non-zero exit code.
1b. Runtime V2 has a known issue with logs; we are in the process of fixing it.
2. Replicas just run copies of your server, but we essentially give you only one interface to all the instances, you don't need to change anything with the
And after yesterday, short of this issue, you weren't reporting any issues. Was this still the case after the move to V2? (w.r.t. restarts)

KiBenderOP•6mo ago

yesterday I received this

KiBenderOP•6mo ago

We didn't trigger a deploy
It was the change to v2 which triggered the last deployment
Follow-up for 1: we are not seeing this for the first time after the v2 change. This has been happening for some weeks now. We have been thinking this is a code issue and trying to add try/catch everywhere

KiBenderOP•6mo ago

At least since May.

angelo•6mo ago

Yea, so V2 runtime fixes the above. But you are saying you are getting random deploy crashed emails?

KiBenderOP•6mo ago

Are you sure this is because of the v1 runtime? (No logs)
If that is the case and yesterday we just got unlucky, I'm happy 🙂

angelo•6mo ago

Yep, your customers won't see that error page anymore.

KiBenderOP•6mo ago

Hi, is there any update on this?

KiBenderOP•6mo ago

Hello, with 3 replicas we can no longer use the DB via Postico. Anything we should do about this?

Brody•6mo ago

are you using a database on railway?

KiBenderOP•6mo ago

Yes @Angelo

angelo•6mo ago

Back? Or is this related to the DB

KiBenderOP•6mo ago

We cannot connect to the DB. What should we do? How many connections does Railway allow? @T4P4N

Brody•6mo ago

not a railway limitation, postgres itself has a default limitation of 100 connections

KiBenderOP•5mo ago

Hello @Angelo we had another downtime with no errors. I reduced the replica count so I can connect to the DB with Postico, which might be why the downtime was visible. We are still not sure what conclusion to draw from this. Should we always keep 3 instances running?

Brody•5mo ago

at this point I think it's plenty safe to say that this is an issue with your application itself

Adam•5mo ago

how much load testing, AB testing, regression testing, etc do you do? Same goes for logging. if something is failing with no logs, that means you’re not logging enough

Ray•5mo ago

You're likely creating too many connections to your database, and that gets worse with each replica because another instance of your app means $numberOfConnections * $numberOfReplicas. I'd suggest looking into connection pooling via the SQL client/ORM you're using in your app, and tracking down where you're creating connections (ideally you'd only need a handful that gets passed around; the connection object should be a singleton that gets reused across anything that requires it inside your app)
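The multiplication above can be made concrete with a small calculation. This is a sketch under stated assumptions: the limit of 100 is the Postgres default mentioned earlier in the thread, while the reserved count (connections kept free for tools such as Postico or cron jobs) and the function names `totalConnections` and `maxPoolSizePerReplica` are hypothetical:

```typescript
// Total connections the database sees when every replica fills its pool.
function totalConnections(poolSizePerReplica: number, replicas: number): number {
  return poolSizePerReplica * replicas;
}

// Largest pool size each replica can use without exceeding the server limit,
// while keeping `reserved` connections free for admin tools and other clients.
function maxPoolSizePerReplica(
  maxConnections: number,
  replicas: number,
  reserved: number
): number {
  if (replicas <= 0) {
    throw new RangeError("replicas must be a positive number");
  }
  return Math.max(0, Math.floor((maxConnections - reserved) / replicas));
}

// Example: a pool of 40 per replica is fine with 1 replica (40 connections),
// but 3 replicas would want 120, which is over a limit of 100.
const wanted = totalConnections(40, 3);
// With 10 connections reserved, each of 3 replicas could use at most 30.
const safePoolSize = maxPoolSizePerReplica(100, 3, 10);
console.log({ wanted, safePoolSize });
```

If the app talks to two separate databases, as this one does, the same budget would apply to each database on its own.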

KiBenderOP•5mo ago

I understand. Our app is still failing. Today it went down 2 times. It's a white screen instead of black. Anything we should look at? We still have no logs related to the failure

Brody•5mo ago

what ray said would be a very good starting point, perhaps you are opening a new database connection for every call to the database that your app makes? look into that, maybe you will need to implement pooling as he mentioned.
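The singleton idea can be sketched without any database library: create the client lazily once per process and hand the same object to every caller. `DbClient` and `createClient` below are hypothetical stand-ins for whatever pooled client the app really uses (for example a query builder instance backed by a connection pool); only the caching pattern is the point:

```typescript
// Hypothetical stand-in for a real pooled database client.
interface DbClient {
  query(sql: string): string;
}

// Counts how many clients were ever constructed, to make the effect visible.
let clientsCreated = 0;

// Hypothetical factory; in a real app this is where the pool would be opened.
function createClient(): DbClient {
  clientsCreated += 1;
  const id = clientsCreated;
  return {
    query: (sql: string): string => `client#${id} ran: ${sql}`,
  };
}

// Module-level cache: the first call creates the client, later calls reuse it.
let cachedClient: DbClient | undefined;

function getDb(): DbClient {
  if (cachedClient === undefined) {
    cachedClient = createClient();
  }
  return cachedClient;
}

// Anti-pattern for comparison: calling createClient() inside every request
// handler would open a new client (and new connections) per call.
```

Every module that needs the database would call `getDb()` instead of constructing its own client, so each replica opens one pool regardless of how many code paths query the database.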

gazhay•5mo ago

I think this may be completely unrelated, but I have found that sometimes, without error, the internal routing between app and DB has failed at some point and does not re-establish, so the front end goes down. Then when I restart/redeploy, either everything is ok, or I do then get an internal routing error.
I have just switched to the "v2" setting of "deploy">"runtime" in the service settings as a possible solution to this, and to the need to wait a number of seconds before connecting to the DB at boot time (also routing errors), but it's too soon to say if this has been the solution.
We only use one connection to the DB though, which is why I think it may be unrelated to your problem.
