We need serious help to continue using Railway. The server halts often and we are not sure why.

Every so often we see "Application Failed to Respond". There is no log of a crash or failure in the Railway logs, so we are not sure whether the app actually failed. Nor are there logs of the server restarting. We log "Server running on PORT ${PORT}" at server start, but that line is also missing when the app comes back up. RAM and CPU usage look normal just before the app goes down. PLEASE help us. We have tried everything: sprinkled logs everywhere, added try/catch everywhere. At this point the only thing left for us to try is a different host. Just in the last few hours the app went down 2 times, and it takes a while to come back up.
Project id: 6d0f799e-be59-4388-899d-f00456f30667
I have a post for this already: https://discord.com/channels/713503345364697088/1246339314829492294/1246339314829492294
100 Replies
Percy
Percy6mo ago
Project ID: 6d0f799e-be59-4388-899d-f00456f30667
Brody
Brody6mo ago
are you using the legacy or the v2 runtime?
KiBender
KiBenderOP6mo ago
How do I check? Maybe. This is an old app
Brody
Brody6mo ago
in the service settings
KiBender
KiBenderOP6mo ago
Yes
Brody
Brody6mo ago
switch to v2
KiBender
KiBenderOP6mo ago
Will this fix the issue?
Brody
Brody6mo ago
I make absolutely no promises
KiBender
KiBenderOP6mo ago
If the server is restarting I should see "Server running on PORT ${PORT}" again, right? That is not happening. The logs just disappear, then after a while it comes back up.
Brody
Brody6mo ago
yes that is correct, that's the behaviour you should see if the app is restarting, but with the information I have this seems like your app is locking up. try the v2 runtime and report back
KiBender
KiBenderOP6mo ago
Can you explain what is "locking up"?
Brody
Brody6mo ago
soft locking, Google could explain the term better than I could
KiBender
KiBenderOP6mo ago
Okay got it. is there something we should do in the app level to avoid this?
Brody
Brody6mo ago
I wouldn't be able to tell you that, there's a million different things that can cause code to lock up, but definitely try the v2 runtime!
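One way to check whether a Node.js app is "locking up" is to watch for event loop lag: a repeating timer that fires much later than scheduled suggests that something blocked the loop. The following is a minimal sketch under that assumption; `lagMs`, `startLagMonitor`, and the 200 ms threshold are illustrative names and values, not anything from the app in this thread.

```typescript
// Pure helper: how late did a timer fire compared with its schedule?
function lagMs(expectedAt: number, firedAt: number): number {
  return Math.max(0, firedAt - expectedAt);
}

// Starts a repeating timer and warns when it fires more than `thresholdMs`
// late. A blocked event loop cannot run this callback either, so the warning
// appears only once the loop is free again, together with the stall length.
function startLagMonitor(
  intervalMs = 1000,
  thresholdMs = 200, // assumed threshold, tune for your workload
): ReturnType<typeof setInterval> {
  let expectedAt = Date.now() + intervalMs;
  return setInterval(() => {
    const now = Date.now();
    const lag = lagMs(expectedAt, now);
    if (lag > thresholdMs) {
      console.warn(`event loop lag: ${lag} ms`);
    }
    expectedAt = now + intervalMs;
  }, intervalMs);
}
```

Calling `startLagMonitor()` once at boot would be enough. If warnings cluster right before the app stops responding, the cause is probably in the app's own code rather than in the platform.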
angelo
angelo6mo ago
Hey there @KiBender - I know you’ve raised support questions in the past and I am sorry to continually make you rehash information. Are you perchance using Prisma or managing a large amount of DB connections? There are a few things in flight that we are flighting to fix stability on the platform and I don’t wanna rule out Railway but wanna make sure I am able to gather the properties of your app. (Such as V2 Runtime and V2 Proxy)
KiBender
KiBenderOP6mo ago
Yes, we are using Prisma. But we switched to a new database using Kysely, on your suggestion, because of heavy memory usage in Prisma.
Prisma for app settings, Kysely for data, both on separate DBs.
angelo
angelo6mo ago
Fair and noted, do you see the restarts on load or just randomly?
KiBender
KiBenderOP6mo ago
Randomly. We've been trying to correlate it with the last logs and work back from there, but honestly every time it's different, and the errors are ones we already handle.
angelo
angelo6mo ago
Also for the DB connections, are you using the Internal network? (I think you are) If random then I have a strong suspicion that the new runtime would help then.
KiBender
KiBenderOP6mo ago
"monorail.proxy.rlwy.net". I think at some point in the past, to debug this, we moved to the non-internal network and viaduct.proxy.rlwy.net. Both are external URLs, right?
angelo
angelo6mo ago
I would suggest you move to the Internal network so you don’t get hit with egress charges but also, we control the network not GCP and we have been continually mitigating public connection issues. (Knock on wood none more yet but you never know.)
KiBender
KiBenderOP6mo ago
I will do that, will try to switch to internal urls
Brody
Brody6mo ago
(and the v2 runtime)
angelo
angelo6mo ago
If your application depends on postgres connections and you get random disconnects, very likely your connection pool will exhaust.
KiBender
KiBenderOP6mo ago
Already redeployed with v2 as well
angelo
angelo6mo ago
It’s 2 AM my time but if you can, keep us updated- we have the whole logistics team on standby addressing customer issues this whole week. We know about the flakiness and we are addressing every issue as quickly as we can dispatch them. !t
Duchess
Duchess6mo ago
New reply sent from Help Station thread:
This thread has been escalated to the Railway team.
You're seeing this because this thread has been automatically linked to the Help Station thread.
KiBender
KiBenderOP6mo ago
Thank you both of you
angelo
angelo6mo ago
Melissa is our on-call this week, it’s very likely the person from the Railway team who will respond will be her. Thank you for your patience even among your frustration.
KiBender
KiBenderOP6mo ago
- We know the app is only failing during the daytime for us. Our app users are all from my country, so that is when people are using it.
- We have a fairly large DB which is updated often. We run a Shopify app and we have to sync every order update and delete with webhooks. We use the Kysely query builder for this.
- We have 2 DBs, for data and settings.
- We have zero logs about why it's failing.
angelo
angelo6mo ago
Doesn’t seem that random then if there is load, however, my gut is telling me those connection drops are exhausting the mem pool.
KiBender
KiBenderOP6mo ago
But ram usage doesn't look like there is much load
angelo
angelo6mo ago
However, the connection drop might be- nothing concrete, I think we would also attempt to try the new proxy as well.
KiBender
KiBenderOP6mo ago
Random in the sense, i cannot identify something that is causing this based on the logs.
angelo
angelo6mo ago
Understood, that's frustrating.
KiBender
KiBenderOP6mo ago
how do I try the new proxy?
angelo
angelo6mo ago
Edge Proxy Beta in service settings
KiBender
KiBenderOP6mo ago
Should I turn it on or see if it fails with the new runtime first?
angelo
angelo6mo ago
(I for one am not a big fan of telling people to enable beta features to fix prod problems)
Let’s wait for the new runtime to do its thing. Then, if it still fails, we can add the proxy.
KiBender
KiBenderOP6mo ago
Cool. thank you I will report back
Brody
Brody6mo ago
fwiw I have seen the v2 runtime fix flakiness
KiBender
KiBenderOP6mo ago
that would be ideal
Brody
Brody6mo ago
same with the edge proxy, and same with the v2 builder, railway is cooking
angelo
angelo6mo ago
To set expectations, I likely won’t be doing the follow up as my shift is ending (well I did but I take it personally when anyone has a bad time on here) but jumped to make sure to let you know that the team is on top as we can be here. Not fast enough, but it’s all hands on deck here.
Checking in @KiBender - no issues over your day?
Ray
Ray6mo ago
cc @Mig
Mig
Mig6mo ago
hey @KiBender, I've built the new edge proxy we're asking you to use. We've had some customers mention intermittent 503 responses, and switching to the new proxy has resolved the issue. On the topic of beta software, this proxy has been in production for several months handling our own internet properties (nixpacks.com, help.railway.app, blog.railway.app), and other users have been opting in to the new proxy over the past month with great success. Every customer we've suggested switching to it has reported no more 503 responses (application failed to respond). It also offers faster deploy times and a request id mechanism so we can determine what went wrong with your request. We plan on surfacing this information in the dashboard directly soon. If you run into any issues you can revert to the old proxy, and it will be using the proxy you're used to after 1 minute (DNS cache). You can also @ me directly on discord and I'll help right away (I'm ET timezone)
KiBender
KiBenderOP6mo ago
Hello @Angelo, no issues yesterday.
Thanks @Mig, will try to switch to the new proxy today and get back to you.
angelo
angelo6mo ago
Beautiful news, we will keep our eyes out.
Brody
Brody6mo ago
love to hear this, v2 everything is a massive improvement
KiBender
KiBenderOP6mo ago
Hello @Mig @angelo, this morning the app crashed. Railway said it had crashed, but it did not restart, even though the Restart policy in the service settings is Always. I clicked restart and it said it restarted, but the app is still down.
Mig
Mig6mo ago
it's in this state right now right ?
KiBender
KiBenderOP6mo ago
Yes. No logs in Railway and no logs in GlitchTip (we use it for errors).
angelo
angelo6mo ago
Railway
404 - Page not found
Railway is an infrastructure platform where you can provision infrastructure, develop with that infrastructure locally, and then deploy to the cloud.
KiBender
KiBenderOP6mo ago
https://gst-next.storetools.io/ The app is just loading. It won't work as a direct URL, but it should show unauthorized.
Yes, the prod environment.
angelo
angelo6mo ago
Gotcha- mind if I trigger a rebuild?
KiBender
KiBenderOP6mo ago
No worries. But I wanted to show you this before I click
angelo
angelo6mo ago
Wanna see where it's related. Do so, your uptime matters more than our debugging.
KiBender
KiBenderOP6mo ago
100% Can I redeploy now?
angelo
angelo6mo ago
Yes
KiBender
KiBenderOP6mo ago
Weirdly, this time we did not see the black screen. It was loading, then errored out. Our app runs inside Shopify. I feel this is a little bit less scary than the black Railway error.
Mig
Mig6mo ago
app is loading now
KiBender
KiBenderOP6mo ago
App is working now after the redeploy
Mig
Mig6mo ago
I believe I know what the issue was. Will dig into it.
KiBender
KiBenderOP6mo ago
I would appreciate it a lot if you can help us solve this. We are actually seeing an increase in uninstalls now.
KiBender
KiBenderOP6mo ago
Railway shows both as active
angelo
angelo6mo ago
In this case, I think it would be wise to spin up a few replicas for failover just in case. I think we know the root cause in this case and it isn't a V2 bug, just a possibly extremely unlucky circumstance.
KiBender
KiBenderOP6mo ago
Damn. Was this the same issue with v1?
angelo
angelo6mo ago
No, the switch to V2 fixed the Railway error issue that you saw. The likely reason you ran into this error was that the machine your app landed on was CPU-bound and slow to respond: a case of bad luck here.
Mig
Mig6mo ago
If you increase the replica count, your app will be placed on multiple boxes, so if one has an issue we'll route to another instance.
This wasn't an issue with the v2 proxy.
KiBender
KiBenderOP6mo ago
I can create 3 replicas, but we have cron jobs running every 24 hours. Would we need to move those out to a separate service?
angelo
angelo6mo ago
Preferably yes.
I have also compensated your account for the outages that you've faced, so that you don't have to bear the increased financial cost of the replicas. Can you confirm that you received credits on your account?
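On the cron question: with several replicas, a scheduler that runs inside the app process would start in every copy, so a daily job would run once per replica. Besides moving the jobs to a separate service, the idea can be sketched as a flag that only one service sets. `ENABLE_CRON` is a hypothetical variable name and `startScheduledJobs` a hypothetical helper, neither comes from Railway or from the app discussed here.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical flag: set ENABLE_CRON=true only on the single service
// (or single instance) that is supposed to run scheduled jobs.
function cronEnabled(env: Env): boolean {
  return env["ENABLE_CRON"] === "true";
}

// Runs `start` only when the flag is set, and reports whether it ran.
function startScheduledJobs(env: Env, start: () => void): boolean {
  if (!cronEnabled(env)) {
    return false; // web replicas skip the scheduler entirely
  }
  start();
  return true;
}
```

In the web service the call would look like `startScheduledJobs(process.env, registerJobs)`, with `registerJobs` standing in for whatever currently sets up the daily jobs.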
KiBender
KiBenderOP6mo ago
Yes I see "$ 200.00" Credits Available
angelo
angelo6mo ago
Gotcha- hopefully this holds you over in the meantime. Can you go ahead and deploy a few replicas?
KiBender
KiBenderOP6mo ago
How many do you suggest? I have never done this before
angelo
angelo6mo ago
3 should be good.
This is horizontal scaling and how you get redundancy, welcome to Scale™️
Jokes aside, this will make it so that there should always be a healthy instance to take a request. Essentially works like this:
[]  []  []
 |   |   |
  \  |  /
  [Proxy]
KiBender
KiBenderOP6mo ago
Thank you. I think I understand this. But we would still love to understand what is going on in the app
angelo
angelo6mo ago
As in? The error you faced before or what replicas do to your app.
Mig
Mig6mo ago
I want to ship to the dashboard access logs. So in this case you would see that a request was made to your app but the app failed to respond (timeout). It wouldn't tell you that the machine the app was on was having issues though.
KiBender
KiBenderOP6mo ago
I'm still not fully sure this is just a Railway glitch. I have seen "Application not responding" many times before.
- Railway was not auto-restarting before. It didn't now as well.
- No logs from our app before or today.
Either we are extremely unlucky or I see a pattern :blob_help:
angelo
angelo6mo ago
I will go ahead and answer both for you:
1. You got very unlucky with your workload placement, it didn't die, hence why you didn't see a crash, the box it was on was pinned which made responses extremely slow. So it wouldn't have restarted since your application never exited with a non-zero exit code.
1b. Runtime V2 has a known issue with logs, we are in progress to fix this issue.
2. Replicas just run copies of your server, but we essentially give you only one interface to all the instances, you don't need to change anything with the
And after yesterday, short of this issue, you weren't reporting any issue, was this still the case after the move to V2? (w.r.t. restarts)
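Since a restart is described here as depending on the process exiting with a non-zero code, one thing an app can do on its side is make sure unexpected errors are logged and then end the process instead of leaving it half alive. The sketch below assumes Node.js and uses its `uncaughtException` and `unhandledRejection` process events; `installFatalHandlers` and the `ProcessLike` interface are hypothetical names introduced only so the logic can be exercised without a real process. It would not have helped with the slow-machine case above, where the app never failed at all.

```typescript
// Minimal shape of what we need from Node's `process` object.
interface ProcessLike {
  on(event: string, handler: (reason: unknown) => void): unknown;
  exit(code: number): void;
}

// Logs any otherwise-unhandled error, then exits with code 1 so that a
// restart policy keyed on non-zero exit codes has something to react to.
function installFatalHandlers(
  proc: ProcessLike,
  log: (msg: string) => void = console.error,
): void {
  for (const event of ["uncaughtException", "unhandledRejection"]) {
    proc.on(event, (reason) => {
      const detail =
        reason instanceof Error ? reason.stack ?? reason.message : String(reason);
      log(`${event}: ${detail}`);
      proc.exit(1);
    });
  }
}
```

In the app this would be called once at startup as `installFatalHandlers(process)`, ideally with a logger that also forwards to the error tracker.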
KiBender
KiBenderOP6mo ago
yesterday I received this
KiBender
KiBenderOP6mo ago
We didn't trigger a deploy. It was the change to v2 which triggered the last deployment.
Follow-up for 1: this is not something we are seeing for the first time after the v2 change. It has been happening for some weeks now. We have been thinking this is a code issue and trying to add try/catch everywhere.
KiBender
KiBenderOP6mo ago
At least since May.
angelo
angelo6mo ago
Yea, so V2 runtime fixes the above. But you are saying you are getting random deploy crashed emails?
KiBender
KiBenderOP6mo ago
Are you sure this is because of the v1 runtime? (No logs) If that is the case and yesterday we just got unlucky, I'm happy 🙂
angelo
angelo6mo ago
Yep, your customers won't see that error page anymore.
KiBender
KiBenderOP6mo ago
Hi, is there any update on this?
KiBender
KiBenderOP6mo ago
Hello, with 3 replicas we can no longer use the DB via Postico. Anything we should do about this?
Brody
Brody6mo ago
are you using a database on railway?
KiBender
KiBenderOP6mo ago
Yes @Angelo
angelo
angelo6mo ago
Back? Or is this related to the DB
KiBender
KiBenderOP6mo ago
We cannot connect to db. What should we do? How many connections does railway allow? @T4P4N
Brody
Brody6mo ago
not a railway limitation, postgres itself has a default limitation of 100 connections
KiBender
KiBenderOP5mo ago
Hello @Angelo, we had another downtime with no errors. I reduced the replica count so I can connect to the DB with Postico, which might be why the downtime was visible. We are still not sure what conclusion to draw from this. Should we always keep 3 instances running?
Brody
Brody5mo ago
at this point I think it's plenty safe to say that this is an issue with your application itself
Adam
Adam5mo ago
how much load testing, AB testing, regression testing, etc do you do? Same goes for logging. if something is failing with no logs, that means you’re not logging enough
Ray
Ray5mo ago
You're likely creating too many connections to your database, and that gets worse with each replica because another instance of your app means $numberOfConnections * $numberOfReplicas. I'd suggest looking into connection pooling via the SQL client/ORM you're using in your app, and tracking down where you're creating connections (ideally you'd only need a handful that gets passed around; the connection object should be a singleton that gets reused across anything that requires it inside your app)
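The arithmetic above can be made concrete. This is a minimal sketch, assuming the PostgreSQL default of 100 for `max_connections` mentioned earlier in the thread and an arbitrary headroom of 10 connections kept free for admin tools such as Postico. `maxPoolSizePerReplica` and `singleton` are hypothetical helpers, not part of Prisma, Kysely, or any driver.

```typescript
// Largest pool each replica can hold so that all replicas together,
// plus some headroom for admin tools, stay under the server limit.
function maxPoolSizePerReplica(
  maxConnections: number,
  replicas: number,
  reservedForAdmin = 10, // assumed headroom, not a Postgres setting
): number {
  if (replicas < 1) {
    throw new RangeError("replicas must be >= 1");
  }
  return Math.max(1, Math.floor((maxConnections - reservedForAdmin) / replicas));
}

// Wraps a factory so the expensive object (for example a database client
// with its pool) is created once and reused by every caller.
function singleton<T>(create: () => T): () => T {
  let instance: T | undefined;
  return () => {
    if (instance === undefined) {
      instance = create();
    }
    return instance;
  };
}
```

With 3 replicas and the default limit, `maxPoolSizePerReplica(100, 3)` gives 30 connections per replica. The client factory would be wrapped once, for example `const getDb = singleton(createDbClient)` with `createDbClient` standing in for the app's own setup code, and every module would call `getDb()` instead of constructing its own client.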
KiBender
KiBenderOP5mo ago
I understand. Our app is still failing. Today it went down 2 times. It's a white screen instead of black. Anything we should look at? We still have no logs related to the failure.
Brody
Brody5mo ago
what ray said would be a very good starting point, perhaps you are opening a new database connection for every call to the database that your app makes? look into that, maybe you will need to implement pooling as he mentioned.
gazhay
gazhay5mo ago
I think this may be completely unrelated, but I have found that sometimes, without error, the internal routing between app and DB fails at some point and does not re-establish, so the front end goes down. Then when I restart/redeploy, either everything is OK or I get an internal routing error. I have just switched to the "v2" setting under "deploy" > "runtime" in the service settings as a possible solution to this, and to the need to wait a number of seconds before connecting to the DB at boot time (also routing errors), but it's too soon to say if this has been the solution. We only use one connection to the DB though, which is why I think it may be unrelated to your problem.
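Rather than waiting a fixed number of seconds before the first database connection at boot, an app can retry the connection with an increasing delay. A minimal sketch of that idea follows; `connectWithRetry`, `backoffDelayMs`, and the 500 ms base and 10 s cap are illustrative assumptions, and `connect` stands for whatever function the app uses to open its connection.

```typescript
// Exponential backoff: 500 ms, 1 s, 2 s, ... capped at `capMs`.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 10_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Tries `connect` up to `maxAttempts` times, sleeping between failures.
// Rethrows the last error if every attempt fails.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connect();
    } catch (err) {
      lastError = err;
      console.warn(`db connect attempt ${attempt + 1} failed`);
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Logging each failed attempt also leaves a trace in the deploy logs when the internal network is not ready yet, which speaks to the "no logs" problem discussed above.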