We need serious help to continue using railway. Server is halting often and we are not sure why.
Every so often we see "Application Failed to Respond".
There is nothing about a crash or failure in the Railway logs. We are not sure the app actually failed.
We do not see any logs explaining why this is failing, nor are there logs of the server restarting. We have a log line for server start:
Server running on PORT ${PORT}
But this is missing when the app is back up as well.
RAM and CPU usage seem normal just before the app goes down.
PLEASE help us. We have tried everything, sprinkled logs everywhere, added try/catch everywhere. At this point the only move left on our end is to try a different host.
Just in the last few hours the app went off 2 times and it takes a while to come back up
Project id: 6d0f799e-be59-4388-899d-f00456f30667
I have a post for this already: https://discord.com/channels/713503345364697088/1246339314829492294/1246339314829492294
are you using the legacy or the v2 runtime?
How do I check? Maybe. This is an old app
in the service settings
Yes
switch to v2
Will this fix the issue?
I make absolutely no promises
If the server is restarting I should see "Server running on PORT ${PORT}" again right?
That is not happening. The logs just disappear, then after a while it comes back up.
yes that is correct, that's the behaviour you should see if the app is restarting, but with the information I have this seems like your app is locking up.
try the v2 runtime and report back
Can you explain what is "locking up"?
soft locking, Google could explain the term better than I could
Okay got it. is there something we should do in the app level to avoid this?
I wouldn't be able to tell you that, there's a million different things that can cause code to lock up, but definitely try the v2 runtime!
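For readers unfamiliar with the term: in a Node.js app, one common form of "locking up" is a blocked event loop, where the process stays alive (so there is no crash and no restart log) but stops answering requests. Below is a minimal sketch of one way to make that visible in logs. It assumes a Node.js server; `computeLagMs`, `startLagMonitor`, and the thresholds are hypothetical names and values, not anything from this thread.

```typescript
// Hypothetical helper: how much later than planned a timer fired, in ms.
function computeLagMs(scheduledAt: number, firedAt: number, intervalMs: number): number {
  return Math.max(0, firedAt - scheduledAt - intervalMs);
}

// Starts a timer that should fire every `intervalMs`. If it fires late by more
// than `warnAfterMs`, something kept the event loop busy (or the host was
// starved of CPU), so a warning is logged. Returns a function that stops it.
function startLagMonitor(
  intervalMs: number = 1000,
  warnAfterMs: number = 200,
  log: (message: string) => void = console.warn,
): () => void {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = computeLagMs(last, now, intervalMs);
    if (lag > warnAfterMs) {
      log(`event loop lag: ${lag}ms over a ${intervalMs}ms interval`);
    }
    last = now;
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `startLagMonitor()` once at startup, next to the existing server-start log, would be the simplest way to try this. The warning can only be written once the loop is free again, so a permanent lock-up would still show up as silence rather than as a log line.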
Hey there @KiBender - I know you’ve raised support questions in the past and I am sorry to continually make you rehash information. Are you perchance using Prisma or managing a large amount of DB connections?
There are a few things in flight that we are flighting to fix stability on the platform and I don’t wanna rule out Railway but wanna make sure I am able to gather the properties of your app.
(Such as V2 Runtime and V2 Proxy)
Yes, we are using Prisma. But we switched to a new database using Kysely, on your team's suggestion, because of heavy memory usage in Prisma.
Prisma for app settings
kysely for data
both separate DBs
Fair and noted, do you see the restarts on load or just randomly?
Randomly. We've been trying to correlate the last logs and work backwards from there, but honestly every time it's different, and the errors are already handled.
Also for the DB connections, are you using the Internal network? (I think you are)
If random then I have a strong suspicion that the new runtime would help then.
"monorail.proxy.rlwy.net" I think sometime in the past to debug this we moved to the non internal network
and
viaduct.proxy.rlwy.net
Both are external URLs, right?
I would suggest you move to the internal network so you don’t get hit with egress charges. Also, we control that network, not GCP, and we have been continually mitigating public connection issues. (Knock on wood, none more yet, but you never know.)
I will do that, will try to switch to internal urls
(and the v2 runtime)
If your application depends on Postgres connections and you get random disconnects, your connection pool will very likely be exhausted.
Already redeployed with v2 as well
It’s 2 AM my time but if you can, keep us updated- we have the whole logistics team on standby addressing customer issues this whole week. We know about the flakiness and we are addressing every issue as quickly as we can dispatch them.
!t
This thread has been escalated to the Railway team.
Thank you both of you
Melissa is our on-call this week, it’s very likely the person from the Railway team who will respond will be her.
Thank you for your patience even among your frustration.
- We know the app is only failing during the daytime for us. Our app users are all from my country, so it's when people are using it
- We have a fairly large DB which is updated often. We run a Shopify app and we have to sync every order update and delete with webhooks. We use the Kysely query builder for this
- We have 2 dbs for data and settings
- We have zero logs about why its failing
Doesn’t seem that random then if there is load. However, my gut is telling me those connection drops are exhausting the mem pool.
But ram usage doesn't look like there is much load
However, the connection drop might be- nothing concrete. I think we would also attempt to try the new proxy as well.
Random in the sense, i cannot identify something that is causing this based on the logs.
Understood, thats frustrating.
how do I try the new proxy?
Edge Proxy Beta in service settings
Should I turn it on or see if it fails with the new runtime first?
(I for one am not a big fan of telling people to enable beta features to fix prod problems)
Let’s wait for the new runtime to do its thing
Then if so, we can add proxy.
Cool. thank you
I will report back
fwiw I have seen the v2 runtime fix flakiness
that would be ideal
same with the edge proxy, and same with the v2 builder, railway is cooking
To set expectations, I likely won’t be doing the follow up as my shift is ending (well I did but I take it personally when anyone has a bad time on here) but jumped to make sure to let you know that the team is on top as we can be here.
Not fast enough, but it’s all hands on deck here.
Checking in @KiBender - no issues over your day?
cc @Mig
hey @KiBender, I've built the new edge proxy we're asking you to use. We've had some customers mention intermittent 503 responses, and switching to the new proxy has resolved the issue.
On the topic of beta software, this proxy has been in production for several months handling our own internet properties (nixpacks.com, help.railway.app, blog.railway.app), and other users have been opting in to the new proxy over the past month with great success. Every customer we've suggested switching has reported no more 503 responses (application failed to respond). It also offers faster deploy times and a request id mechanism so we can determine what went wrong with your request. We plan on surfacing this information in the dashboard directly soon.
If you run into any issues, you can revert to the old proxy, and you will be back on the proxy you're used to after 1 minute (DNS cache). You can also @ me directly on Discord and I'll help right away (I'm in the ET timezone)
Hello @Angelo No issues yesterday.
Thanks @Mig. Will try to switch to the new proxy today and revert back
Beautiful news, we will keep our eyes out.
love to hear this, v2 everything is a massive improvement
Hello @Mig @angelo Today morning the app crashed. It was saying it crashed in railway.
But it did not restart. Even though in the service settings Restart policy is Always
I clicked restart it said it restarted but the app is still down
it's in this state right now right ?
Yes
No logs in Railway and no logs in GlitchTip (we use it for errors)
https://gst-next.storetools.io/
The app is just loading.
It won't work as a direct URL, but it should show "unauthorized"
Yes
The prod environment
Gotcha- mind if I trigger a rebuild?
No worries. But I wanted to show you this before I click
Wanna see where it's related.
Do so, your uptime matters more than our debugging.
100%
Can I redeploy now?
Yes
Weirdly, this time we did not see the black screen. It was loading, then errored out. Our app runs inside Shopify. I feel this is a little bit less scary than the black Railway error
app is loading now
App is working now after the redeploy
I believe I know what the issue was. Will dig into it.
I would appreciate it a lot if you can help us solve this. We actually see increase in uninstalls now
Railway shows both as active
In this case, I think it would be wise to spin up a few replicas for failover just in case.
I think we know the root cause in this case and it isn't a V2 bug, just a possibly extremely unlucky circumstance.
Damn. Was this the same issue with v1?
No, the switch to V2 fixed the Railway error issue that you saw. The (edit: likely) reason you ran into this error was that the machine your app landed on was CPU-bound and slow to respond, a case of bad luck.
If you increase the replica count your app will be placed on multiple boxes so if 1 has an issue we'll route to another instance
this wasn't an issue with the v2 proxy.
I can create 3 replicas, but we have cron jobs running every 24 hours. Would we need to move those out to a separate service?
Preferably yes
I have also compensated your account for the outages that you've faced and for you to not have to bear the increased financial cost of the replicas on your account.
Can you confirm that you received credits on your account?
Yes I see "$ 200.00" Credits Available
Gotcha- hopefully this holds you over in the meantime. Can you go ahead and deploy a few replicas?
How many do you suggest?
I have never done this before
3 should be good
This is horizontal scaling and how you get redundancy, welcome to Scale™️
Jokes aside, this will make it so that there should always be a healthy instance to take a request.
Essentially works like this
[]  []  []
 |   |   |
  \  |  /
  [Proxy]
Thank you. I think I understand this. But we would still love to understand what is going on in the app
As in?
The error you faced before or what replicas do to your app.
I want to ship to the dashboard access logs. So in this case you would see that a request was made to your app but the app failed to respond (timeout). It wouldn't tell you that the machine the app was on was having issues though.
I'm still not fully sure this is just a railway glitch up. I have seen "Application not responding" many times before.
- Railway was not auto restarting before. It didn't now as well.
- No logs from our app before and today.
Either we are extremely unlucky or I see a pattern :blob_help:
I will go ahead and answer both for you:
1. You got very unlucky with your workload placement. It didn't die, which is why you didn't see a crash; the box it was on was pinned, which made responses extremely slow. It wouldn't have restarted, since your application never exited with a non-zero exit code.
1b. Runtime V2 has a known issue with logs, and we are working on a fix.
2. Replicas just run copies of your server, but we essentially give you only one interface to all the instances, you don't need to change anything with the
And after yesterday, short of this issue, you weren't reporting any issue. Was this still the case after the move to V2? (w.r.t. restarts)
yesterday I received this
We didn't trigger a deploy
It was the change to v2 which triggered the last deployment
Follow up for 1
We are not seeing this for the first time after the v2 change. This has been happening for some weeks now. We have been thinking this is a code issue and trying to add try/catch everywhere
At least since May.
Yea, so V2 runtime fixes the above. But you are saying you are getting random deploy crashed emails?
Are you sure this is because of v1 runtime? (No logs)
If that is the case and yesterday we got unlucky. I'm happy 🙂
Yep, your customers won't see that error page anymore.
Hi, is there any update on this?
Hello, with 3 replicas we can no longer use the DB via Postico. Anything we should do about this?
are you using a database on railway?
Yes
@Angelo
Back?
Or is this related to the DB
We cannot connect to db. What should we do?
How many connections does railway allow?
@T4P4N
not a railway limitation, postgres itself has a default limitation of 100 connections
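To put that limit in context: every open connection from every client counts against the same server-wide cap, including each replica's pool and a desktop client such as Postico. Below is a rough sketch of the arithmetic. All names are hypothetical and the pool sizes are example numbers, not values taken from this app.

```typescript
interface ConnectionPlan {
  replicas: number;        // app instances running at once
  poolsPerReplica: number; // pools in each instance that point at this database
  maxPerPool: number;      // upper bound each pool may open
  reserved: number;        // headroom for admin tools, migrations, cron jobs
  serverMax: number;       // Postgres max_connections (100 by default)
}

// Worst case: every pool in every replica is fully open at the same time.
function worstCaseConnections(plan: ConnectionPlan): number {
  return plan.replicas * plan.poolsPerReplica * plan.maxPerPool;
}

function fitsBudget(plan: ConnectionPlan): boolean {
  return worstCaseConnections(plan) + plan.reserved <= plan.serverMax;
}

// Largest per-pool limit that still leaves the reserved headroom free.
function largestSafePoolSize(plan: Omit<ConnectionPlan, "maxPerPool">): number {
  const available = plan.serverMax - plan.reserved;
  return Math.max(0, Math.floor(available / (plan.replicas * plan.poolsPerReplica)));
}
```

With the example numbers of 3 replicas, one pool each, and 5 connections held back, a per-pool limit of 40 would overshoot a 100-connection server, while 31 would fit. The numbers that matter are the pool limits actually configured in the app's database clients.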
Hello @Angelo we had another downtime with no errors. I reduced the replica count so I can connect to db with postico. Which might be why the downtime was visible. We are still not sure how to make a conclusion out of this. Should we always keep 3 instances running?
at this point I think it's plenty safe to say that this is an issue with your application itself
how much load testing, AB testing, regression testing, etc do you do?
Same goes for logging.
if something is failing with no logs, that means you’re not logging enough
You're likely creating too many connections to your database, and that gets worse with each replica because another instance of your app means $numberOfConnections * $numberOfReplicas. I'd suggest looking into connection pooling via the SQL client/ORM you're using in your app, and tracking down where you're creating connections (ideally you'd only need a handful that gets passed around; the connection object should be a singleton that gets reused across anything that requires it inside your app)
I understand. Our app is still failing. Today it went down 2 times. It's a white screen instead of black.
Anything we should look at?
We have no logs related to the failure still
what ray said would be a very good starting point, perhaps you are opening a new database connection for every call to the database that your app makes? look into that, maybe you will need to implement pooling as he mentioned.
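One way to picture the "singleton that gets reused" advice: create the pool once and hand the same instance to every caller, instead of constructing a client inside each request handler or webhook. The sketch below uses a stand-in `FakePool` class so that it runs on its own. In a real app the factory would construct the actual pool or client (for example node-postgres's `Pool` or a `Kysely` instance); `lazySingleton`, `FakePool`, `getPool`, and the handler names are all hypothetical.

```typescript
// Wraps a factory so it runs at most once; later calls return the cached value.
function lazySingleton<T>(factory: () => T): () => T {
  let instance: T | undefined;
  let created = false;
  return () => {
    if (!created) {
      instance = factory();
      created = true;
    }
    return instance as T;
  };
}

// Stand-in for a real connection pool; counts how many were constructed.
class FakePool {
  static constructed = 0;
  readonly max: number;
  constructor(max: number) {
    this.max = max;
    FakePool.constructed += 1;
  }
}

// One shared accessor for the whole process.
const getPool = lazySingleton(() => new FakePool(10));

// Reuses the shared pool on every call.
function handleRequest(): FakePool {
  return getPool();
}

// Anti-pattern for comparison: a brand-new pool per call.
function handleRequestBadly(): FakePool {
  return new FakePool(10);
}
```

If the app has any code path shaped like `handleRequestBadly`, each webhook or request opens its own set of connections, which would be consistent with the database refusing new clients once replicas were added.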
I think this may be completely unrelated, but I have found that sometimes, without error, the internal routing between app and DB has failed at some point and does not re-establish so the front end goes down.
Then when I restart/redeploy either everything is ok, or I do then get an internal routing error.
I have just switched to the "v2" setting of "deploy" > "runtime" in the service settings as a possible solution to this, and to the need to wait a number of seconds before connecting to the DB at boot time (also routing errors), but it's too soon to say if this has been the solution.
We only use one connection to the DB though which is why I think it may be unrelated to your problem.