Server goes down randomly throughout the day
I recently noticed that my production Railway server goes down randomly throughout the day and returns a 503 error. What's going on? Can someone take a look?
My project ID is
99e122f7-a96e-42ba-95aa-325cd3e66c82
do you have any logs for when you get the error page?
what kind of app?
no logs, I just can't reach the server so there's no logging
it's a fastapi backend
right but if your application was erroring and not responding, ideally there would be logs
all of the 0s are downtimes
It looks like it's railway that's not responding
No error from my app
that page is shown when your application doesn't respond
are these https requests? what am I looking at here?
The server sometimes works and sometimes doesn't, without any changes on my side
Yup, this is in postman
I'm calling the backend hosted on railway
do you have a custom domain?
I understand how it sounds but that does not rule out an issue with your app
it also does not rule out an issue with railway, but from experience it's more often an issue with the application
Yes I have a custom domain
How to debug if it's railway or my app? It works perfectly locally
My other friends also have uptime problems with railway and have migrated to render
unfortunately working locally does not rule out an issue with the application either
do you have the edge proxy enabled?
What's edge proxy? Is this from domain side (e.g. namecheap)
it would be in the service settings
Should I enable this?
yes, but first, you said your domain provider was namecheap?
yes
are you sure you are using the correct generated cname it gave you when you set it up?
yes, it's been assigned to this domain name for months
I'm sorry but that answer does not instill confidence, I would like to ask for confirmation
yes I'm sure
you are using the generated cname, not the auto generated domain, correct?
Yes
go ahead and enable the edge proxy
Done, what should I do next?
wait and see if you continue to have issues
When should I check back in? Just tried postman and still have the same issue
what's the state of your deployment
deployed
I'm sorry but that's not a valid state
yes, please tell me its state
What does that mean?
its deployment state
Completed?
your app has exited
this would not be a platform issue
Where do you see that the app has exited
completed
How do I fix it?
first, let me correct myself, the edge proxy is not going to help here, I had asked you to make that change without enough information from you.
second, since this is an issue with your application, I would recommend implementing error handling everywhere and verbose logging to help you narrow down the issue.
remember, railway only ever runs your code as-is, so if it's exiting that's something your app is doing, not the platform
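A minimal sketch of what "error handling everywhere and verbose logging" could look like at the process level, assuming a Python app like the FastAPI backend mentioned above. The logger name `app` and the two hook functions are made up for illustration; nothing here is Railway-specific. The idea is that any uncaught exception, and any ordinary interpreter shutdown, leaves a line in the deploy logs:

```python
import atexit
import logging
import sys

# Send everything, including DEBUG, to stdout so the platform's log viewer sees it.
logging.basicConfig(
    level=logging.DEBUG,
    stream=sys.stdout,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
log = logging.getLogger("app")  # hypothetical logger name


def log_uncaught(exc_type, exc, tb):
    """Log an exception that would otherwise end the process without a log entry."""
    log.critical(
        "uncaught exception, process is about to exit",
        exc_info=(exc_type, exc, tb),
    )


def log_exit():
    """Runs on normal interpreter shutdown, including a clean exit with code 0."""
    log.warning("process exiting")


sys.excepthook = log_uncaught   # replaces the default traceback printer
atexit.register(log_exit)       # not called if the process is killed by a signal
```

If the process is killed outright (for example by SIGKILL), neither hook runs, so a missing "process exiting" line is itself a hint about how the app went away.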
What does it mean for the app to have exited?
The app bugged out and shut down?
the app exited with a non-error code for (at this time) an unknown reason
Hmm I see, I'll look into it
thanks
I wish you the best of luck in your debugging endeavour
Is it possible it exceeded resource constraints? Is there some way to check for that?
you think your app could have exceeded 32gb of ram?
We run a 100M parameter LLM model, 32gb should be enough
what do your memory metrics look like?
Hmm goes up quite high
have you received any emails from railway that state you ran out of memory?
Nope these are the latest emails
then it doesn't seem like that's the issue
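Besides the metrics graphs and the out-of-memory emails, one way to double-check memory from inside the app is to log the process's own peak usage. This is only a sketch using Python's standard `resource` module (Unix only); the logger name and the `log_memory` helper are hypothetical, and where to call it (for example after model load) is up to the app:

```python
import logging
import resource
import sys

log = logging.getLogger("app.memory")  # hypothetical logger name


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kibibytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(tag: str) -> float:
    """Log the peak memory seen so far, labelled with where it was measured."""
    mib = peak_rss_mib()
    log.info("peak RSS at %s: %.1f MiB", tag, mib)
    return mib
```

Calling `log_memory("after model load")` once at startup and again inside a heavy request handler would show whether usage is anywhere near the plan's limit before the process disappears.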
What does completed mean? This deployment is "completed" instead of "active" but up and running, have healthy logs
Completed is when you exit using a 0 exit code. However, going to raise this to the team for investigation.
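To make the "0 exit code" point concrete, here is a small sketch in plain Python (the `exit_code_of` helper is made up for this example). It runs three tiny child processes and records how each one ends. A process that simply runs out of things to do exits with 0 just like one that calls `sys.exit(0)`, which is why a server that stops without crashing would show up as "Completed" rather than as a crash:

```python
import subprocess
import sys


def exit_code_of(snippet: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,  # keep the child's output out of our own logs
    )
    return result.returncode


# The main script finishes normally: exit code 0.
clean = exit_code_of("print('main returned, nothing left to run')")

# An explicit sys.exit(0) somewhere in the code: also exit code 0.
explicit = exit_code_of("import sys; sys.exit(0)")

# An uncaught exception: Python exits with a non-zero code.
crashed = exit_code_of("raise RuntimeError('boom')")

print(clean, explicit, crashed)
```

So when hunting for the cause, it is worth searching the codebase and its dependencies for anything that could end the server loop without raising, not only for exceptions.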
New reply sent from Help Station thread:
This thread has been escalated to the Railway team.
It does look to me like the app is restarting, if I'm reading the logs correctly, but it doesn't seem like it's due to OOM or CPU based on the metrics graphs. Can you confirm if this log line prints when the app starts: "DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): huggingface.co:443". That could also explain why the status is being updated to "Completed". As Brody suggested, I would also encourage you to add some more verbose logging and error handling to help track down the issue, even starting with a clear debug line for when the app starts so it's easier to see when/if it restarts.
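A rough sketch of the "clear debug line for when the app starts" idea, using only the Python standard library. The marker format, the `BOOT_ID` name, and the logger name are assumptions for illustration. Each process gets a short random ID at import time, so two different IDs in the logs mean the app restarted in between:

```python
import logging
import os
import uuid
from datetime import datetime, timezone

log = logging.getLogger("app.boot")  # hypothetical logger name

# Generated once per process; a new value in the logs means a new process.
BOOT_ID = uuid.uuid4().hex[:8]


def startup_line() -> str:
    """Build an easy-to-search marker identifying this process start."""
    started = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"=== APP START boot_id={BOOT_ID} pid={os.getpid()} at={started} ==="


def log_startup() -> str:
    """Emit the marker at WARNING so it shows up even with quieter log levels."""
    line = startup_line()
    log.warning(line)
    return line
```

In a FastAPI app this would typically be called once from whatever startup hook the app already uses, so the marker appears before the first request is served.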