Railway14mo ago
sdan

Restart doesn't actually restart

Seems like a service failed after it couldn't connect to a DB... i tried to restart but it never restarted. This has been an ongoing issue for a few weeks
37 Replies
Percy
Percy14mo ago
Project ID: 97046871-517d-4af1-adfa-6b493cccebc3
sdan
sdan14mo ago
97046871-517d-4af1-adfa-6b493cccebc3
i usually just get around this issue by redeploying, but my project takes 10-15 min to build, so it's sometimes an annoyance
Adam
Adam14mo ago
I'm seeing a deployment above your crashed deployment. Looks to me like your restart was successful
sdan
sdan14mo ago
the new deployment was successful yes, but i dont believe that failed container was ever restarted. i can try again sometime later and show if necessary but from the screenshots you can see it says "restart successful" but on the ui it still shows a red box. no new blue box saying its restarting ever popped up -- had to manually deploy since restart didnt work
sdan
sdan14mo ago
running into the same issue again
Adam
Adam14mo ago
Hm very odd. Is your app active? Another user reported a similar issue where their app was in the crashed status visually but was still logging
sdan
sdan14mo ago
yes -- i guess this now comes down to semantics on what restart/redeploy means... i feel like i should be able to restart a running container and not have to redeploy (build and push that image) just to restart that service
hey guys this is a pretty serious issue, our build times are unfortunately very long (20 min tops) and it takes up to 20 minutes just to get back "online"
Adam
Adam14mo ago
Why are you restarting your service that often?
On code updates you should have a deployment running with previous code that’s shut down when your new code’s healthcheck is complete
this seems like user error
sdan
sdan14mo ago
I have 100k+ users a day so it crashes our database almost every 12-18 hours. this crashes this particular instance so it shows up as "crashed"
it could be user error but i would like to just simply restart the container. meaning: delete it, run the same exact image w/ same config, and have it back up
Adam
Adam14mo ago
this definitely sounds like user error. There’s got to be better ways to get around that.
Also, with 100k+ users you should be on the teams plan
this is not a hobby project, which is what the dev plan is meant for
sdan
sdan14mo ago
not to mention I have other services on railway that simply hang and show up as "application not responding"
would be nice to have healthchecks running hourly if thats possible?
alright sounds good. i use "we" too often, sorry its just me self funding.
Adam
Adam14mo ago
Unfortunately that all sounds like user/code error. Afaik there’s no way to set up scheduled healthchecks, but if you join the teams plan you can discuss that with the team
angelo
angelo14mo ago
Hey @sdan - this is a bug on our end. With that said, is your app crashing or the DB crashing?
sdan
sdan14mo ago
db running on google cloud, i found railway cant handle some stuff so moved most of my infra elsewhere
angelo
angelo14mo ago
Like vector or? Just a scale issue?
sdan
sdan14mo ago
yea
angelo
angelo14mo ago
yea to what 😛
sdan
sdan14mo ago
yea vector db and yea scale issue 🙂
angelo
angelo14mo ago
L
sdan
sdan14mo ago
also have google cloud credits
angelo
angelo14mo ago
ok- so on your app, how many connections to the DB are you keeping open?
sdan
sdan14mo ago
8 at a time probably
angelo
angelo14mo ago
What happens when you bump that up?
sdan
sdan14mo ago
no clue honestly i just restart stuff whenever it goes down
angelo
angelo14mo ago
;-;
sdan
sdan14mo ago
there are more issues because the vector db i am using is in beta and runs into race issues all the time
angelo
angelo14mo ago
so, you may wanna increase the number of connections
actually wait can you decrease it? it will slow your app but might help with race
also do you have a link to that vector DB?
sdan
sdan14mo ago
yeah i have tried multiple things but ultimately i dont run most of my heavy workloads on railway. i just purely do reading on railway
sdan
sdan14mo ago
the AI-native open-source embedding database
angelo
angelo14mo ago
I know a guy there, we can chat
sdan
sdan14mo ago
and i have probably already chatted with that guy haha. theyre rolling out a refactor next week so hoping that will solve it
angelo
angelo14mo ago
curious, why are you still on Railway then (aside from you being an ex-employee)?
what are we doing so right even when we seem to get things wrong?
sdan
sdan14mo ago
no easy way to run flask servers honestly
i do vercel for 99% of stuff but now need to interact with python and vercel is pretty bad at it
angelo
angelo14mo ago
you mean that Google Cloud Run's 99 steps isn't easy 😉
anyway, gotcha- can you dump crash logs when the DB connections reset?
I would have a service that uses the Railway API, monitors when the DB crashes, and just performs a restart
ngl in the long term, I am going to flag the UI bug to the team
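A rough sketch of the kind of watchdog suggested above, in Python since the app is a Flask service. It polls a health check and triggers a restart after several consecutive failures. Everything named here is an assumption for illustration: `db_is_reachable` only tests that a TCP port accepts connections, and `restart` is a placeholder callback. The actual Railway API call is deliberately not shown and would need to be taken from Railway's own API documentation.

```python
import socket
import time
from typing import Callable, Optional


def db_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the database port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(
    check: Callable[[], bool],
    restart: Callable[[], None],
    failures_before_restart: int = 3,
    interval: float = 30.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `check`; call `restart` after N consecutive failures.

    `restart` is a hypothetical hook: plug in whatever actually restarts
    the service. Returns how many restarts were triggered.
    """
    consecutive_failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_restart:
                restart()
                restarts += 1
                consecutive_failures = 0  # give the service time to recover
        sleep(interval)
    return restarts
```

Requiring a few consecutive failures before restarting avoids bouncing the service on a single dropped connection. The host and port passed to `db_is_reachable` would be those of the external DB.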
sdan
sdan14mo ago
google cloud is a mess for sure but its a containable mess :). just docker up, docker down, docker remove, docker ps -a. and tailscale for networking and cloudflare for proxying. i have reliable logs, stuff never hangs, and if it does i know exactly whats up. i can check htop, etc.
railway hangs and logs stop and stuff gets silently shut off. more often than not i wake up to a text from someone saying my stuff is down and railway still shows a green box which is frustrating.
railway api monitoring a db that is not running on railway is def. not railway's job. its just about reliable logging and making sure that if something crashes, it is fully crashed.
i think i turned off notifs for crashes which i will turn back on
also as prev. mentioned, having continuous health checks would be nice
sdan
sdan14mo ago
some logs
sdan
sdan14mo ago
again this is entirely my error -- the db crashing should be handled on my end.
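One common way to handle a flaky DB connection on the app side is to retry with exponential backoff instead of letting the process crash. The sketch below is a generic illustration, not the code from this thread: `call_with_backoff` is a hypothetical helper, and it assumes the DB client surfaces outages as `ConnectionError` (a real client may raise its own exception types, which would need to be caught instead).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    op: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `op`, retrying on ConnectionError with exponential backoff.

    The delay doubles on each failed attempt (capped at `max_delay`),
    plus some random jitter so many clients do not retry in lockstep.
    The last failure is re-raised so the caller can still crash loudly.
    """
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("attempts must be at least 1")
```

Wrapping each DB query (or the connect step) in `call_with_backoff` lets the service ride out short outages, while a long outage still ends in an exception and a visible crash.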