R
Railway13mo ago
ENT3I <3

Is railway down?

I can't open my railway project page and I get error 401 and database isn't ready...
79 Replies
Percy
Percy13mo ago
Project ID: N/A
Brody
Brody13mo ago
a picture is worth 1000 words
Ryan!
Ryan!13mo ago
(different customer) Our app looks healthy in the dashboard, but we're getting the error page too
Ryan!
Ryan!13mo ago
Railway
Railway is an infrastructure platform where you can provision infrastructure, develop with that infrastructure locally, and then deploy to the cloud.
ENT3I <3
ENT3I <3OP13mo ago
No description
ENT3I <3
ENT3I <3OP13mo ago
nothing loads, but the app looks healthy in my dashboard too
Ryan!
Ryan!13mo ago
My app appears to be back
ENT3I <3
ENT3I <3OP13mo ago
No description
ENT3I <3
ENT3I <3OP13mo ago
same here my app is loading again
Ryan!
Ryan!13mo ago
I came here after getting an alert that the URL for my app was down. When I checked the logs, my app showed no issues.
ENT3I <3
ENT3I <3OP13mo ago
exactly the same thing
Brody
Brody13mo ago
did you get a 503 too?
ENT3I <3
ENT3I <3OP13mo ago
I got url errors coming up in my slack, and when I checked my railway project url it didn't load at all. I got errors 401, 503 and a "database isn't ready" message
Brody
Brody13mo ago
might have been a little blip with the proxy. maybe related? https://discord.com/channels/713503345364697088/1171292427118706769/1176503689150730342 - specifically:
We're currently in the process of merging a whole bunch of networking updates
ENT3I <3
ENT3I <3OP13mo ago
looking at some error logs, I can see this in one of the errors while my app was "down":
ERROR: The DNS server returned an error, perhaps the server is offline (item 0) getaddrinfo EAI_AGAIN postgres.railway.internal
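[Editor's note: a minimal sketch, not from this thread, of how a Node app might ride out transient EAI_AGAIN lookup failures when connecting to Postgres over the private network. The env var, retry count, and backoff are illustrative assumptions, not Railway guidance.]

```ts
// Hypothetical retry wrapper for transient DNS failures (EAI_AGAIN) when
// connecting to Postgres at postgres.railway.internal. Connection string,
// retry count, and delay are assumptions for illustration only.
import { Client } from "pg";

async function connectWithRetry(retries = 5, delayMs = 2000): Promise<Client> {
  for (let attempt = 1; ; attempt++) {
    const client = new Client({ connectionString: process.env.DATABASE_URL });
    try {
      await client.connect();
      return client;
    } catch (err: any) {
      await client.end().catch(() => {});
      // EAI_AGAIN means the resolver couldn't answer right now; back off and retry.
      if (err.code === "EAI_AGAIN" && attempt < retries) {
        await new Promise((r) => setTimeout(r, delayMs * attempt));
        continue;
      }
      throw err;
    }
  }
}
```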
Brody
Brody13mo ago
i saw that too, though this is not the first time ive gotten that error
ENT3I <3
ENT3I <3OP13mo ago
But how can that affect the private network between my services and my postgres database? Does that migration affect all networks?
Brody
Brody13mo ago
its a dns lookup error for a private domain. i dont know what it affects, i know as much as what char said in that message
Ryan!
Ryan!13mo ago
Thanks for the quick response on this. 👋
ENT3I <3
ENT3I <3OP13mo ago
Yeah thanks. And yes Brody, I think the issue was on the proxy or smth. I'm seeing the same error message in a tool totally external to railway
Brody
Brody13mo ago
No description
Brody
Brody13mo ago
going to tag in @char8 here too, just for visibility
char8
char813mo ago
this wasn't anything to do with the updates, though we're looking into some weird alarms between 17:14 - 17:30 UTC
Brody
Brody13mo ago
had a feeling it wasnt related, but good to know!
char8
char813mo ago
gonna dig deeper into what happened and what it affected (looks to be DNS resolution mainly) - will create a retro incident with the times. but yeah - a proper fix for more robust DNS is in the pipe
ENT3I <3
ENT3I <3OP13mo ago
char8, what timezone is Railway based in, so I can give you exactly what time it started to fail and when it worked again?
char8
char813mo ago
UTC always works 🙏 , we had a routing propagation alarm fire 17:14-17:22 UTC [it only alerted us to it at 17:20 shortly before it resolved, so I gotta tweak something there]
ENT3I <3
ENT3I <3OP13mo ago
Ok. For me it started to fail at 17:14:50 UTC and recovered at 17:21:39. Does that make sense?
char8
char813mo ago
yep that matches perfectly thanks! that confirms what we're seeing. Looks like a network cut of some form.
ENT3I <3
ENT3I <3OP13mo ago
Ok great. Thanks
char8
char813mo ago
looks like this is recurring, gonna create an incident
ENT3I <3
ENT3I <3OP13mo ago
yep, it is happening again
Brody
Brody13mo ago
it sure is
No description
char8
char813mo ago
we isolated it to the host that ran the control plane 😞 , just went 100% on I/O and locked up the server. Gonna fast track some of the patches that Brody spoke about that will mitigate these issues. https://railway.instatus.com/clp8myu281083bhohpt28odbp
Brody
Brody13mo ago
yippee
ENT3I <3
ENT3I <3OP13mo ago
it happened again a few minutes ago
Brody
Brody13mo ago
can confirm
char8
char813mo ago
6 mins ago? now recovered right? for about 20 secs
Brody
Brody13mo ago
yep
ENT3I <3
ENT3I <3OP13mo ago
yes, for me at xx:27 (your time)
Brody
Brody13mo ago
No description
char8
char813mo ago
I've got a patch that just got approved that I want to land in the morning (it's like 1am local for me). Should put an end to these blips and de-risk anything like the 5 min outage we had earlier - will update the threads when that's out.

it looks like we're getting a disproportionate number of lookup requests from a small selection of apps, and that's causing these resource spikes as the infra gets stressed. Fixes should mitigate that.
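[Editor's note: as a hedged illustration of the kind of client-side mitigation that reduces lookup pressure (not something Railway asks for in this thread), an app that polls frequently could cache resolved addresses instead of re-resolving *.railway.internal on every request. The library choice and TTL cap below are assumptions.]

```ts
// Hypothetical client-side mitigation: cache DNS answers so a service that
// polls every minute doesn't hit the internal resolver on each request.
import http from "node:http";
import https from "node:https";
import CacheableLookup from "cacheable-lookup";

const cacheable = new CacheableLookup({ maxTtl: 60 }); // cap cached answers at 60s
cacheable.install(http.globalAgent);   // reuse cached lookups for http requests
cacheable.install(https.globalAgent);  // and for https requests
```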
ENT3I <3
ENT3I <3OP13mo ago
so the issue is caused by a small selection of apps? I hope mine is not one of them lol salute
Brody
Brody13mo ago
um
char8
char813mo ago
nah - whatever it is we're talking 1k+ rps of lookups 😅
Brody
Brody13mo ago
that would be me ...im not joking
char8
char813mo ago
1k rps? 😅 since nov 6? I mean props for load testing us
Brody
Brody13mo ago
let me do some maths. nah my bad, its not 1k rps, but i do make a batch of requests every minute
char8
char813mo ago
yep that's not gonna make a dent. Also the 1krps thing should be fine, just creaky v1 implementation on our side which needed to be more defensive. I'll update here once I test tomorrow. Might elect a subset of hosts to test on first and then rollout throughout the day.
Brody
Brody13mo ago
sounds good!
ENT3I <3
ENT3I <3OP13mo ago
thanks
char8
char813mo ago
currently testing the patched dns server on a small set of hosts - wider rollout tomorrow if it looks good overnight. It might not help with the P99s as much, but it'll hopefully eliminate those occasional blips you see
ENT3I <3
ENT3I <3OP13mo ago
hi. Not sure if it has anything to do with this, but I experienced some timeouts in my railway project
ENT3I <3
ENT3I <3OP13mo ago
No description
ENT3I <3
ENT3I <3OP13mo ago
also I tried to re-deploy a running image for a service and it took 10 minutes, so I had to abort it and re-deploy. it freezes in this window sometimes too
ENT3I <3
ENT3I <3OP13mo ago
No description
ENT3I <3
ENT3I <3OP13mo ago
I have to refresh a couple of times
char8
char813mo ago
this would be something different
ENT3I <3
ENT3I <3OP13mo ago
and then it loads
Yeti
Yeti13mo ago
funnily enough im running into this as well
char8
char813mo ago
we'll take a look
ENT3I <3
ENT3I <3OP13mo ago
thanks!
char8
char813mo ago
yep incident created - cooper had just spotted it when I landed on that channel - we're on it!
ENT3I <3
ENT3I <3OP13mo ago
Awesome. Thanks!! I guess you guys are still working on it; the dashboard seems to be working ok now, but my service url doesn't load
ENT3I <3
ENT3I <3OP13mo ago
No description
Brody
Brody13mo ago
careful with the ping replies please
ENT3I <3
ENT3I <3OP13mo ago
what do you mean?
Brody
Brody13mo ago
by default you ping when you reply to a message, just something to keep in mind
ENT3I <3
ENT3I <3OP13mo ago
seems up again now
char8
char813mo ago
this is a different issue, the dashboard one is fixed. do you have a project ID?
ENT3I <3
ENT3I <3OP13mo ago
53d90c0e-0d69-400d-8c78-aaa211f288a1
char8
char813mo ago
we show the "application failed to respond" page when the app times out on the request. I see a bunch of CPU spikes on the process, wondering if it's hitting some timeout or a large request or smth.
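[Editor's note: a minimal sketch, assuming a plain Node http server, of ending slow requests in the app itself so the edge proxy doesn't have to surface the "application failed to respond" page. The 25-second budget and 503 response are illustrative assumptions, not documented Railway limits.]

```ts
import http from "node:http";

const server = http.createServer((req, res) => {
  // Respond with a 503 if the handler hasn't finished within the budget,
  // rather than letting the edge proxy time the request out.
  const timer = setTimeout(() => {
    if (!res.headersSent) {
      res.writeHead(503, { "Content-Type": "text/plain" });
      res.end("service temporarily unavailable");
    }
  }, 25_000);
  res.on("finish", () => clearTimeout(timer));

  // ... application logic would go here ...
  res.end("ok");
});

server.listen(Number(process.env.PORT) || 3000);
```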
ENT3I <3
ENT3I <3OP13mo ago
Ok, maybe it was a coincidence and hit a large request... or smth... I'll keep an eye on it, thanks for responding and solving the issues!
char8
char813mo ago
yeah let us know if you see it again or see a pattern 🙏
ENT3I <3
ENT3I <3OP13mo ago
Sure. I will. Have a good night! 🌜

Hi. I got two logs this morning:
One at 6:45 AM UTC (getaddrinfo EAI_AGAIN postgres.railway.internal)
Another one at 7:00 AM UTC (getaddrinfo EAI_AGAIN postgres.railway.internal)
I only got these two logs, so it probably got fixed after a couple of seconds. (Posting it here just in case you want to look into something)
Brody
Brody13mo ago
you likely weren't on a host with the fixes so it would be expected that you'd still see this
char8
char813mo ago
yep, I've been running a test on a patched machine and an unpatched one - the patched one had no drops thus far, the unpatched one did around 7am UTC. so all good for wider rollout later today [mostly so I'm around to watch it once it's live]

wider rollout done ✅ , we've been running with it for about 24h. Should hopefully see far far fewer dropouts
Brody
Brody13mo ago
can confirm there have been zero dropouts, rock solid now!