Getting intermittent connection errors on all services connected to my uptime kuma.
Network errors? They seem to be temporary but consistent. Every minute or so. Uptime kuma can't connect to my app server and my n8n servers can't connect to uptime kuma.
Project ID: N/A
Project ID of Uptime Kuma: 7d1c4d0a-d9b2-4143-aa29-f775a92a5c6e
Do you have app sleeping on by any chance?
Where would I check that? AFAIK, our app servers shouldn't ever sleep.
I came here to report the same. We're on FastAPI on a single instance (no app sleeping) and have been getting errors all morning (for the last hour and a half).
I've only been able to replicate it on a few occasions, and retries almost always fix it.
They are intermittent. We had no code changes to our app server. If helpful, our project ID is:
1d3ab213-7e4f-45f6-a136-1326d80d0606
This issue only started happening about 2 hours ago, and happens on all services at once.
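In case it helps anyone else riding this out: a minimal sketch of the retry workaround mentioned above, assuming a Python client calling the affected service. The URL, status codes, and retry counts are placeholders, not anything specific to Railway or this thread.

```python
# Hypothetical client-side retries for the intermittent connection errors.
# The URL and retry settings below are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                            # retry up to 3 times
    backoff_factor=0.5,                 # wait 0.5s, 1s, 2s between attempts
    status_forcelist=[502, 503, 504],   # also retry gateway-style failures
    allowed_methods=["GET", "HEAD"],
)
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.get("https://example.up.railway.app/health", timeout=10)
print(resp.status_code)
```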
that's weird 🤔
are you guys using private network?
No
I remember this exact same issue in a thread not too long ago
https://discord.com/channels/713503345364697088/846875565357006878/1197588980653375588
well
maybe it's this?
We don't use private networking (unless there is something automatically configured for us)
Our uptime kuma connections all use the public railway urls.
We were alerted by Checkly, which we use to monitor our public customer-facing API endpoints. We thought maybe something was wrong on their end, but I have since replicated it myself on my machine.
I came here when that happened, and wondered if it was networking that we didn't control, since we haven't pushed any code to our affected service in the last week.
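A quick way to sanity-check that a monitor really is going over the public edge rather than the private network, assuming Railway's private hostnames end in .railway.internal; the hostname here is a placeholder:

```python
# Hypothetical check: is the monitored hostname a private-network address,
# and what does it resolve to publicly? The hostname is a placeholder.
import socket

host = "example.up.railway.app"

if host.endswith(".railway.internal"):
    print(f"{host} is a private-network hostname (bypasses the public proxy)")
else:
    addrs = {info[4][0] for info in socket.getaddrinfo(host, 443)}
    print(f"{host} resolves publicly to: {', '.join(sorted(addrs))}")
```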
Same. We are using the private network
seems like #🚨|incidents is the cause
nothing we can do but wait for the team to solve it
Jumping in here as well (x-posted here: https://discord.com/channels/713503345364697088/1197588560564469760/1197588560564469760), we lost 2 leads due to broken demos this morning bc of this
@David
@Ray
#🛂|readme 5) Don't ping team or conductors
There's an incident going on, wait for it to get resolved
Would appreciate a follow up
at least
Yeah, lets just wait for the incident to get resolved
I think I joined the channel before they had that readme. Didn't know
And see if your issue persists
np, just be careful
Why is this not more urgent, our whole site is down https://www.toma.so/
It is urgent. The team is working on it
locking this thread now, don’t ping the team. You’re distracting them from solving the issue
You all should be noticing connections restore.
Check now for restoration.
Still pinging down
Yep- new flood.
Updating
Yeah still down
okay- seeing connections restore on our end
will be 5-15 minutes till DNS resolves
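If you want to watch that propagation from your side, a small sketch that re-resolves the hostname periodically so you can see when the answer changes; the hostname and interval are placeholders:

```python
# Hypothetical DNS-propagation watcher: re-resolve every 30s and report changes.
import socket
import time

host = "example.up.railway.app"   # placeholder for the domain your monitors hit
last = None

while True:
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443)})
    except socket.gaierror as exc:
        addrs = [f"resolution failed: {exc}"]
    if addrs != last:
        print(f"{host} -> {', '.join(addrs)}")
        last = addrs
    time.sleep(30)
```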
@Dean Irwin - looking good now?
also @kylegill (kg)
yes my pager has stopped flipping.
We haven't detected any alerts since 11:50am MST (18 min ago)
Our proxy cpu usage is now back to normal- either the DDoS stopped or we swung the hammer enough times to teach them (bad actors) a lesson
Okay
Good, will update to monitoring
Ik we can’t help DDoS attacks from happening but there has to be a way to improve reliability here
I cannot and will not keep getting questions like this
Same, we're gearing up to likely switch to Porter because we are losing customer trust over things like this that we can't control
I'd rather our company get DDoS'd directly as we can do something about it
And yes, I responded to that query with only great things to say about railway, and how much time the infrastructure saves me. I am on your team.
Understood, and these are not empty words; it pains me as well when I know that this impacts your business. I am jumping on a call with the infrastructure team in 5 minutes and we are going to conduct a full retrospective. I will share what went down, and how we plan to avoid something like this in the future.
With that said: I want you to do what is best for your business and not what is best for Railway. If you need to migrate, we can help you with that, or help you configure your environments so that your production is hardened. In Toma's case, you can invite me to your Slack and we can talk next steps.
Sounds good, will do. Really appreciate your hard work
FWIW, I can almost guarantee you even more pain and equivalent outages from managing a Kubernetes cluster
Make no mistake, we're not saying the above is acceptable. Just grass is greener type stuff
We'll provide a retro on this in about 15 minutes (writing stuff up)
(Background: Envoy powers all the HTTP requests. It's being removed in favor of a more resilient proxy we built in house.)
Yeah definitely, those are my reservations about moving into K8s ^. We're a one man dev team atm (just me) so whatever guarantees a combination of minimum downtime and ease of use is what we'll go with
Sounds good. We'll talk traffic steering with the team and like I said, give us 15m to get together to retro
30 minutes to get back to you
Thanks for the transparency too. We're also on your side and in 95% of cases, Railway has worked great for us. It's just that this came at the worst timing for us as a business and now we're a bit shell-shocked
Even if you aren't, we need to know how critical your workloads are so we can best plan for that as well. I appreciate you sharing this.
Pinging down
again- gotcha
looks like it’s resolved now.
Alrighty. Here's the response:
We had a user on a custom domain create a mammoth amount of traffic which overwhelmed a small subset of boxes (aka 1)
Unfortunately y'all were the unlucky folks on that box
We've put up a PR + a monitor which will immediately page, per instance/domain/etc
This will wake someone up automatically. They now have a 1 line way to fix this, so, even in the rare rare event that this happens again, it will be resolved in less than 5 minutes
We're rebuilding the proxying layer to allow us to do fully domain-based RPS configurations per domain. This will be live in the next month or so
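For anyone curious what per-domain RPS limiting looks like conceptually, here's a rough token-bucket sketch keyed by the Host header. This is just an illustration of the idea, not Railway's actual proxy code; all limits and names are made up.

```python
# Illustrative per-domain token-bucket rate limiter, keyed by Host header.
# Numbers and names are made up; this is not Railway's implementation.
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    rate: float       # tokens refilled per second (allowed RPS)
    capacity: float   # burst size
    tokens: float     # current tokens
    last: float = field(default_factory=time.monotonic)

def allow(buckets: dict, host: str, default_rps: float = 100.0) -> bool:
    """Return True if this request fits within the domain's RPS budget."""
    b = buckets.setdefault(host, Bucket(rate=default_rps, capacity=default_rps, tokens=default_rps))
    now = time.monotonic()
    b.tokens = min(b.capacity, b.tokens + (now - b.last) * b.rate)
    b.last = now
    if b.tokens >= 1.0:
        b.tokens -= 1.0
        return True
    return False   # over budget: reject (e.g. with a 429) instead of overwhelming the box

# One noisy domain drains only its own bucket; other domains are unaffected.
buckets: dict = {}
for _ in range(250):
    allow(buckets, "noisy.example.com")
print(allow(buckets, "noisy.example.com"))   # likely False once drained
print(allow(buckets, "quiet.example.com"))   # True
```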
Lmk if you have any questions on the above @blandthony, @Dean Irwin, or anybody else
Thank you for resolving the issue quickly by the way.
Sg, thanks for the postmortem. Is there a way to get notified when the new proxy layer is productionalized?
The changelog; we're open every Friday about all infra changes under the hood and their incoming impact. But if that's not enough we can talk about a communication plan and how to keep you all in the loop.
!remind me to update this thread in 690 hours
Got it, I will remind you to "update this thread" at Fri, 16 Feb 2024 19:23:21 GMT
!remind me to update this thread in 1358 hours
Got it, I will remind you to "update this thread" at Fri, 15 Mar 2024 15:23:59 GMT
🙂