Railway•11mo ago
Deani1232

Getting intermittent connection errors on all services connected to my uptime kuma.

Network errors? They seem to be temporary but consistent. Every minute or so. Uptime kuma can't connect to my app server and my n8n servers can't connect to uptime kuma.
55 Replies
Percy
Percy•11mo ago
Project ID: N/A
Deani1232
Deani1232OP•11mo ago
n/a
Project of Uptimekuma: 7d1c4d0a-d9b2-4143-aa29-f775a92a5c6e
Medim
Medim•11mo ago
Do you have app sleeping on by any chance?
Deani1232
Deani1232OP•11mo ago
Where would I check that? AFAIK, our app servers shouldn't ever sleep.
kylegill (kg)
kylegill (kg)•11mo ago
I came here to report the same. We're on FastAPI on a single instance (no app sleeping), and have been getting errors all morning (for the last hour and a half):
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: Cannot assign requested address
upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: immediate connect error: Cannot assign requested address
I've only been able to replicate on a few occasions, and retries almost always fix it. They are intermittent. We had no code changes to our app server. If helpful, our project ID is: 1d3ab213-7e4f-45f6-a136-1326d80d0606
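Since retries almost always clear the error, a client-side retry with backoff is one way to ride out blips like this. A minimal sketch, assuming the caller can classify the failure as transient; `TransientError` and `call_with_retries` are hypothetical names for illustration, not part of FastAPI or Railway:

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a 503 from the proxy)."""


def call_with_retries(fn, attempts=4, base_delay=0.2, max_delay=5.0, sleep=time.sleep):
    """Call fn(); on TransientError, retry with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt (0.2s, 0.4s, 0.8s, ...) up to max_delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The caller wraps its HTTP request in `fn` and raises `TransientError` for connection failures or 5xx responses. Retries only mask short blips; they do not help during a sustained outage.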
Deani1232
Deani1232OP•11mo ago
This issue only started happening about 2 hours ago, and happens on all services at once.
Medim
Medim•11mo ago
that's weird 🤔 are you guys using private network?
Deani1232
Deani1232OP•11mo ago
No
Medim
Medim•11mo ago
I remember this exact same issue in a thread not too long ago: https://discord.com/channels/713503345364697088/846875565357006878/1197588980653375588
well maybe it's this?
kylegill (kg)
kylegill (kg)•11mo ago
We don't use private networking (unless there is something automatically configured for us)
Deani1232
Deani1232OP•11mo ago
Our uptime kuma connections all use the public railway urls.
kylegill (kg)
kylegill (kg)•11mo ago
We were alerted by Checkly, which we use to monitor our public customer-facing API endpoints. We thought maybe something was wrong on their end, but I have since replicated it myself on my machine. I came here when that happened, and wondered if it was networking that we didn't control, since we haven't pushed any code to our affected service in the last week.
johns
johns•11mo ago
Same. We are using private network
Adam
Adam•11mo ago
seems like #🚨|incidents is the cause. nothing we can do but wait for the team to solve it
blandthony
blandthony•11mo ago
Jumping in here as well (x-posted here: https://discord.com/channels/713503345364697088/1197588560564469760/1197588560564469760), we lost 2 leads due to broken demos this morning bc of this
johns
johns•11mo ago
@David @Ray
Medim
Medim•11mo ago
#🛂|readme 5) Don't ping team or conductors
There's an incident going on, wait for it to get resolved
blandthony
blandthony•11mo ago
Would appreciate a follow up at least
Medim
Medim•11mo ago
Yeah, let's just wait for the incident to get resolved
johns
johns•11mo ago
I think I joined the channel before they had that readme. Didn't know
Medim
Medim•11mo ago
And see if your issue persists
np, just be careful
blandthony
blandthony•11mo ago
Why is this not more urgent, our whole site is down https://www.toma.so/
Adam
Adam•11mo ago
It is urgent. The team is working on it.
Locking this thread now, don’t ping the team. You’re distracting them from solving the issue
angelo
angelo•11mo ago
You all should be noticing connections restore. Check now for restoration.
Deani1232
Deani1232OP•11mo ago
Still pinging down
angelo
angelo•11mo ago
Yep- new flood. Updating
blandthony
blandthony•11mo ago
Yeah still down
angelo
angelo•11mo ago
okay- seeing connections restore on our end
will be 5-15 minutes till DNS resolves
@Dean Irwin - looking good now? also @kylegill (kg)
Deani1232
Deani1232OP•11mo ago
yes my pager has stopped flipping.
kylegill (kg)
kylegill (kg)•11mo ago
We haven't detected any alerts since 11:50am MST (18 min ago)
angelo
angelo•11mo ago
Our proxy cpu usage is now back to normal- either the DDoS stopped or we swung the hammer enough times to teach them (bad actors) a lesson
Okay
Good, will update to monitoring
Deani1232
Deani1232OP•11mo ago
Ik we can’t help DDoS attacks from happening but there has to be a way to improve reliability here
Deani1232
Deani1232OP•11mo ago
I can not and will not continue to keep getting questions like this
blandthony
blandthony•11mo ago
Same, we're gearing up to likely switch to Porter because we are losing customer trust over things like this that we can't control
I'd rather our company get DDoS'd directly as we can do something about it
Deani1232
Deani1232OP•11mo ago
And yes, I responded to that query with only great things to say about railway, and how much time the infrastructure saves me. I am on your team.
angelo
angelo•11mo ago
Understood, and these are not empty words: it pains me as well when I know that this impacts your business. I am jumping on the call with the infrastructure team in 5 minutes and we are going to conduct a full retrospective. I will share what went down, and how we plan to avoid something like this in the future.
With that said: I want you to do what is best for your business and not what is best for Railway. If you need to migrate, we can help you with that, or help you configure your environments so that your production is hardened. In Toma's case, you can invite me to your Slack and we can talk next steps.
blandthony
blandthony•11mo ago
Sounds good, will do. Really appreciate your hard work
JustJake
JustJake•11mo ago
FWIW I can almost guarantee you even more pain and equivalent outages for managing a kubernetes cluster
Make no mistake, we're not saying the above is acceptable. Just grass is greener type stuff
We'll provide a retro on this in about 15 minutes (writing stuff up)
(Background: Envoy powers all the HTTP requests. It's being removed in favor of a more resilient proxy we built in house)
blandthony
blandthony•11mo ago
Yeah definitely, those are my reservations about moving into K8s ^. We're a one man dev team atm (just me) so whatever guarantees a combination of minimum downtime and ease of use is what we'll go with
JustJake
JustJake•11mo ago
Sounds good. We'll talk traffic steering with the team and like I said, give us 15m to get together to retro, 30 minutes to get back to you
blandthony
blandthony•11mo ago
Thanks for the transparency too. We're also on your side and in 95% of cases, Railway has worked great for us. It's just that this came at the worst timing for us as a business and now we're a bit shell-shocked
angelo
angelo•11mo ago
Even if you aren't we need to know how critical your workloads are so we can best plan for that as well. I appreciate you sharing this.
Deani1232
Deani1232OP•11mo ago
Pinging down
angelo
angelo•11mo ago
again- gotcha
Deani1232
Deani1232OP•11mo ago
looks like it’s resolved now.
JustJake
JustJake•11mo ago
Alrighty. Here's the response:
We had a user on a custom domain create a mammoth amount of traffic, which overwhelmed a small subset of boxes (aka 1)
Unfortunately y'all were the unlucky folks on that box
We've put up a PR + a monitor which will immediately page, per instance/domain/etc
This will wake someone up automatically. They now have a 1 line way to fix this, so even in the rare rare event that this happens again, it will be resolved in less than 5 minutes
We're rebuilding the proxying layer to allow us to do fully domain-based RPS configurations per domain. This will be live in the next month or so
Lmk if you have any questions on the above @blandthony, @Dean Irwin, or anybody else
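For readers wondering what per-domain RPS limiting can look like, here is a minimal token-bucket sketch. It is an illustration of the general technique only, not Railway's proxy: `DomainRateLimiter`, its parameters, and the example domains are all hypothetical, and the burst size is assumed to equal one second's worth of requests.

```python
import time


class DomainRateLimiter:
    """Token-bucket limiter keyed by domain (illustrative; not Railway's proxy)."""

    def __init__(self, default_rps, overrides=None, clock=time.monotonic):
        self.default_rps = float(default_rps)
        self.overrides = dict(overrides or {})  # hypothetical per-domain RPS overrides
        self.clock = clock
        self.buckets = {}  # domain -> (tokens, last_refill_time)

    def allow(self, domain):
        """Return True if a request for this domain fits within its budget."""
        rps = float(self.overrides.get(domain, self.default_rps))
        now = self.clock()
        # A domain seen for the first time starts with a full bucket.
        tokens, last = self.buckets.get(domain, (rps, now))
        # Refill in proportion to elapsed time, capped at one second's worth.
        tokens = min(rps, tokens + (now - last) * rps)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[domain] = (tokens, now)
        return allowed
```

Because each domain has its own bucket, one domain exhausting its budget gets rejected without consuming capacity that other domains on the same box would otherwise use.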
Deani1232
Deani1232OP•11mo ago
Thank you for resolving the issue quickly by the way.
blandthony
blandthony•11mo ago
Sg, thanks for the postmortem. Is there a way to get notified when the new proxy layer is productionalized?
angelo
angelo•11mo ago
Changelog, as we are open about all infra changes under the hood every Friday and incoming impact, but if that's not enough we can talk about a communication plan and how to keep you all in the loop.
JustJake
JustJake•11mo ago
!remind me to update this thread in 690 hours
Duchess
Duchess•11mo ago
Got it, I will remind you to update this thread at Fri, 16 Feb 2024 19:23:21 GMT
JustJake
JustJake•11mo ago
!remind me to update this thread in 1358 hours
Duchess
Duchess•11mo ago
Got it, I will remind you to update this thread at Fri, 15 Mar 2024 15:23:59 GMT
JustJake
JustJake•11mo ago
🙂