Getting intermittent connection errors on all services connected to my uptime kuma.
Network errors? They seem to be temporary but consistent. Every minute or so. Uptime kuma can't connect to my app server and my n8n servers can't connect to uptime kuma.
Project ID: N/A
Project ID of Uptime Kuma: 7d1c4d0a-d9b2-4143-aa29-f775a92a5c6e
Do you have app sleeping on by any chance?
Where would I check that? AFAIK, our app servers shouldn't ever sleep.
I came here to report the same. We're on FastAPI on a single instance (no app sleeping) and have been getting errors all morning (for the last hour and a half).
I've only been able to replicate it on a few occasions, and retries almost always fix it.
They are intermittent. We had no code changes to our app server. If helpful, our project ID is:
1d3ab213-7e4f-45f6-a136-1326d80d0606
This issue only started happening about 2 hours ago, and happens on all services at once.
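In case it helps anyone else riding this out: a minimal sketch of the retry workaround mentioned above, assuming a Python client calling the affected service. The URL, status codes, and retry counts are placeholders, not anything specific to Railway or this thread.

```python
# Hypothetical client-side retries for the intermittent connection errors.
# The URL and retry settings below are placeholders.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                            # retry up to 3 times
    backoff_factor=0.5,                 # wait 0.5s, 1s, 2s between attempts
    status_forcelist=[502, 503, 504],   # also retry gateway-style failures
    allowed_methods=["GET", "HEAD"],
)
session.mount("https://", HTTPAdapter(max_retries=retries))

resp = session.get("https://example.up.railway.app/health", timeout=10)
print(resp.status_code)
```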
that's weird 🤔
are you guys using private network?
No
I remember this exact same issue in a thread not too long ago
https://discord.com/channels/713503345364697088/846875565357006878/1197588980653375588
well
maybe it's this?
We don't use private networking (unless there is something automatically configured for us)
Our uptime kuma connections all use the public railway urls.
We were alerted by Checkly, which we use to monitor our public customer-facing API endpoints. We thought maybe something was wrong on their end, but I have since replicated it myself on my machine.
I came here when that happened, and wondered if it was networking that we didn't control, since we haven't pushed any code to our affected service in the last week.
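A quick way to sanity-check that a monitor really is going over the public edge rather than the private network, assuming Railway's private hostnames end in .railway.internal; the hostname here is a placeholder:

```python
# Hypothetical check: is the monitored hostname a private-network address,
# and what does it resolve to publicly? The hostname is a placeholder.
import socket

host = "example.up.railway.app"

if host.endswith(".railway.internal"):
    print(f"{host} is a private-network hostname (bypasses the public proxy)")
else:
    addrs = {info[4][0] for info in socket.getaddrinfo(host, 443)}
    print(f"{host} resolves publicly to: {', '.join(sorted(addrs))}")
```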
Same. We are using the private network
seems like #🚨|incidents is the cause
nothing we can do but wait for the team to solve it
Jumping in here as well (x-posted here: https://discord.com/channels/713503345364697088/1197588560564469760/1197588560564469760), we lost 2 leads due to broken demos this morning bc of this
@David
@Ray
#🛂|readme 5) Don't ping team or conductors
There's an incident going on, wait for it to get resolved
Would appreciate a follow up
at least
Yeah, lets just wait for the incident to get resolved
I think I joined the channel before they had that readme. Didn't know
And see if your issue persists
np, just be careful
Why is this not more urgent, our whole site is down https://www.toma.so/
It is urgent. The team is working on it
locking this thread now, don’t ping the team. You’re distracting them from solving the issue
You all should be noticing connections restore.
Check now for restoration.
Still pinging down
Yep- new flood.
Updating
Yeah still down
okay- seeing connections restore on our end
will be 5-15 minutes till DNS resolves
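If you want to watch that propagation from your side, a small sketch that re-resolves the hostname periodically so you can see when the answer changes; the hostname and interval are placeholders:

```python
# Hypothetical DNS-propagation watcher: re-resolve every 30s and report changes.
import socket
import time

host = "example.up.railway.app"   # placeholder for the domain your monitors hit
last = None

while True:
    try:
        addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443)})
    except socket.gaierror as exc:
        addrs = [f"resolution failed: {exc}"]
    if addrs != last:
        print(f"{host} -> {', '.join(addrs)}")
        last = addrs
    time.sleep(30)
```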
@Dean Irwin - looking good now?
also @kylegill (kg)
yes my pager has stopped flipping.
We haven't detected any alerts since 11:50am MST (18 min ago)
Our proxy cpu usage is now back to normal- either the DDoS stopped or we swung the hammer enough times to teach them (bad actors) a lesson
Okay
Good, will update to monitoring
Ik we can’t help DDoS attacks from happening but there has to be a way to improve reliability here
I cannot and will not keep getting questions like this
Same, we're gearing up to likely switch to Porter because we are losing customer trust over things like this that we can't control
I'd rather our company get DDoS'd directly as we can do something about it
And yes, I responded to that query with only great things to say about railway, and how much time the infrastructure saves me. I am on your team.
Understood, and these are not empty words; it pains me as well when I know that this impacts your business. I am jumping on a call with the infrastructure team in 5 minutes and we are going to conduct a full retrospective. I will share what went down, and how we plan to avoid something like this in the future.
With that said: I want you to do what is best for your business and not what is best for Railway. If you need to migrate, we can help you with that, or help you configure your environments so that your production is hardened. In Toma's case, you can invite me to your Slack and we can talk next steps.
Sounds good, will do. Really appreciate your hard work
FWIW, I can almost guarantee you even more pain and equivalent outages from managing a Kubernetes cluster
Make no mistake, we're not saying the above is acceptable. Just grass is greener type stuff
We'll provide a retro on this in about 15 minutes (writing stuff up)
(Background: Envoy powers all the HTTP requests. It's being removed in favor of a more resilient proxy we built in house.)
Yeah definitely, those are my reservations about moving into K8s ^. We're a one man dev team atm (just me) so whatever guarantees a combination of minimum downtime and ease of use is what we'll go with
Sounds good. We'll talk traffic steering with the team and like I said, give us 15m to get together to retro
30 minutes to get back to you
Thanks for the transparency too. We're also on your side and in 95% of cases, Railway has worked great for us. It's just that this came at the worst timing for us as a business and now we're a bit shell-shocked
Even if you aren't, we need to know how critical your workloads are so we can best plan for that as well. I appreciate you sharing this.
Pinging down
again- gotcha
looks like it’s resolved now.
Alrighty. Here's the response:
We had a user on a custom domain create a mammoth amount of traffic which overwhelmed a small subset of boxes (aka 1)
Unfortunately y'all were the unlucky folks on that box
We've put up a PR + a monitor which will immediately page, per instance/domain/etc
This will wake someone up automatically. They now have a 1 line way to fix this, so, even in the rare rare event that this happens again, it will be resolved in less than 5 minutes
We're rebuilding the proxying layer to allow us to do fully domain-based RPS configurations per domain. This will be live in the next month or so
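For anyone curious what per-domain RPS limiting looks like conceptually, here's a rough token-bucket sketch keyed by the Host header. This is just an illustration of the idea, not Railway's actual proxy code; all limits and names are made up.

```python
# Illustrative per-domain token-bucket rate limiter, keyed by Host header.
# Numbers and names are made up; this is not Railway's implementation.
import time
from dataclasses import dataclass, field

@dataclass
class Bucket:
    rate: float       # tokens refilled per second (allowed RPS)
    capacity: float   # burst size
    tokens: float     # current tokens
    last: float = field(default_factory=time.monotonic)

def allow(buckets: dict, host: str, default_rps: float = 100.0) -> bool:
    """Return True if this request fits within the domain's RPS budget."""
    b = buckets.setdefault(host, Bucket(rate=default_rps, capacity=default_rps, tokens=default_rps))
    now = time.monotonic()
    b.tokens = min(b.capacity, b.tokens + (now - b.last) * b.rate)
    b.last = now
    if b.tokens >= 1.0:
        b.tokens -= 1.0
        return True
    return False   # over budget: reject (e.g. with a 429) instead of overwhelming the box

# One noisy domain drains only its own bucket; other domains are unaffected.
buckets: dict = {}
for _ in range(250):
    allow(buckets, "noisy.example.com")
print(allow(buckets, "noisy.example.com"))   # likely False once drained
print(allow(buckets, "quiet.example.com"))   # True
```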
Lmk if you have any questions on the above @blandthony, @Dean Irwin, or anybody else
Thank you for resolving the issue quickly by the way.
Sg, thanks for the postmortem. Is there a way to get notified when the new proxy layer is productionalized?
The changelog; we're open every Friday about all infra changes under the hood and their incoming impact. But if that's not enough we can talk about a communication plan and how to keep you all in the loop.
!remind me to update this thread in 690 hours
Got it, I will remind you to "update this thread" at Fri, 16 Feb 2024 19:23:21 GMT
!remind me to update this thread in 1358 hours
Got it, I will remind you to "update this thread" at Fri, 15 Mar 2024 15:23:59 GMT
🙂