Hey I think someone is attacking our DOs

DanTheGoodmanOP • 15mo ago
Anyone have ideas on how to verify this? We haven't deployed any code for months on the DOs or anything that connects to them, IIRC, and certainly not in the last ~40 minutes. I can't even tell which DO it's from.
DanTheGoodmanOP • 15mo ago
The log isn't particularly helpful either...
DanTheGoodmanOP • 15mo ago
There isn't even an exception in the exception log.
DanTheGoodmanOP • 15mo ago
Oh yeah, that's weird... it had to be an attack.
(screenshot)
DanTheGoodmanOP • 15mo ago
Happened again in a burst. It doesn't seem to be a single object, so I might guess it's an infra issue on CF's side?
DanTheGoodmanOP • 15mo ago
Added the ID of the object to the logs; they're not all the same, very high cardinality.
(screenshot)
DanTheGoodmanOP • 15mo ago
We're getting hundreds of thousands of these, tens per second.
DanTheGoodmanOP • 15mo ago
(screenshot)
Milan • 15mo ago
Can you DM me your account ID? I can take a look to see if something is weird.
DanTheGoodmanOP • 15mo ago
Yep, DM'd. Discord might have suppressed the DM notification, @milan_cf.
DanTheGoodmanOP • 15mo ago
Had another burst a bit ago too.
(screenshot)
Milan • 15mo ago
I got the DM, looking. This started about an hour ago, 18:30 UTC or so? Ah no, I see: from 17:50 onwards there's definitely an increase in the number of invocations.
DanTheGoodmanOP • 15mo ago
Yeah, just before then. And there's no reason any individual Durable Object should be getting even within 20% of that connection limit. The largest we know of would be 7k at peak.
Unknown User • 15mo ago
(message not public)
DanTheGoodmanOP • 15mo ago
Alerts: Cloudflare Logpush into a Grafana stack.
DanTheGoodmanOP • 15mo ago
Another one:
(screenshot)
Milan • 15mo ago
Yeah, I see a linear increase in the number of WebSocket connections to some of your DOs.
DanTheGoodmanOP • 15mo ago
How high is the cardinality? It seemed pretty high from the logs when I added the IDs to the URLs in the Workers (as a query param). I also checked Twitch, and none of the four people above that viewer count are using our tools.
Milan • 15mo ago
Monitoring of WS hibernation isn't really good enough to know how many DO instances were hitting that limit. I can see there were about 900 instances between 17:50 UTC and 20:00 UTC (so the last 2 hours), most of which received very few messages; a bit over 700 of them received 10 or fewer.
DanTheGoodmanOP • 15mo ago
That's expected.
Milan • 15mo ago
How many connections are you generally expecting to get to a single DO instance? How are you connecting to the DO?
DanTheGoodmanOP • 15mo ago
@milan_cf typically <100, but some can be in the thousands. We're connecting through browser WebSockets.
DanTheGoodmanOP • 15mo ago
Ah, more:
(screenshot)
DanTheGoodmanOP • 15mo ago
@milan_cf any luck? It's pretty constant right now.
Milan • 15mo ago
We don't think the issue is with our infrastructure. There's a significant increase in invocations to your Durable Object namespace starting at 17:50 UTC, and it's been sustained for a while. It's likely that someone is opening a lot of WebSocket connections to your DOs and forcing you to hit the connection limit.
DanTheGoodmanOP • 15mo ago
Gotcha, so perhaps an attack then, because we pushed no code that modifies how we connect to clients. Checking the pretty unhelpful exception logs, I do see the requests are quite spread out across the US and EU, but with a large number coming from eastern EU. Is there any way to check that? Logpush doesn't give us that info, and this exception happens before our code runs, it seems. Unless maybe we can get that info in the Worker before connecting to the DO?
Milan • 15mo ago
"this exception happens before our code runs, it seems"
You mean Logpush or the DO code? This exception should be from acceptWebSocket() throwing in the DO.
DanTheGoodmanOP • 15mo ago
See this: there's no exception in the exceptions array. Maybe that's not the right exception, but then what is that exception, lol.
Milan • 15mo ago
Where did you get the 32k connection limit exception then?
DanTheGoodmanOP • 15mo ago
Logpush. That message from the screenshot is literally all Logpush was sending, so I looked into the function call logs for more info.
DanTheGoodmanOP • 15mo ago
(screenshot)
DanTheGoodmanOP • 15mo ago
It also comes in waves, completely uncharacteristic of any behavior from normal users of our app.
DanTheGoodmanOP • 15mo ago
And it seems a bit too constant to be our users.
(screenshot)
Milan • 15mo ago
Not sure why there's no exception there... I can ask around tomorrow (everyone is currently out for the day). I still think this is some sort of attack, but we definitely need to improve our hibernatable WS monitoring. It's probably worth wrapping acceptWebSocket in a try/catch, or tracking how many WebSockets you have connected and refusing to allow more to connect when you're near the limit (to avoid errors).
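Something like this for the tracking approach (a rough, untested sketch, not a drop-in; the MAX_SOCKETS name and margin are made up, 32,768 is the per-object limit):
// Inside the DO's fetch handler, before upgrading:
const MAX_SOCKETS = 30_000; // arbitrary safety margin below the 32,768 hard limit
if (this.state.getWebSockets().length >= MAX_SOCKETS) {
  // Refuse cleanly instead of letting acceptWebSocket() throw at the limit.
  return new Response("connection limit reached", { status: 503 });
}
const pair = new WebSocketPair();
this.state.acceptWebSocket(pair[1]);
return new Response(null, { status: 101, webSocket: pair[0] });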
DanTheGoodmanOP • 15mo ago
@milan_cf that's what I just pinged my team that we're going to do (try/catch that and log our own error). @.hades32 FYI (my team): just added some try/catch and additional logging.
DanTheGoodmanOP • 15mo ago
Ugh, perfect timing for it to stop...
(screenshot)
DanTheGoodmanOP • 15mo ago
Nothing in the logs still... trying to return a valid response to see if the exceptions go away.
DanTheGoodmanOP • 15mo ago
Doesn't seem to be that, @milan_cf, as these errors are still blank.
(screenshot)
DanTheGoodmanOP • 15mo ago
I don't know what these exceptions even are...
try {
  this.state.acceptWebSocket(pair[1], [t]);
} catch (error) {
  // Note: JSON.stringify on an Error or a Headers object yields "{}",
  // so log the error as a string and spread the headers into a plain object.
  console.error("error accepting websocket:", String(error), JSON.stringify(Object.fromEntries(request.headers)));
  throw error;
}
This doesn't seem to be firing, so I think those exceptions are unrelated. We aren't getting the connection limit log right now, so that seems to be a second issue we've found, @milan_cf.
Milan • 15mo ago
Probably stopped because of a reload of all your DOs?
DanTheGoodmanOP • 15mo ago
No, it's still happening; I've been seeing it all night.
Milan • 15mo ago
Are you throwing an exception in your code somewhere?
DanTheGoodmanOP • 15mo ago
I added that code snippet above, but it's never being reached, and there are no places where we throw an exception ourselves.
DanTheGoodmanOP • 15mo ago
OK, for some reason, as of a few hours ago, that error started throwing.
(screenshot)
DanTheGoodmanOP • 15mo ago
I don't know why that didn't show last night, though.
Milan • 15mo ago
It looks like almost all your DOs are returning only 201s, with a small percentage returning 400s.
DanTheGoodmanOP • 15mo ago
But it doesn't have any request headers? I'm fixing the log to capture the headers and the query.
Milan • 15mo ago
Yeah, so I went back a couple of days, and that namespace has only been responding with 201s and 400s, mostly 201s though. Also, be back in 10, I'm getting coffee.
DanTheGoodmanOP • 15mo ago
No worries; it stopped a few minutes before I pushed out the update. I'll sanity-check our connection code, but we have nobody above 2k live viewers right now, so nobody should be even remotely near that limit. I wonder if it's actually a side effect of an attack on Twitch, because it comes through our Twitch extension, which gets loaded when the Twitch page loads, and I can see these error logs have our JWT from Twitch. I think we might have figured it out: it seems that when we navigate, we open a new socket but do not close the old one... for some reason the browser is keeping them around for 1-2 minutes now. The rate of logs could be Logpush throttling how fast they are sent; it looked like about 100/s, which is the same limit that exists in the CF dashboard for viewing function logs.
Milan • 15mo ago
Mind expanding on this? I'm not familiar with what the DOs are doing or how the client works, and I'm curious.
DanTheGoodmanOP • 15mo ago
@milan_cf Sure. Basically we're using them for coordination. We use HTMX, so a navigation replaces the component that connects via WebSockets. However, for some reason that wasn't disconnecting, so we deployed what we think is a temporary fix, sketched below. Basically, I think every time our users did something, they opened another socket.
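Roughly the shape of the fix (a sketch, not our exact code; htmx:beforeSwap is the real HTMX event, the rest of the names are illustrative):
let activeSocket: WebSocket | null = null;

function connect(url: string) {
  // Close any previous socket before opening a new one,
  // so component swaps don't leak connections.
  activeSocket?.close(1000, "navigating");
  activeSocket = new WebSocket(url);
}

// Tear down before HTMX replaces the component on navigation.
document.body.addEventListener("htmx:beforeSwap", () => {
  activeSocket?.close(1000, "navigating");
  activeSocket = null;
});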
Milan • 15mo ago
Did it fix the issue?
DanTheGoodmanOP • 15mo ago
@milan_cf The connection limit one, yeah, but not entirely. I think that was one issue, but I think there is still an attack. We added code to verify that the Twitch token was passed and is valid, and this error happens when no token is passed in.
(screenshot)
DanTheGoodmanOP • 15mo ago
Now we're rejecting it before accepting the socket, so that removes the connection limit issue, but there still seem to be similar patterns, and our users are passing tokens (this is the last 12 hours). Perhaps these are viewers who aren't logged in, lol, but yeah, the connection issue is solved. The pattern is just so constant that it doesn't feel like our users. No, Twitch gives a token regardless; this doesn't seem to be our users.
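The check is roughly this shape (a sketch; verifyTwitchJwt stands in for our actual validation):
async fetch(request: Request): Promise<Response> {
  // Reject before acceptWebSocket(), so unauthenticated clients
  // never count against the hibernatable connection limit.
  const token = new URL(request.url).searchParams.get("token");
  if (!token || !(await verifyTwitchJwt(token))) {
    return new Response("missing or invalid token", { status: 401 });
  }
  const pair = new WebSocketPair();
  this.state.acceptWebSocket(pair[1]);
  return new Response(null, { status: 101, webSocket: pair[0] });
}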
Milan • 15mo ago
@danthegoodman we found a regression in the hibernation code around dispatching the close handler (and dropping the WebSocket) upon client disconnects, and we're investigating further. Not certain it's affecting you, but I suspect it probably is; will keep you updated as I find more.
DanTheGoodmanOP • 15mo ago
Appreciate the update! I wonder if the browser was not getting a close and thus kept reconnecting, as we never had this issue before hibernation.
Milan • 15mo ago
GitHub
🐛 Bug Report - Runtime APIs: Hibernating WebSockets remain open whe...
Problem: We started observing the following strange behaviour some time yesterday when using Durable Objects Hibernating WebSockets (calling state.acceptWebSocket(socket)), when connecting via web b...
Milan • 15mo ago
I think the release went out. Have the issues been resolved?
DanTheGoodmanOP • 15mo ago
@milan_cf not sure; we added our own code to prevent this in the meantime, and that has worked for us so far.
Milan • 15mo ago
Oh, I thought you were still seeing problems regardless of that fix, my bad.
DanTheGoodmanOP • 15mo ago
No worries, lol; we had to do two fixes. Were other users hitting this? Or are we the largest WebSocket hibernation user? 🤔
Milan • 15mo ago
Yeah, it seems like it hit a couple of other folks as well. Unfortunately, our test in CI didn't verify whether the disconnect handler ran; it only confirmed that when it ran, everything worked as expected. That, coupled with the lack of hibernatable WS monitoring, made this tricky to confirm without reports from users. We fixed the test case and are working on making this class of bugs discoverable at compile time. We'll also need to think about some monitoring and metrics for hibernatable WebSockets.
DanTheGoodmanOP • 15mo ago
Awesome.
Milan • 15mo ago
Not the largest, but definitely up there 🙂
DanTheGoodmanOP • 15mo ago
😎 I'll take it.
Milan • 15mo ago
Sorry for the trouble, and thanks for the detailed report. We haven't had a larger-scale issue with hibernation before, so this will help us with our tooling going forward.
DanTheGoodmanOP • 15mo ago
Glad it's all sorted!