Hey I think someone is attacking our DOs

Hey I think someone is attacking our DOs:

67 Replies

anyone have ideas on how to verify this? we haven't deployed any code for months on DOs or things that connect to DOs IIRC, certainly not in the last ~40 min I can't even tell what DO it's from

DanTheGoodmanOP•2y ago

the log isnt' particularly helpful either...

message.txt

DanTheGoodmanOP•2y ago

ther eisn't even an exception in the exception log

DanTheGoodmanOP•2y ago

oh yeah that's weird... had to be an attack

DanTheGoodmanOP•2y ago

happened again in a burst, does not seem to be a isngle object so I might guess it's an infra issue on CF's side?

DanTheGoodmanOP•2y ago

added the ID of the object, not all the same, very high cardinality

DanTheGoodmanOP•2y ago

we are getting hundreds of thousands of these tens per second

DanTheGoodmanOP•2y ago

Milan•2y ago

Can you DM me your account ID, I can take a look to see if something is weird

DanTheGoodmanOP•2y ago

yep dm'd, discord might have suppressed the DM notification @milan_cf

DanTheGoodmanOP•2y ago

had another burst a bit ago too

Milan•2y ago

I got the dm, looking This started about an hour ago? 18:30 utc or so? Ah no I see, from 17:50 onwards, definitely see an increase in # invocations

DanTheGoodmanOP•2y ago

Yeah just before then And there’s no reason any individual durable object should be getting even within 20 or 30% of that connection limit Or rather 20% of that limit Largest we know of would be 7k peak

Unknown User•2y ago

Message Not Public

DanTheGoodmanOP•2y ago

Alerts Cloudflare log push to a Grafana stack

DanTheGoodmanOP•2y ago

another one

Milan•2y ago

Yeah, I see a linear increase in the number of websocket connections to some of your DOs

DanTheGoodmanOP•2y ago

how high is the cardinality? Seemed pretty high from the logs when I added the IDs to the urls in the workers (query param) I also checked twitch and none of the 4 people above that viewer count are using our tools

Milan•2y ago

Monitoring of ws hibernation isn't really good enough to know how many DO instances were hitting that limit, I can see there were about 900 instances, most of which received very few messages (under 10), between 17:50utc and 20:00utc (so the last 2 hours) Bit over 700 of them received 10 or fewer messages

DanTheGoodmanOP•2y ago

that's expected

Milan•2y ago

How many connections are you generally expecting to get to a single DO instance? How are you connecting to the DO?

DanTheGoodmanOP•2y ago

@milan_cf typically <100, but some can be in the thousands. We are connecting through browser websockets

DanTheGoodmanOP•2y ago

ah more

DanTheGoodmanOP•2y ago

@milan_cf any luck? It's pretty constant right now

Milan•2y ago

We don't think the issue is with our infrastructure, there is a significant increase in invocations to your durable object namespace starting at 17:50 UTC, and its been sustained for a while. It's likely that someone is opening a lot of websocket connections to your DOs and forcing you to hit the connection limit.

DanTheGoodmanOP•2y ago

Gotcha, so perhaps an attack then because we pushed no code that modifies how we connect to clients. In my checking of the pretty unhelpful exception logs I do see they are quite spread out among US and EU, but a large number coming from eastern EU. Is there any way to check that? Logpush does not give us that info and this exception happens before our code runs it seems Unless maybe we can get that info in the worker before connecting to the DO?

Milan•2y ago

this exception happens before our code runs it seems

You mean logpush or the DO code? This exception should be from acceptWebSocket() throwing in the DO

DanTheGoodmanOP•2y ago

see this there's no exception in the exceptions array maybe that's not the right exception, but then what is that exception lol

Milan•2y ago

Where did you get the 32k connection limit exception then?

DanTheGoodmanOP•2y ago

logpush that message from the screenshot is literally all logpush was sending, so I looked into the function call logs for more info

DanTheGoodmanOP•2y ago

it also comes in waves completely uncharacteristic of any behavior from normal users of our app

DanTheGoodmanOP•2y ago

and it seems a bit too constant to be our users

Milan•2y ago

not sure why there's no exception there... I can ask around tomorrow (everyone is currently out for the day). I still think this is some sort of attack, but we definitely need to improve out hibernatable ws monitoring. It's probably worth wrapping acceptWebSocket in a try catch, or tracking how many ws you have connected and refusing to allow more to connect if you're near the limit (to avoid errors).

DanTheGoodmanOP•2y ago

@milan_cf that's what I just pinged my team we are going to do (is try catch that and log our own error) @.hades32 fyi (my team) just added some try catch and additinal logging

DanTheGoodmanOP•2y ago

ugh perfect timing to stop...

DanTheGoodmanOP•2y ago

nothing in the logs still... trying to return a valid response to see if the exceptions go away

DanTheGoodmanOP•2y ago

does not seem to be that @milan_cf , as these errors are still blank

DanTheGoodmanOP•2y ago

idk what these exceptions even are...

try {
        this.state.acceptWebSocket(pair[1], [t]);
      } catch (error) {
        console.error("error accepting websocket:", JSON.stringify(error), JSON.stringify(request.headers))
        throw error
      }

try {
        this.state.acceptWebSocket(pair[1], [t]);
      } catch (error) {
        console.error("error accepting websocket:", JSON.stringify(error), JSON.stringify(request.headers))
        throw error
      }

this doesn't seem to be firing I think those exceptions are unrelated, we aren't getting the conneciton limit log right now, so that seems to be a second issue we've found @milan_cf

Milan•2y ago

probably stopped because of reload of all your DOs?

DanTheGoodmanOP•2y ago

No it's still happening been seeing it all night

Milan•2y ago

are you throwing an exception in your code somewhere?

DanTheGoodmanOP•2y ago

I added that code snippet above but it's never being reached, and there are no places that we are throwing an exception ourselves

DanTheGoodmanOP•2y ago

ok for some reason as of a few hours ago that error started throwing

DanTheGoodmanOP•2y ago

idk why that didn't show last night though

Milan•2y ago

It looks like almost all your DOs are returning only 201s, a small percentage returning 400s

DanTheGoodmanOP•2y ago

but it doesn't have any request headers? I'm fixing the log to get the headers, and the query

Milan•2y ago

Yeah so I went back a couple days and that namespace has only been responding with 201s and 400s, mostly 201s though also be back in 10 I'm getting coffee

DanTheGoodmanOP•2y ago

no worries, it stopped a few min before I pushed out the update I'll sanity check our connection code, but we have nobody above 2k live viewers right now so nobody should be even remotely near that limit I wonder if it's actually a side-effect of an attack on twitch, because it is through our twitch extension which gets loaded whe the twitch page loads I can see these error logs have our JWT from twitch I think we might have figured it out, it seems that when we navigate we open a new socket but do not close new ones... for some reason the browser is keeping them around for 1-2 minutes now the rate of logs could be logpush throttling how fast they are sent, it looked like about 100/s which is the same limit that exists in the CF dashboard for viewing function logs

Milan•2y ago

Mind expanding on this? I'm not familiar with what the DOs are doing or how the client works and I'm curious

DanTheGoodmanOP•2y ago

@milan_cf Sure, basically we are using them as coordination, we use HTMX so a navigation is replacing the component that connects via websockets. However for some reason that's not disonnecting, we deployed what we think is a temporary fix Basically I think every time our users did something they opened another socket

Milan•2y ago

Did it fix the issue?

DanTheGoodmanOP•2y ago

@milan_cf The connection limit one yeah, but not entirely, I think that was one issue but I think there still is an attack. We added code to verify that the Twitch token was passed and is valid, and this error happens when no token is passed in

DanTheGoodmanOP•2y ago

Now we are rejecting it before accepting the socket, so that removes the connection limit issue, but it still seems like there are similar patterns. And our users are passing tokens (this is the last 12 hours) perhaps these are viewers not logged in that are viewing though lol, but yeah connection issue solved. But the pattern is just so constant, it doesn't feel like our users No, twitch gives a token regardless, this doens't seem to be our users

Milan•2y ago

@danthegoodman we found a regression in the hibernation code regarding dispatching the close handler (+ dropping the websocket) upon client disconnects and we're investigating further. Not certain it's affecting you but I suspect it probably is, will keep you updated as I find more

DanTheGoodmanOP•2y ago

Appreciate the update! I wonder if the browser was not getting a close and thus kept reconnecting, as we never had this issue before hibernation

Milan•2y ago

https://github.com/cloudflare/workerd/issues/1187

GitHub

🐛 Bug Report — Runtime APIs: Hibernating WebSockets remain open whe...

Problem We started observing the following strange behaviour some time yesterday. when using Durable Objects Hibernating WebSockets (calling state.acceptWebSocket(socket), when connecting via web b...

Milan•2y ago

I think the release went out, have the issues been resolved?

DanTheGoodmanOP•2y ago

@milan_cf not sure, we added our own code to prevent this in the meantime that has worked for us so far

Milan•2y ago

Oh I thought you were still seeing problems regardless of that fix, my bad

DanTheGoodmanOP•2y ago

no worries lol, we had to do 2 fixes were other users hitting this? or are we the largest websocket hibernation user 🤔

Milan•2y ago

Yeah, it seems like it hit a couple other folks as well. Unfortunately our test in CI didn't verify if the disconnect handler ran, it only confirmed that when it ran everything worked as expected. That coupled with lack of hibernatable ws monitoring made this tricky to confirm w/o reports from users We fixed the test case + are working on making this class of bugs discoverable at compile time. Will also need to think about some monitoring and metrics for hibernatable websockets

DanTheGoodmanOP•2y ago

awesome

Milan•2y ago

Not the largest but definitely up there 🙂

DanTheGoodmanOP•2y ago

😎 ill take it

Milan•2y ago

Sorry for the trouble, and thanks for the detailed report. We haven't had a larger scale issue w/ hibernation so this will help us with our tooling going forward

DanTheGoodmanOP•2y ago

glad it's all sorted!

Gaming

Programming

Hey I think someone is attacking our DOs

Did you find this page helpful?