anyone have ideas on how to verify this?
we haven't deployed any code for months on DOs
or things that connect to DOs IIRC, certainly not in the last ~40 min
I can't even tell what DO it's from
the log isn't particularly helpful either...
there isn't even an exception in the exception log
oh yeah that's weird... had to be an attack
happened again in a burst, does not seem to be a single object so I might guess it's an infra issue on CF's side?
added the ID of the object, not all the same, very high cardinality
we are getting hundreds of thousands of these
tens per second
Can you DM me your account ID, I can take a look to see if something is weird
yep
dm'd, discord might have suppressed the DM notification @milan_cf
had another burst a bit ago too
I got the dm, looking
This started about an hour ago?
18:30 utc or so?
Ah no I see, from 17:50 onwards, definitely see an increase in # invocations
Yeah just before then
And there's no reason any individual durable object should be getting even within 20 or 30% of that connection limit
Or rather 20% of that limit
Largest we know of would be 7k peak
Alerts
Cloudflare log push to a Grafana stack
another one
Yeah, I see a linear increase in the number of websocket connections to some of your DOs
how high is the cardinality? Seemed pretty high from the logs when I added the IDs to the urls in the workers
(query param)
I also checked twitch and none of the 4 people above that viewer count are using our tools
Monitoring of ws hibernation isn't really good enough to know how many DO instances were hitting that limit. I can see there were about 900 instances, most of which received very few messages (under 10), between 17:50 UTC and 20:00 UTC (so the last 2 hours)
Bit over 700 of them received 10 or fewer messages
that's expected
How many connections are you generally expecting to get to a single DO instance? How are you connecting to the DO?
@milan_cf typically <100, but some can be in the thousands. We are connecting through browser websockets
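roughly this shape on the client (just a sketch; the URL and query param are placeholders, not our real route):
```ts
// Sketch of how the extension client connects (illustrative; URL and query param are placeholders).
const socket = new WebSocket("wss://example.com/ws?channel=some-channel-id");

socket.addEventListener("open", () => {
  console.log("connected to the coordination DO");
});

socket.addEventListener("message", (event) => {
  // The DO pushes coordination updates; payload shape is illustrative.
  console.log("update", JSON.parse(event.data));
});
```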
ah more
@milan_cf any luck? It's pretty constant right now
We don't think the issue is with our infrastructure. There is a significant increase in invocations to your durable object namespace starting at 17:50 UTC, and it's been sustained for a while. It's likely that someone is opening a lot of websocket connections to your DOs and forcing you to hit the connection limit.
Gotcha, so perhaps an attack then, because we pushed no code that modifies how we connect to clients. In my checking of the pretty unhelpful exception logs I do see they are quite spread out among the US and EU, but with a large number coming from eastern EU. Is there any way to check that? Logpush does not give us that info, and this exception happens before our code runs it seems
Unless maybe we can get that info in the worker before connecting to the DO?
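e.g. something like this in the worker before forwarding to the DO (sketch only; request.cf.country and the CF-IPCountry header are the standard Cloudflare fields, the COORDINATOR binding and channel param are made-up names):
```ts
// Sketch: tag the caller's country in the Worker before handing the request to the DO.
export default {
  async fetch(request: Request, env: { COORDINATOR: DurableObjectNamespace }): Promise<Response> {
    const country =
      (request.cf as { country?: string } | undefined)?.country ??
      request.headers.get("CF-IPCountry");
    console.log("ws connect attempt", { country }); // ends up in function logs / logpush

    const url = new URL(request.url);
    const id = env.COORDINATOR.idFromName(url.searchParams.get("channel") ?? "default");
    return env.COORDINATOR.get(id).fetch(request);
  },
};
```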
> this exception happens before our code runs it seems
You mean logpush or the DO code? This exception should be from acceptWebSocket() throwing in the DO, see this
there's no exception in the exceptions array
maybe that's not the right exception, but then what is that exception lol
Where did you get the 32k connection limit exception then?
logpush
that message from the screenshot is literally all logpush was sending, so I looked into the function call logs for more info
it also comes in waves completely uncharacteristic of any behavior from normal users of our app
and it seems a bit too constant to be our users
not sure why there's no exception there... I can ask around tomorrow (everyone is currently out for the day). I still think this is some sort of attack, but we definitely need to improve our hibernatable ws monitoring. It's probably worth wrapping acceptWebSocket in a try catch, or tracking how many ws you have connected and refusing to allow more to connect if you're near the limit (to avoid errors).
@milan_cf that's what I just pinged my team that we are going to do (try/catch that and log our own error)
@.hades32 fyi (my team)
just added some try catch and additional logging
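roughly this shape (simplified sketch, not the exact code; class name and responses are illustrative):
```ts
// Rough shape of the try/catch + logging (simplified; class name and responses are illustrative).
export class Coordinator implements DurableObject {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);

    try {
      // Hibernatable accept; this is the call that should throw at the 32k connection limit.
      this.state.acceptWebSocket(server);
    } catch (err) {
      console.log("acceptWebSocket threw", {
        error: String(err),
        currentConnections: this.state.getWebSockets().length,
      });
      return new Response("too many connections", { status: 503 });
    }

    return new Response(null, { status: 101, webSocket: client });
  }
}
```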
ugh perfect timing to stop...
nothing in the logs still... trying to return a valid response to see if the exceptions go away
does not seem to be that @milan_cf , as these errors are still blank
idk what these exceptions even are...
this doesn't seem to be firing
I think those exceptions are unrelated, we aren't getting the connection limit log right now, so that seems to be a second issue we've found @milan_cf
probably stopped because of reload of all your DOs?
No it's still happening
been seeing it all night
are you throwing an exception in your code somewhere?
I added that code snippet above but it's never being reached, and there are no places that we are throwing an exception ourselves
ok for some reason as of a few hours ago that error started throwing
idk why that didn't show last night though
It looks like almost all your DOs are returning only 201s, a small percentage returning 400s
but it doesn't have any request headers?
I'm fixing the log to get the headers, and the query
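something like this in the worker (sketch; which fields to pull is illustrative):
```ts
// Sketch of the extra logging in the Worker (field selection is illustrative).
function logConnectAttempt(request: Request): void {
  const url = new URL(request.url);
  console.log("ws connect", {
    path: url.pathname,
    query: Object.fromEntries(url.searchParams),
    // Pick a few headers rather than dumping everything.
    origin: request.headers.get("Origin"),
    userAgent: request.headers.get("User-Agent"),
    country: request.headers.get("CF-IPCountry"),
  });
}
```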
Yeah so I went back a couple days and that namespace has only been responding with 201s and 400s, mostly 201s though
also be back in 10 I'm getting coffee
no worries, it stopped a few min before I pushed out the update
I'll sanity check our connection code, but we have nobody above 2k live viewers right now so nobody should be even remotely near that limit
I wonder if it's actually a side-effect of an attack on twitch, because it is through our twitch extension which gets loaded when the twitch page loads
I can see these error logs have our JWT from twitch
I think we might have figured it out, it seems that when we navigate we open a new socket but do not close the old ones... for some reason the browser is keeping them around for 1-2 minutes
now the rate of logs could be logpush throttling how fast they are sent, it looked like about 100/s which is the same limit that exists in the CF dashboard for viewing function logs
Mind expanding on this? I'm not familiar with what the DOs are doing or how the client works and I'm curious
@milan_cf Sure, basically we are using them as coordination, we use HTMX so a navigation is replacing the component that connects via websockets. However for some reason that's not disconnecting, we deployed what we think is a temporary fix
Basically I think every time our users did something they opened another socket
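the temporary fix is roughly this pattern on the client (sketch, not the exact code):
```ts
// Sketch of the client-side workaround: keep one socket reference
// and close the previous one before the HTMX swap opens a new connection.
let activeSocket: WebSocket | null = null;

function connect(url: string): WebSocket {
  if (activeSocket && activeSocket.readyState !== WebSocket.CLOSED) {
    activeSocket.close(1000, "navigating"); // close the leftover socket from the previous view
  }
  activeSocket = new WebSocket(url);
  return activeSocket;
}
```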
Did it fix the issue?
@milan_cf The connection limit one yeah, but not entirely: I think that was one issue, but there still seems to be an attack. We added code to verify that the Twitch token was passed and is valid, and this error happens when no token is passed in
Now we are rejecting it before accepting the socket, so that removes the connection limit issue, but it still seems like there are similar patterns. And our users are passing tokens (this is the last 12 hours)
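the reject-before-accept part looks roughly like this (sketch; verifyTwitchJwt is a stand-in for the real validation):
```ts
// Sketch of reject-before-accept (verifyTwitchJwt is a stand-in for the real validation).
async function verifyTwitchJwt(token: string): Promise<boolean> {
  // Placeholder: the real check verifies the Twitch extension JWT signature and claims.
  return token.length > 0;
}

export class Coordinator implements DurableObject {
  constructor(private state: DurableObjectState) {}

  async fetch(request: Request): Promise<Response> {
    const token = new URL(request.url).searchParams.get("token");
    if (!token || !(await verifyTwitchJwt(token))) {
      // Rejecting here means these requests never reach acceptWebSocket(),
      // so they can't push us toward the connection limit.
      return new Response("unauthorized", { status: 401 });
    }

    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    this.state.acceptWebSocket(server);
    return new Response(null, { status: 101, webSocket: client });
  }
}
```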
perhaps these are viewers not logged in that are viewing though lol, but yeah connection issue solved. But the pattern is just so constant, it doesn't feel like our users
No, twitch gives a token regardless, this doesn't seem to be our users
@danthegoodman we found a regression in the hibernation code regarding dispatching the close handler (+ dropping the websocket) upon client disconnects and we're investigating further. Not certain it's affecting you but I suspect it probably is, will keep you updated as I find more
Appreciate the update!
I wonder if the browser was not getting a close and thus kept reconnecting, as we never had this issue before hibernation
GitHub
Bug Report - Runtime APIs: Hibernating WebSockets remain open whe...
Problem: We started observing the following strange behaviour some time yesterday when using Durable Objects Hibernating WebSockets (calling state.acceptWebSocket(socket), when connecting via web b...
I think the release went out, have the issues been resolved?
@milan_cf not sure, we added our own code to prevent this in the meantime that has worked for us so far
Oh I thought you were still seeing problems regardless of that fix, my bad
no worries lol, we had to do 2 fixes
were other users hitting this?
or are we the largest websocket hibernation user
Yeah, it seems like it hit a couple other folks as well. Unfortunately our test in CI didn't verify if the disconnect handler ran, it only confirmed that when it ran everything worked as expected. That coupled with lack of hibernatable ws monitoring made this tricky to confirm w/o reports from users
We fixed the test case + are working on making this class of bugs discoverable at compile time. Will also need to think about some monitoring and metrics for hibernatable websockets
awesome
Not the largest but definitely up there
I'll take it
Sorry for the trouble, and thanks for the detailed report. We haven't had a larger scale issue w/ hibernation so this will help us with our tooling going forward
glad it's all sorted!