Is sending to WebSockets from a DO fully blocking?
We have a simple DO that clients connect to via WS and we use hibernation. The DO has an endpoint that our backend can hit to send out a broadcast to all connected clients.
Sometimes the trigger HTTP call times out, even after 5 retries with backoff. My first attempt at fixing this was to change the trigger endpoint to just schedule an alarm, which then handles the sending, but that didn't help.
I assume that is because the DO is single-threaded and so, if the alarm is running, all incoming requests are blocked.
Today I noticed that this happened even though only ~10 connections were open, so the only explanation I have is that one of them is semi-broken, or just very slow, and that the send() call is fully blocking.
Is that the case? If so, why? This seems unnecessarily strict, as sending a WS message could just put it into an outbound queue.
9 Replies
@Helpflare who can help with this?
Sending a websocket message does put it into a queue, it doesn't block.
Your backend sends a broadcast request to your DO (the broadcast endpoint), and then you do what exactly? Get back all your websockets and send them each a message? wdym by "http call times out"? Are you able to confirm that any of the ws get the broadcasted message?
Yes exactly, the endpoint just forwards the fetch to the DO and the DO then iterates over all the websockets and sends each the same message.
And that fetch then times out while waiting for the response - which would just be an empty 200.
So all of this should be basically instant, so I don't understand why we sometimes get timeouts
As I can't reproduce it, I also don't know if some messages get out
The only time I could reproduce it, I had just closed the websocket client, so I don't know if it WOULD have received the message.
But as this case triggered the issue, I assumed it might either be blocking or at least that the fetch only returns when the DO is done with all WS messages
Also the send() call sometimes throws an exception that the WS is already closed. How does that work if the message only gets queued, @milan?
I'll check with the team but I'm fairly certain the response wouldn't be blocked by outbound messages waiting in the queue, that should just prevent the DO from being evicted.
If you get an exception from send() that the WS was already closed, that means ws.close() was called previously. IIRC it's possible that you called ws.close() and the client hasn't responded with its own close message, so we wouldn't remove it from the list of hibernatable websockets. In that case, you could get it back via getWebSockets() even though you can't send outbound messages anymore.
Thanks, yeah the latter makes sense. Just the timeouts are a mystery to me and I don't know how else to debug
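For illustration, a broadcast loop that tolerates sockets which are still listed but can no longer be sent to might look like the sketch below. This is not the code from this thread: the SocketLike interface and the broadcast() helper are hypothetical names, and it assumes the sockets passed in are whatever getWebSockets() returned. It counts failed sends instead of silently dropping them, so the counts can be logged when a timeout is being investigated.

```typescript
// Hypothetical minimal socket shape for this sketch. In a Durable Object the
// values would be the WebSocket objects returned by getWebSockets() (assumption).
interface SocketLike {
  send(message: string): void;
}

interface BroadcastResult {
  sent: number;   // sockets where send() returned normally
  failed: number; // sockets where send() threw, e.g. because they were already closed
}

// Send the same message to every socket. A throwing send() on one socket
// must not prevent delivery to the remaining sockets.
function broadcast(sockets: Iterable<SocketLike>, message: string): BroadcastResult {
  let sent = 0;
  let failed = 0;
  for (const ws of sockets) {
    try {
      ws.send(message);
      sent++;
    } catch (err) {
      // The socket may still be listed while its close handshake is
      // incomplete; count the failure and keep going.
      failed++;
    }
  }
  return { sent, failed };
}
```

The fetch handler could then log the returned counts and respond with an empty 200 right after the loop, since nothing in the loop is awaited.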
I assume you're awaiting the fetch to the DO and that is what's timing out? Maybe wrap your ws.send() in a try...catch? Do you have a webSocketClose() handler that closes the DO's websocket?
We are try catching and then just ignoring that error. We do have that handler, but we don't do anything with them
This is the code. Had to rename from .ts
You're not timing out when you call broadcast via alarm()?
I don't know that we can do much without a repro here, you should be able to return a response before all messages are sent so send() wouldn't block. Something else must be happening, maybe it was caused by additional work running in the constructor?