websocket-upgrade fetch from worker to DO randomly delayed

We are using DOs for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks. A synchronizer watches for any of its socket connections dropping, or responding too sluggishly. In such cases, it automatically re-connects by sending a wss request to our ingress worker, whose fetch delegates to methods like this:
function synchToSession(request: Request, env: Env, colo: string) {
  if (request.headers.get("Upgrade") === "websocket") {
    // request.url is a string, so parse it before reading the query string
    const sessionId = new URL(request.url).searchParams.get('session');
    const runnerId = env.SESSION_RUNNER.idFromName(sessionId);
    const sessionRunner = env.SESSION_RUNNER.get(runnerId);

    console.log(`worker@${colo}: forwarding websocket`);

    return sessionRunner.fetch(request);
  }
}
...where the "session runner" DO has a fetch that boils down to:
async fetch(request: Request): Promise<Response> {
  const { 0: clientSocket, 1: ourSocket } = new WebSocketPair();
  ourSocket.accept();
  // ...set up event handlers etc, then...
  return new Response(null, { status: 101, webSocket: clientSocket });
}
Although the reconnections usually take of the order of 50 ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to re-connect, and those reconnections are held up for a second or more before all completing at the same time. The worst cases have a delay of over 10 seconds. The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.
For context:
* the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
* there is no other significant traffic to our ingress or workers.

How could we at least figure out where the time is going?
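One way to narrow this down might be to time the forwarding call itself in the worker, so that the logs show both when the request was handed to the DO stub and how long the stub's fetch took to settle. The sketch below is only an illustration: `timed`, `Log` and `Clock` are hypothetical names, not part of the code above or of any Cloudflare API, and it assumes the clock advances across the awaited call.

```typescript
// Hypothetical helper for illustration; not from the post or any Cloudflare API.
type Log = (line: string) => void;
type Clock = () => number;

// Runs an async operation, logging a timestamp before it starts and the
// elapsed wall-clock time once it settles (whether it resolves or rejects).
async function timed<T>(
  label: string,
  op: () => Promise<T>,
  log: Log = console.log,
  now: Clock = Date.now,
): Promise<T> {
  const startedAt = now();
  log(`${label}: start t=${startedAt}`);
  try {
    return await op();
  } finally {
    log(`${label}: settled after ${now() - startedAt}ms`);
  }
}

// Possible use inside synchToSession (names taken from the code above):
//   return timed(`worker@${colo}: sessionRunner.fetch`, () => sessionRunner.fetch(request));
```

Logging a timestamp as the first statement of the DO's fetch as well would give a second reference point, with the caveat that the worker and the DO may run on different machines whose clocks need not agree exactly.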
6 Replies
Nolan · 7mo ago
Hi there! Thanks for reaching out and apologies for the delay here. Is it possible to get those logs sent to us? I just want to review what data we do have and try to go from there. I'm assuming it's rather difficult to replicate this and get some real time data?
thelunz (OP) · 7mo ago
We submitted a support request on July 19th, and are now awaiting feedback - including on how we might best help the support team to pinpoint what's going on.
It's not clear that the logs would be of much help. For a few hours everything is orderly - with continuous low-level chatter from the housekeeping tasks, and occasional hiccoughs (typically due to minor network stutter) resolving themselves in 100ms or so... and then we hit one of the dead patches, where everything temporarily grinds to a halt. From the perspective of our operations and the logging, these patches appear to be completely out of the blue.
Nolan · 7mo ago
I gotcha, makes sense. Mind sending me that ticket number in DMs? I'll take a look at the status of that!
thelunz (OP) · 7mo ago
@Nolan We received notification of the ticket being moved into the Salesforce tracker, as part of the big migration that's going on. We continue to see the destructive random delays, as before; is there some way we can work with the team to understand whether a resolution is even going to be possible?
Nolan · 7mo ago
Hey there! Yeah, I'll find the ticket again! The Salesforce move has been a bit disruptive, so my apologies!
codefrau · 6mo ago
Haha. I can’t access the ticket either, but @Vero 🐙 was looking into that. Thank you! @Nolan we added the info you requested to the ticket 3 weeks ago. Can you take a look again, please?
