Created by thelunz on 7/16/2024 in #workers-help
websocket-upgrade fetch from worker to DO randomly delayed
We are using DOs for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks. A synchronizer watches for any of its socket connections dropping or responding too sluggishly. In such cases it automatically reconnects by sending a wss:// request to our ingress worker, whose fetch delegates to methods like this:
```ts
function synchToSession(request: Request, env: Env, colo: string) {
  if (request.headers.get("Upgrade") === "websocket") {
    // request.url is a string, so parse it before reading query params
    const sessionId = new URL(request.url).searchParams.get('session');
    if (!sessionId) return new Response("missing session", { status: 400 });

    const runnerId = env.SESSION_RUNNER.idFromName(sessionId);
    const sessionRunner = env.SESSION_RUNNER.get(runnerId);

    console.log(`worker@${colo}: forwarding websocket`);

    return sessionRunner.fetch(request);
  }
  return new Response("expected websocket upgrade", { status: 426 });
}
```
...where the "session runner" DO has a fetch that boils down to:
```ts
async fetch(request: Request): Promise<Response> {
  const { 0: clientSocket, 1: ourSocket } = new WebSocketPair();
  ourSocket.accept();
  // ...set up event handlers etc., then...
  return new Response(null, { status: 101, webSocket: clientSocket });
}
```
Although the reconnections usually take on the order of 50 ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to reconnect, and those reconnections are held up for a second or more before all completing at the same moment. The worst cases are delayed by over 10 seconds. The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.
For context:
* the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
* there is no other significant traffic to our ingress worker.

How could we at least figure out where the time is going?
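As a first step, a timing wrapper around the DO stub's fetch on the worker side would confirm whether the gap sits on that hop. A sketch, where the `timed` helper and its label format are illustrative additions, not part of the original code:

```ts
// Generic helper: run an async operation and log its wall-clock duration.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`${label}: ${Date.now() - start}ms`);
  }
}

// Inside synchToSession, the delegation would become:
//   return timed(`worker@${colo}: DO fetch`, () => sessionRunner.fetch(request));
```

One caveat: within a Workers request, `Date.now()` only advances across I/O boundaries, so the measured duration reflects time spent awaiting the DO rather than local CPU time, which is exactly the hop in question here.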