websocket-upgrade fetch from worker to DO randomly delayed
We are using Durable Objects (DOs) for a registry that coordinates the running of multi-user web sessions. We have "synchronizer" nodes that are external to the registry, each maintaining long-lived websockets into the registry for its housekeeping tasks.
A synchronizer watches for any of its socket connections dropping, or responding too sluggishly. In such cases, it automatically re-connects by sending a `wss` request to our ingress worker, whose `fetch` delegates to methods like this:
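Our actual handler is more involved; the following is a minimal sketch of the delegation, where the binding and helper names (Env, SESSION_RUNNERS, the path-based session id) are placeholders rather than our real ones:

```typescript
// Minimal sketch of the ingress worker's delegation (names are illustrative).
export interface Env {
  SESSION_RUNNERS: DurableObjectNamespace;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Pick the target session out of the request path (illustrative routing).
    const url = new URL(request.url);
    const sessionId = url.pathname.split('/').pop() ?? 'default';

    // This is the worker-side console message referred to below.
    console.log(`forwarding websocket upgrade for session ${sessionId}`);

    // Look up the (normally already-active) session-runner DO and hand the
    // upgrade request straight to its fetch().
    const id = env.SESSION_RUNNERS.idFromName(sessionId);
    const stub = env.SESSION_RUNNERS.get(id);
    return stub.fetch(request);
  },
};
```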
...where the "session runner" DO has a fetch
that boils down to:
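Again simplified, with the class name and message handling as placeholders:

```typescript
// Minimal sketch of the session-runner DO accepting the upgrade
// (class name and message handling are illustrative).
export class SessionRunner {
  constructor(private state: DurableObjectState, private env: unknown) {}

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get('Upgrade') !== 'websocket') {
      return new Response('expected a websocket upgrade', { status: 426 });
    }

    // Create the socket pair, keep the server side, hand the client side back.
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    server.accept();

    server.addEventListener('message', (event) => {
      // ...housekeeping protocol handling of event.data goes here...
    });

    return new Response(null, { status: 101, webSocket: client });
  }
}
```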
Although the reconnections usually take of the order of 50ms, every few hours we hit periods when several synchronizers all detect a sluggish response and try to re-connect, and those reconnections are held up for a second or more before all completing at the same time. The worst cases have a delay of over 10 seconds.
The logs show that almost the entire delay occurs between the worker's console message and the subsequent GET log line for the DO.

For context:
* the delays only rarely coincide with eviction and reload of DOs; generally the DOs are already active (i.e., no cold start involved).
* there is no other significant traffic to our ingress or workers.

How could we at least figure out where the time is going?
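For reference, the span in question is essentially the worker's `stub.fetch` call; a crude way to bracket it, purely as an illustration (not our actual logging), would be:

```typescript
// Illustrative only: bracket the worker→DO hop with timestamps. Note this times
// stub.fetch end to end, i.e. the hop itself plus the DO's handling of the upgrade.
async function forwardWithTiming(
  request: Request,
  env: { SESSION_RUNNERS: DurableObjectNamespace },
  sessionId: string
): Promise<Response> {
  const stub = env.SESSION_RUNNERS.get(env.SESSION_RUNNERS.idFromName(sessionId));

  const before = Date.now();
  const response = await stub.fetch(request); // the leg that stalls
  const elapsed = Date.now() - before;        // Date.now() advances across awaited I/O in Workers

  if (elapsed > 500) {
    console.log(`slow worker→DO websocket upgrade: ${elapsed}ms for session ${sessionId}`);
  }
  return response;
}
```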
6 Replies
Hi there! Thanks for reaching out and apologies for the delay here.
Is it possible to get those logs sent to us? I just want to review what data we do have and try to go from there. I'm assuming it's rather difficult to replicate this and get some real time data?
We submitted a support request on July 19th, and are now awaiting feedback - including on how we might best help the support team to pinpoint what's going on.
It's not clear that the logs would be of much help. For a few hours everything is orderly - with continuous low-level chatter from the housekeeping tasks, and occasional hiccoughs (typically due to minor network stutter) resolving themselves in 100ms or so... and then we hit one of the dead patches, where everything temporarily grinds to a halt. From the perspective of our operations and the logging, these patches come completely out of the blue.
I gotcha, makes sense. Mind sending me that ticket number in DMs? I'll take a look at the status of that!
@Nolan We received notification of the ticket being moved into the Salesforce tracker, as part of the big migration that's going on. We continue to see the destructive random delays, as before; is there some way we can work with the team to understand whether a resolution is even going to be possible?
Hey there! Yeah, I'll find the ticket again! The Salesforce move has been a bit disruptive, so my apologies!
Haha. I can’t access the ticket either, but @Vero 🐙 was looking into that. Thank you!
@Nolan we added the info you requested to the ticket 3 weeks ago. Can you take a look again, please?