Is there a workaround that will let me test a Durable Object waking up from hibernation at all? Or is this currently impossible to test?
Asking because I'm trying to simulate that event by waiting for 10 seconds, but every time I do, I hit a segfault… 😬
I made a minimal replication in this repo (on the hibernate branch): https://github.com/nvie/ws-promise-bug-repro/blob/hibernate/durable-objects/test/illustrate-bug.test.ts#L22-L24
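The approach in the repro boils down to waiting in real wall-clock time and hoping workerd evicts the object in the meantime. A minimal sketch of that (the `sleep`/`waitForHibernation` helper names and the 10-second figure are illustrative, not an official API):

```typescript
// Minimal sketch of the "wait until workerd hibernates the DO" approach.
// There is no API to force hibernation, so the test just waits.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function waitForHibernation(): Promise<void> {
  // Assumption: ~10s of idle real time is enough for workerd to
  // hibernate an idle Durable Object. This is exactly the fragile part:
  // somewhere during this wait, the segfault happens.
  await sleep(10_000);
}
```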
It's related to my feature request from last week for an API to test a Durable Object waking up from hibernation (see https://github.com/cloudflare/workers-sdk/issues/5423)
Do you have a tip for me maybe, @MrBBot ?
Hey! 👋 I think this might be impossible to test unfortunately. 😦 Thanks for the reproduction though. Will see if I can get that looked at. 👍
FWIW, I'm seeing these segfaults as well when running our workers locally using `wrangler dev`, so it's not limited to just the Vitest pool workers. The only thing we do is keep a socket open for a while and wait long enough. The segfaults seem to happen out of the blue; I haven't found a reliable way to reproduce them (other than triggering them from Vitest).
Most of the time, a console log appears, but sometimes the worker sockets become unavailable "silently". It then behaves exactly like the segfault case, except there is no log for it 🤔

In production, we're seeing another strange effect which may be related to this. We set up a simple ping/pong auto-responder, but after waking up from hibernation, when deciding which sockets to restore and which to close, we're seeing some calls to `state.getWebSocketAutoResponseTimestamp()` return a date that's too old, even though we have definitely received a more recent "pong" response from the server.
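To make the restore-or-close decision concrete, here is a sketch of the kind of check I mean. The `isSocketStale` helper and the 30-second timeout are my own illustrative names; `getWebSockets()` and `getWebSocketAutoResponseTimestamp()` are the actual Workers hibernation APIs:

```typescript
// Illustrative helper (not a Workers API): decide whether a hibernated
// socket's last auto-response ("pong") timestamp is too old to keep it alive.
export function isSocketStale(
  lastPong: Date | null,
  now: Date,
  timeoutMs: number,
): boolean {
  // No auto-response ever recorded: treat as stale.
  if (lastPong === null) return true;
  return now.getTime() - lastPong.getTime() > timeoutMs;
}

// Sketched usage inside a Durable Object after waking from hibernation,
// assuming `state: DurableObjectState` with the hibernation API:
//
//   for (const ws of state.getWebSockets()) {
//     const lastPong = state.getWebSocketAutoResponseTimestamp(ws);
//     if (isSocketStale(lastPong, new Date(), 30_000)) {
//       ws.close(1011, "stale connection");
//     }
//   }
```

The bug we're observing is that `lastPong` comes back older than a "pong" we know was delivered, so sockets that should be restored get closed instead.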
Is this a known issue? Or might it be that the workers are segfaulting similarly in production, and hence failing to update the last auto-response timestamp correctly somehow?

Here I managed to capture such a segfault happening on video:
👀 https://share.cleanshot.com/WF23n8kjy8t6Wdmlgr4N
Video segments:
0:00 – Basic normal app demonstration
0:20 – Wait to trigger hibernation
0:30 – Show that DO wakes up from hibernation and restores the socket correctly
0:43 – Wait to trigger hibernation again
0:55 – Demo that it worked as expected again
1:01 – Start waiting for a longer period of time until the segfault happens
1:36 – A normal ping/pong happens (see the logs @ 1:40, and the network tab @ 1:52)
2:05 – Another normal ping/pong happens
2:25 – 💥 CRASH 💥 At this point `workerd` becomes completely unavailable
In this demo, there are no DO alarms or any other DO complications. It's basically a simple websocket setup with hibernation and "just" waiting for the crash to happen.
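For context, the shape of that setup is roughly the following. The upgrade check is factored out as a plain function here; the Workers-specific parts (which are the real hibernation APIs: `WebSocketPair`, `state.acceptWebSocket()`, the `webSocketMessage` handler) are shown only in comments since they need the workerd runtime, and the class/field names are illustrative:

```typescript
// Plain helper (testable outside workerd): is this a WebSocket upgrade request?
export function isWebSocketUpgrade(upgradeHeader: string | null): boolean {
  return upgradeHeader?.toLowerCase() === "websocket";
}

// Sketch of the Durable Object itself, shown as comments because it only
// runs inside workerd:
//
//   export class Room {
//     constructor(private state: DurableObjectState) {}
//
//     async fetch(request: Request): Promise<Response> {
//       if (!isWebSocketUpgrade(request.headers.get("Upgrade"))) {
//         return new Response("expected websocket", { status: 426 });
//       }
//       const { 0: client, 1: server } = new WebSocketPair();
//       // Hand the server end to the runtime so the DO can hibernate
//       // while the connection stays open.
//       this.state.acceptWebSocket(server);
//       return new Response(null, { status: 101, webSocket: client });
//     }
//
//     webSocketMessage(ws: WebSocket, message: string | ArrayBuffer) {
//       // Messages after a wake-up land here; this is where the demo's
//       // logging happens, until the crash at 2:25.
//     }
//   }
```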
Will see if I can find the time to set up a more minimal replication to rule out any Liveblocks-specific complexities. From the looks of it, it will eventually happen on any hibernated DO if you just wait long enough; the exact timing varies.