Is there a workaround that will let me test a Durable Object waking up from hibernation at all? Or is this currently impossible to test?
Asking because I'm trying to simulate that event by waiting for 10 seconds, but every time I do, I hit a segfault… 😬
I made a minimal replication in this repo (on the hibernate branch): https://github.com/nvie/ws-promise-bug-repro/blob/hibernate/durable-objects/test/illustrate-bug.test.ts#L22-L24
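The approach in the repro boils down to waiting in real wall-clock time and hoping workerd evicts the object in the meantime. A minimal sketch of that (the `sleep`/`waitForHibernation` helper names and the 10-second figure are illustrative, not an official API):

```typescript
// Minimal sketch of the "wait until workerd hibernates the DO" approach.
// There is no API to force hibernation, so the test just waits.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function waitForHibernation(): Promise<void> {
  // Assumption: ~10s of idle real time is enough for workerd to
  // hibernate an idle Durable Object. This is exactly the fragile part:
  // somewhere during this wait, the segfault happens.
  await sleep(10_000);
}
```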
It's related to my feature request from last week for an API to test a Durable Object waking up from hibernation (see https://github.com/cloudflare/workers-sdk/issues/5423)
Do you have a tip for me maybe, @MrBBot ?
Hey! 👋 I think this might be impossible to test unfortunately. 😦 Thanks for the reproduction though. Will see if I can get that looked at. 👍
FWIW, I'm seeing these segfaults as well when running our workers locally using `wrangler dev`, so it's not limited to just the Vitest pool workers. The only thing we do is keep a socket open for a while and wait long enough. The segfaults seem to happen out of the blue; I haven't found a reliable way to reproduce them (other than triggering them from Vitest).
Most of the time, a console log appears, but sometimes the worker sockets become unavailable "silently". It then behaves exactly like the segfault case, except there is no log for it 🤔

In production, we're seeing another strange effect which may be related to this. We set up a simple ping/pong auto-responder, but after waking up from hibernation, when deciding which sockets to restore and which to close, we're seeing some calls to `state.getWebSocketAutoResponseTimestamp()` return a date that's too old, even though we have definitely received a more recent "pong" response from the server.
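To make the restore-or-close decision concrete, here is a sketch of the kind of check I mean. The `isSocketStale` helper and the 30-second timeout are my own illustrative names; `getWebSockets()` and `getWebSocketAutoResponseTimestamp()` are the actual Workers hibernation APIs:

```typescript
// Illustrative helper (not a Workers API): decide whether a hibernated
// socket's last auto-response ("pong") timestamp is too old to keep it alive.
export function isSocketStale(
  lastPong: Date | null,
  now: Date,
  timeoutMs: number,
): boolean {
  // No auto-response ever recorded: treat as stale.
  if (lastPong === null) return true;
  return now.getTime() - lastPong.getTime() > timeoutMs;
}

// Sketched usage inside a Durable Object after waking from hibernation,
// assuming `state: DurableObjectState` with the hibernation API:
//
//   for (const ws of state.getWebSockets()) {
//     const lastPong = state.getWebSocketAutoResponseTimestamp(ws);
//     if (isSocketStale(lastPong, new Date(), 30_000)) {
//       ws.close(1011, "stale connection");
//     }
//   }
```

The bug we're observing is that `lastPong` comes back older than a "pong" we know was delivered, so sockets that should be restored get closed instead.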
Is this a known issue? Or might it be that the workers are segfaulting similarly in production, and hence failing to update the last auto-response timestamp correctly somehow?

Here I managed to capture such a segfault happening on video:
👀 https://share.cleanshot.com/WF23n8kjy8t6Wdmlgr4N
Video segments:
0:00 – Basic normal app demonstration
0:20 – Wait to trigger hibernation
0:30 – Show that DO wakes up from hibernation and restores the socket correctly
0:43 – Wait to trigger hibernation again
0:55 – Demo that it worked as expected again
1:01 – Start waiting for a longer period of time until the segfault happens
1:36 – A normal ping/pong happens (see the logs @ 1:40, and the network tab @ 1:52)
2:05 – Another normal ping/pong happens
2:25 – 💥 CRASH 💥 At this point `workerd` becomes completely unavailable
In this demo, there are no DO alarms or any other DO complications. It's basically a simple websocket setup with hibernation and "just" waiting for the crash to happen.
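For context, the shape of that setup is roughly the following. The upgrade check is factored out as a plain function here; the Workers-specific parts (which are the real hibernation APIs: `WebSocketPair`, `state.acceptWebSocket()`, the `webSocketMessage` handler) are shown only in comments since they need the workerd runtime, and the class/field names are illustrative:

```typescript
// Plain helper (testable outside workerd): is this a WebSocket upgrade request?
export function isWebSocketUpgrade(upgradeHeader: string | null): boolean {
  return upgradeHeader?.toLowerCase() === "websocket";
}

// Sketch of the Durable Object itself, shown as comments because it only
// runs inside workerd:
//
//   export class Room {
//     constructor(private state: DurableObjectState) {}
//
//     async fetch(request: Request): Promise<Response> {
//       if (!isWebSocketUpgrade(request.headers.get("Upgrade"))) {
//         return new Response("expected websocket", { status: 426 });
//       }
//       const { 0: client, 1: server } = new WebSocketPair();
//       // Hand the server end to the runtime so the DO can hibernate
//       // while the connection stays open.
//       this.state.acceptWebSocket(server);
//       return new Response(null, { status: 101, webSocket: client });
//     }
//
//     webSocketMessage(ws: WebSocket, message: string | ArrayBuffer) {
//       // Messages after a wake-up land here; this is where the demo's
//       // logging happens, until the crash at 2:25.
//     }
//   }
```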
Will see if I can find the time to set up a more minimal replication to rule out any Liveblocks-specific complexities. From the looks of it, it will eventually happen on any hibernated DO if you just wait long enough; the exact timing varies.