Hey Cloudflare team! We're making great

Hey Cloudflare team! We're making great progress in migrating our code base from Jest to Vite and to new Vitest integration—we're almost done! There is however one remaining issue that we cannot figure out ourselves. It's really hard to explain for us what's going on here, but fortunately it can be reliably reproduced, so I've made a screencast. Maybe you could help us with this? Because currently it's blocking us from finishing our migration. The screencast/demo: 🍿 https://share.cleanshot.com/d914hhK8RFYpDP5N9Ldg Relevant parts from the reproduction repo: The worker definition: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/src/index.ts#L1-L17 The test: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/test/illustrate-bug.test.ts#L19-L34 Any help would be greatly appreciated! 🙏
27 Replies
MrBBot
MrBBot10mo ago
Woah! Thank you for the super helpful screencast. It definitely looks like the workerd runtime is crashing, and that's what's causing the ECONNREFUSED you're seeing. The next step here would be to try reproduce this using workerd only. The Vitest integration does a few weird things with Durable Objects/service bindings internally that might be causing the issue here, but it would good to rule out a workerd-bug. I should be able to look into this next week, but if you wanted to try yourself, this sample would be a good starting point: https://github.com/cloudflare/workerd/tree/main/samples/durable-objects-chat Actually, could you try something like this...
// ...
expect(ws).toBeDefined();

const messagePromise = waitForMessage(ws);
ws.accept();

ws.send("hello");

const event = await messagePromise;
// ...
// ...
expect(ws).toBeDefined();

const messagePromise = waitForMessage(ws);
ws.accept();

ws.send("hello");

const event = await messagePromise;
// ...
Essentially, move the addEventListener() before the accept()
Vincent
VincentOP10mo ago
I should have mentioned this in the screencast, but that's actually something we tried too indeed. It makes no difference: - Still works when sending only one message - Still crashes when sending two messages - Still works when you put await wait(1) somewhere - The only difference it makes is that the test will no longer timeout when you move the wait after the second message but before the waitForMessage call. It actually makes sense in that case! Unfortunately, I don't have the time to try a setup against workerd directly (and it's a little too far out of my comfort zone tbh), but it sounds like a good plan to me. Really curious if you can figure out what's going on here.
MrBBot
MrBBot10mo ago
Ok no worries. If you can open a GitHub issue with your original message, I can add this to my list. 👍 One other quick thing to try, does adding webSocketError() and webSocketClose() handlers to your Demo object change anything?
Vincent
VincentOP10mo ago
Just empty ones?
MrBBot
MrBBot10mo ago
Maybe __console.log(arguments) within them just to see if they're called with anything useful? (__console is just console without Vitest patching)
Vincent
VincentOP10mo ago
Just added them (also added it to the webSocketMessage one), it makes no behavioral difference in any of the variants. Only the webSocketMessage one gets called, the other two don't. The final ws.close() will does not trigger it. However… 😬
No description
No description
No description
MrBBot
MrBBot10mo ago
Ok yeah, that's super weird 😅 my guess is this is related to us aborting all Durable Objects after all tests (and after each test when isolated storage is enabled). I just checked, and that ECONNREFUSED is coming from this fetch() call: https://github.com/cloudflare/workers-sdk/blob/1af469af63616a60c88752e1e170cfa59d41b1cd/packages/vitest-pool-workers/src/pool/loopback.ts#L203. I wonder if you do that await runInDurableObject(stub, (_, state) => state.blockConcurrencyWhile(() => { throw new Error() })); trick at the end of your test to abort the object yourself it fixes things?
Vincent
VincentOP10mo ago
I think it works! I can no longer crash it it seems like. However… I don't think I understand how it works.
MrBBot
MrBBot10mo ago
Yeah, if you replace the runInDurableObject(...) with import unsafe from "workerd:unsafe"; await unsafe.abortAllDurableObjects() does that still work?
Vincent
VincentOP10mo ago
Nope! That still crashes it.
MrBBot
MrBBot10mo ago
Very interesting... ok, so this is some sort of bug in abortAllDurableObjects() 🤔 But it sounds like this might've unblocked you?
Vincent
VincentOP10mo ago
No description
Vincent
VincentOP10mo ago
Maybe yeah, I'll try this approach in our actual tests soon, but this is looking promising. I'm still not sure how this fix works exactly. At the end of the test, abortAllDurableObjects() is called, but if there is an outstanding (or "firing"?) promise somehow, it crashes? And the way to work around that is by crashing the DO ourselves before the pool worker does?
MrBBot
MrBBot10mo ago
Yeah, my guess is that abortAllDurableObjects() isn't cleaning up Durable Objects properly (i.e. it doesn't stop all in-progress WebSocket messages), and the 2nd ws.send() ends up accessing something that's been freed somehow.
Vincent
VincentOP10mo ago
Interestingly, doing a afterEach(() => wait(1)) will not solve this issue for us, which would have been the lowest-cost workaround. Will the test runner call abortAllDurableObjects() after the test is run, but before the "afterEach" callback?
MrBBot
MrBBot10mo ago
Hmmm, yeah it looks like that's the case 😦 That doesn't feel right 🤔
Vincent
VincentOP10mo ago
I tried moving the crash-ourselves-trick into a afterEach, but that unfortunately doesn't work either.
MrBBot
MrBBot10mo ago
This is because we reset storage after each retry, but the onAfterTryTask() hook fires before afterEach()
Vincent
VincentOP10mo ago
No worries, I can make a wrapper for the tests.
MrBBot
MrBBot10mo ago
If you could still create a GitHub issue for this though that would be great. 🙂
Vincent
VincentOP10mo ago
Just did, see https://github.com/cloudflare/workers-sdk/issues/5439 ! Thank you so much for the quick and amazing help, @MrBBot ! Super appreciate you taking the time and finding a workaround with us! Your tech support is top-notch. 🙏
GitHub
🐛 BUG: workerd runtime can crash in some cases where WebSockets and...
Which Cloudflare product(s) does this pertain to? Workers Runtime, Workers Vitest Integration What version(s) of the tool(s) are you using? @cloudflare/[email protected], [email protected]...
Vincent
VincentOP10mo ago
Just tried applying the workaround trick to our real test suite, and unfortunately we're still seeing the crashes there 😢 So manually crashing the durable object maybe it's not a complete workaround for us after all yet. It crashes in the same way after the test ends (so still crashing even after the await runInDurableObject(... /* crash trick */)).
Unknown User
Unknown User10mo ago
Message Not Public
Sign In & Join Server To View
Vincent
VincentOP10mo ago
Small update after playing more with this today: we're still hitting this error quite a lot, while converting our WebSocket based unit tests, beyond just the onAfterTryTask scope I think. It looks like the Vitest pool worker aborting the durable object is just one way to trigger this Fatal uncaught kj::Exception: kj/async.c++:2164: failed: expected !firing; Promise callback destroyed itself. issue, but not the only one. We've been seeing this issue also pop up mid-test, i.e.
__console.log('here 1');
await something(); // Crash here
__console.log('here 2');
__console.log('here 1');
await something(); // Crash here
__console.log('here 2');
When those crashes happen, we're not seeing the here 2 appear in the console, so this means things crash before onAfterTryTask even has a chance to run. I think the root cause of this error lies deeper than inside Vitest pool worker. Maybe it's more likely a bug inside workerd itself?
MrBBot
MrBBot10mo ago
Maybe it's more likely a bug inside workerd itself?
Yeah, that seems likely. I'm actually going to be leaving Cloudflare tomorrow so probably won't have time to personally look into this. GitHub issues are probably the best way to track this though.
MrBBot
MrBBot10mo ago
There's this open issue on workerd about segfaults with hibernation, I suspect this is related: https://github.com/cloudflare/workerd/issues/1422
GitHub
🐛 wrangler dev error: Received signal #11: Segmentation fault: 11 ·...
Hello! Recently we've been seeing this issue with one of our worker scripts (we have three, the other two seem fine) where after some time being used in dev mode it will segfault and become unr...
Vincent
VincentOP10mo ago
Oh noes, you will be missed, @MrBBot! Thanks for all the awesome work you did on the Vitest integration, and for all the recent help. I wish you all the best with whatever is next for you ❤️

Did you find this page helpful?