Hey Cloudflare team! We're making great progress migrating our code base from Jest to Vitest and the new Vitest integration, and we're almost done! There is, however, one remaining issue that we cannot figure out ourselves. It's really hard for us to explain what's going on here, but fortunately it can be reliably reproduced, so I've made a screencast. Maybe you could help us with this? It's currently blocking us from finishing our migration.
The screencast/demo:
🍿 https://share.cleanshot.com/d914hhK8RFYpDP5N9Ldg
Relevant parts from the reproduction repo:
The worker definition: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/src/index.ts#L1-L17
The test: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/test/illustrate-bug.test.ts#L19-L34
Any help would be greatly appreciated! 🙏
Woah! Thank you for the super helpful screencast. It definitely looks like the workerd runtime is crashing, and that's what's causing the ECONNREFUSED you're seeing. The next step here would be to try to reproduce this using workerd only. The Vitest integration does a few weird things with Durable Objects/service bindings internally that might be causing the issue here, but it would be good to rule out a workerd bug.
I should be able to look into this next week, but if you wanted to try yourself, this sample would be a good starting point: https://github.com/cloudflare/workerd/tree/main/samples/durable-objects-chat
Actually, could you try something like this... Essentially, move the addEventListener() before the accept().
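(A minimal sketch of what that reordering might look like, assuming a standard, non-hibernating WebSocketPair setup inside the Durable Object's fetch() handler; the actual repro may be structured differently:)

```ts
// Sketch only (not the repro's actual code): register the listener *before*
// accepting, so no message can arrive while no handler is attached yet.
export class Demo {
  async fetch(request: Request): Promise<Response> {
    const { 0: client, 1: server } = new WebSocketPair();

    server.addEventListener("message", (event) => {
      server.send(`echo: ${event.data}`);
    });
    server.accept();

    return new Response(null, { status: 101, webSocket: client });
  }
}
```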
I should have mentioned this in the screencast, but that's actually something we tried too. It makes no difference:
- Still works when sending only one message
- Still crashes when sending two messages
- Still works when you put await wait(1) somewhere
- The only difference it makes is that the test will no longer time out when you move the wait after the second message but before the waitForMessage call. That actually makes sense in that case!
Unfortunately, I don't have the time to try a setup against workerd directly (and it's a little too far out of my comfort zone, tbh), but it sounds like a good plan to me. Really curious if you can figure out what's going on here.
Ok, no worries. If you can open a GitHub issue with your original message, I can add this to my list. 👍 One other quick thing to try: does adding webSocketError() and webSocketClose() handlers to your Demo object change anything?
Just empty ones?
Maybe with a __console.log(arguments) within them, just to see if they're called with anything useful? (__console is just console without Vitest patching.)
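(For reference, those extra handlers would look roughly like this on the Demo object; this is a sketch assuming the handler names used in this thread, not the repro's actual class:)

```ts
// Sketch only, not the repro's actual Demo class.
// __console is the unpatched console global provided by the Vitest integration.
declare const __console: Console;

export class Demo {
  // constructor(state, env) and fetch() omitted; see the repro's index.ts.

  webSocketMessage(ws: WebSocket, message: string | ArrayBuffer) {
    __console.log("webSocketMessage", message);
  }

  webSocketError(ws: WebSocket, error: unknown) {
    __console.log("webSocketError", error);
  }

  webSocketClose(ws: WebSocket, code: number, reason: string, wasClean: boolean) {
    __console.log("webSocketClose", code, reason, wasClean);
  }
}
```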
Just added them (also added it to the webSocketMessage one); it makes no behavioral difference in any of the variants. Only the webSocketMessage one gets called, the other two don't.
The final ws.close() does not trigger it. However… 😬
Ok yeah, that's super weird 😅 My guess is this is related to us aborting all Durable Objects after all tests (and after each test when isolated storage is enabled). I just checked, and that ECONNREFUSED is coming from this fetch() call: https://github.com/cloudflare/workers-sdk/blob/1af469af63616a60c88752e1e170cfa59d41b1cd/packages/vitest-pool-workers/src/pool/loopback.ts#L203. I wonder: if you do that await runInDurableObject(stub, (_, state) => state.blockConcurrencyWhile(() => { throw new Error() })); trick at the end of your test to abort the object yourself, does it fix things?
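(In test form, that trick would look something like this; only the runInDurableObject call is taken from the suggestion above, while the binding name DEMO and the test body are assumptions:)

```ts
import { env, runInDurableObject } from "cloudflare:test";
import { it } from "vitest";

it("sends two messages", async () => {
  // Assumed binding name; use whatever the repro's wrangler config defines.
  const stub = env.DEMO.get(env.DEMO.idFromName("test"));

  // ... open the WebSocket and send the two messages as in the repro ...

  // Abort the Durable Object ourselves, before the pool worker's own cleanup runs.
  await runInDurableObject(stub, (_instance, state) =>
    state.blockConcurrencyWhile(() => {
      throw new Error("intentionally aborting the Durable Object");
    })
  );
});
```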
I think it works! I can no longer crash it, it seems.
However… I don't think I understand how it works.
Yeah. If you replace the runInDurableObject(...) with import unsafe from "workerd:unsafe"; await unsafe.abortAllDurableObjects(), does that still work?
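(That alternative, spelled out; workerd:unsafe is an internal module surfaced through the Vitest pool, so its availability outside that context is not guaranteed:)

```ts
// Verbatim idea from the suggestion above: let the integration's own abort
// path run instead of the manual blockConcurrencyWhile trick.
// @ts-ignore "workerd:unsafe" is an internal, untyped module.
import unsafe from "workerd:unsafe";

await unsafe.abortAllDurableObjects();
```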
Nope! That still crashes it.
Very interesting... ok, so this is some sort of bug in abortAllDurableObjects() 🤔
But it sounds like this might've unblocked you?
Maybe, yeah. I'll try this approach in our actual tests soon, but it's looking promising. I'm still not sure how this fix works exactly. At the end of the test, abortAllDurableObjects() is called, but if there is an outstanding (or "firing"?) promise somehow, it crashes? And the way to work around that is by crashing the DO ourselves before the pool worker does?
Yeah, my guess is that abortAllDurableObjects() isn't cleaning up Durable Objects properly (i.e. it doesn't stop all in-progress WebSocket messages), and the 2nd ws.send() ends up accessing something that's been freed somehow.
Interestingly, doing an afterEach(() => wait(1)) will not solve this issue for us, which would have been the lowest-cost workaround. Will the test runner call abortAllDurableObjects() after the test is run, but before the afterEach callback?
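(The attempted workaround, for clarity; wait here is assumed to be a simple sleep helper equivalent to the one in the repro's test suite:)

```ts
import { afterEach } from "vitest";

// Assumed helper; the repro's own wait() is presumably equivalent.
const wait = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

afterEach(() => wait(1));
```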
Hmmm, yeah, it looks like that's the case 😦
That doesn't feel right 🤔
I tried moving the crash-ourselves trick into an afterEach, but that unfortunately doesn't work either.
This is because we reset storage after each retry, but the onAfterTryTask() hook fires before afterEach().
No worries, I can make a wrapper for the tests.
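(A hypothetical sketch of what such a wrapper could look like; testWithAbort is a made-up name and not from the actual test suite. The test body returns the stub it used, and the wrapper aborts that object as the final step of the test itself, before onAfterTryTask() runs:)

```ts
// Hypothetical wrapper: run the crash-ourselves trick at the end of the test
// body, so it happens before the pool's onAfterTryTask() cleanup.
import { runInDurableObject } from "cloudflare:test";
import { it } from "vitest";

function testWithAbort(name: string, fn: () => Promise<DurableObjectStub>) {
  it(name, async () => {
    const stub = await fn();
    await runInDurableObject(stub, (_instance, state) =>
      state.blockConcurrencyWhile(() => {
        throw new Error("intentional abort");
      })
    );
  });
}
```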
If you could still create a GitHub issue for this though that would be great. 🙂
Just did, see https://github.com/cloudflare/workers-sdk/issues/5439! Thank you so much for the quick and amazing help, @MrBBot! Super appreciate you taking the time and finding a workaround with us! Your tech support is top-notch. 🙏
Just tried applying the workaround trick to our real test suite, and unfortunately we're still seeing the crashes there 😢 So manually crashing the Durable Object may not be a complete workaround for us after all.
It crashes in the same way after the test ends (so still crashing even after the await runInDurableObject(... /* crash trick */)).
Small update after playing more with this today: we're still hitting this error quite a lot while converting our WebSocket-based unit tests, beyond just the onAfterTryTask scope, I think. It looks like the Vitest pool worker aborting the Durable Object is just one way to trigger this Fatal uncaught kj::Exception: kj/async.c++:2164: failed: expected !firing; Promise callback destroyed itself. issue, but not the only one. We've been seeing this issue also pop up mid-test, i.e. when those crashes happen, we're not seeing the here 2 log line appear in the console, so things crash before onAfterTryTask even has a chance to run. I think the root cause of this error lies deeper than the Vitest pool worker. Maybe it's more likely a bug inside workerd itself?
Maybe it's more likely a bug inside workerd itself?
Yeah, that seems likely. I'm actually going to be leaving Cloudflare tomorrow, so I probably won't have time to personally look into this. GitHub issues are probably the best way to track this though.
There's this open issue on workerd about segfaults with hibernation; I suspect this is related: https://github.com/cloudflare/workerd/issues/1422
Oh noes, you will be missed, @MrBBot! Thanks for all the awesome work you did on the Vitest integration, and for all the recent help. I wish you all the best with whatever is next for you ❤️