Hey Cloudflare team! We're making great
Hey Cloudflare team! We're making great progress in migrating our code base from Jest to Vite and to new Vitest integration—we're almost done! There is however one remaining issue that we cannot figure out ourselves. It's really hard to explain for us what's going on here, but fortunately it can be reliably reproduced, so I've made a screencast. Maybe you could help us with this? Because currently it's blocking us from finishing our migration.
The screencast/demo:
🍿 https://share.cleanshot.com/d914hhK8RFYpDP5N9Ldg
Relevant parts from the reproduction repo:
The worker definition: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/src/index.ts#L1-L17
The test: https://github.com/nvie/ws-promise-bug-repro/blob/main/durable-objects/test/illustrate-bug.test.ts#L19-L34
Any help would be greatly appreciated! 🙏
27 Replies
Woah! Thank you for the super helpful screencast. It definitely looks like the
workerd
runtime is crashing, and that's what's causing the ECONNREFUSED
you're seeing. The next step here would be to try reproduce this using workerd
only. The Vitest integration does a few weird things with Durable Objects/service bindings internally that might be causing the issue here, but it would good to rule out a workerd
-bug.
I should be able to look into this next week, but if you wanted to try yourself, this sample would be a good starting point: https://github.com/cloudflare/workerd/tree/main/samples/durable-objects-chat
Actually, could you try something like this...
Essentially, move the addEventListener()
before the accept()
I should have mentioned this in the screencast, but that's actually something we tried too indeed. It makes no difference:
- Still works when sending only one message
- Still crashes when sending two messages
- Still works when you put
await wait(1)
somewhere
- The only difference it makes is that the test will no longer timeout when you move the wait after the second message but before the waitForMessage
call. It actually makes sense in that case!
Unfortunately, I don't have the time to try a setup against workerd
directly (and it's a little too far out of my comfort zone tbh), but it sounds like a good plan to me. Really curious if you can figure out what's going on here.Ok no worries. If you can open a GitHub issue with your original message, I can add this to my list. 👍 One other quick thing to try, does adding
webSocketError()
and webSocketClose()
handlers to your Demo
object change anything?Just empty ones?
Maybe
__console.log(arguments)
within them just to see if they're called with anything useful?
(__console
is just console
without Vitest patching)Just added them (also added it to the
webSocketMessage
one), it makes no behavioral difference in any of the variants. Only the webSocketMessage
one gets called, the other two don't.
The final ws.close()
will does not trigger it. However… 😬Ok yeah, that's super weird 😅 my guess is this is related to us aborting all Durable Objects after all tests (and after each test when isolated storage is enabled). I just checked, and that
ECONNREFUSED
is coming from this fetch()
call: https://github.com/cloudflare/workers-sdk/blob/1af469af63616a60c88752e1e170cfa59d41b1cd/packages/vitest-pool-workers/src/pool/loopback.ts#L203. I wonder if you do that await runInDurableObject(stub, (_, state) => state.blockConcurrencyWhile(() => { throw new Error() }));
trick at the end of your test to abort the object yourself it fixes things?I think it works! I can no longer crash it it seems like.
However… I don't think I understand how it works.
Yeah, if you replace the
runInDurableObject(...)
with import unsafe from "workerd:unsafe"; await unsafe.abortAllDurableObjects()
does that still work?Nope! That still crashes it.
Very interesting... ok, so this is some sort of bug in
abortAllDurableObjects()
🤔
But it sounds like this might've unblocked you?Maybe yeah, I'll try this approach in our actual tests soon, but this is looking promising. I'm still not sure how this fix works exactly. At the end of the test,
abortAllDurableObjects()
is called, but if there is an outstanding (or "firing"?) promise somehow, it crashes? And the way to work around that is by crashing the DO ourselves before the pool worker does?Yeah, my guess is that
abortAllDurableObjects()
isn't cleaning up Durable Objects properly (i.e. it doesn't stop all in-progress WebSocket messages), and the 2nd ws.send()
ends up accessing something that's been freed somehow.Interestingly, doing a
afterEach(() => wait(1))
will not solve this issue for us, which would have been the lowest-cost workaround. Will the test runner call abortAllDurableObjects()
after the test is run, but before the "afterEach" callback?Hmmm, yeah it looks like that's the case 😦
That doesn't feel right 🤔
I tried moving the crash-ourselves-trick into a
afterEach
, but that unfortunately doesn't work either.This is because we reset storage after each retry, but the
onAfterTryTask()
hook fires before afterEach()
No worries, I can make a wrapper for the tests.
If you could still create a GitHub issue for this though that would be great. 🙂
Just did, see https://github.com/cloudflare/workers-sdk/issues/5439 ! Thank you so much for the quick and amazing help, @MrBBot ! Super appreciate you taking the time and finding a workaround with us! Your tech support is top-notch. 🙏
GitHub
🐛 BUG: workerd runtime can crash in some cases where WebSockets and...
Which Cloudflare product(s) does this pertain to? Workers Runtime, Workers Vitest Integration What version(s) of the tool(s) are you using? @cloudflare/[email protected], [email protected]...
Just tried applying the workaround trick to our real test suite, and unfortunately we're still seeing the crashes there 😢 So manually crashing the durable object maybe it's not a complete workaround for us after all yet.
It crashes in the same way after the test ends (so still crashing even after the
await runInDurableObject(... /* crash trick */)
).Unknown User•10mo ago
Message Not Public
Sign In & Join Server To View
Small update after playing more with this today: we're still hitting this error quite a lot, while converting our WebSocket based unit tests, beyond just the
onAfterTryTask
scope I think. It looks like the Vitest pool worker aborting the durable object is just one way to trigger this Fatal uncaught kj::Exception: kj/async.c++:2164: failed: expected !firing; Promise callback destroyed itself.
issue, but not the only one. We've been seeing this issue also pop up mid-test, i.e.
When those crashes happen, we're not seeing the here 2
appear in the console, so this means things crash before onAfterTryTask
even has a chance to run. I think the root cause of this error lies deeper than inside Vitest pool worker. Maybe it's more likely a bug inside workerd
itself?Maybe it's more likely a bug inside workerd
itself?
Yeah, that seems likely. I'm actually going to be leaving Cloudflare tomorrow so probably won't have time to personally look into this. GitHub issues are probably the best way to track this though.There's this open issue on
workerd
about segfaults with hibernation, I suspect this is related: https://github.com/cloudflare/workerd/issues/1422GitHub
🐛 wrangler dev error: Received signal #11: Segmentation fault: 11 ·...
Hello! Recently we've been seeing this issue with one of our worker scripts (we have three, the other two seem fine) where after some time being used in dev mode it will segfault and become unr...
Oh noes, you will be missed, @MrBBot! Thanks for all the awesome work you did on the Vitest integration, and for all the recent help. I wish you all the best with whatever is next for you ❤️