Extremely curious transaction/timeout errors starting on 3/17

Hello, starting on March 17th, we noticed a reproducible timeout on a durable object when deleting items from storage. We have a fix, but we do not understand why it works or how this failure started happening. The error only happens on specific durable objects and not on others, even with the exact same data stored.

Here are the reproduction steps (but again, it only fails on specific durable objects):

1. Client request to a worker -> fetch to durable object -> durableObjectStorage.list() of ~37,000 values, totaling around 4MB. This request works normally and returns 200.
2. Immediately make another client request to the same worker -> fetch to durable object -> durableObjectStorage.list() of the same 37,000 values -> durableObjectStorage.delete(chunks). This returns a 500 timeout.

Without the initial get request, it works. So our theory is that something internal to durable objects leaked from the initial get.

We had no recent deploys before this started occurring. We tried with allowUnconfirmed: true and the error went away, but obviously the storage didn't actually get deleted. Then we tried allowUnconfirmed: true with a sync() after all the deletes, and the error changed to "Transaction failed due to conflict". We searched the workerd source code but could not find this error. We could not find it on Google either. We were testing in isolation, with no other incoming requests and no other storage operations besides the deletes.

I'll post the fix in a reply, as the message is too long for Discord.

While we have a fix, we are a bit troubled for the following reasons:

1. We had not deployed anything for 5 days before this started happening.
2. The exact same data works on some durable objects but not on others. This is our biggest source of confusion.
3. It ONLY fails when called directly after a previous successful call.
1 Reply
JonathanR, 8mo ago
During our investigation we tried many fixes, but the one that worked was this change:
// FAILS -> DurableObjectTimeout
for (const chunk of chunked(allKeysToDelete, 128)) {
  await this.dos.delete(chunk);
}

// Also FAILS -> Transaction failed due to conflict
for (const chunk of chunked(allKeysToDelete, 128)) {
  await this.dos.delete(chunk, { allowUnconfirmed: true });
}
await this.dos.sync();

// WORKS, no errors.
for (const chunk of chunked(allKeysToDelete, 128)) {
  await this.dos.delete(chunk, { allowUnconfirmed: true });
  await this.dos.sync();
}
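
For anyone who wants to try the working variant outside our codebase, here is a minimal sketch. The chunked() helper is not shown anywhere above, so the version here is only a guess at what it does (split an array into pieces of at most N items; 128 matches the documented per-call key limit for delete()). The DeleteAndSync interface, deleteInChunks and FakeStorage are hypothetical names made up for illustration. The interface covers only the two storage methods used in the snippets, and the in-memory fake does not reproduce the timeout or the conflict error; it only shows the call order.

```typescript
// Hypothetical helper: a guess at chunked(), which the thread uses but never shows.
function chunked<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Assumed minimal shape: only the two storage methods the snippets above call.
interface DeleteAndSync {
  delete(keys: string[], options?: { allowUnconfirmed?: boolean }): Promise<number>;
  sync(): Promise<void>;
}

// The "WORKS" variant: unconfirmed delete, then sync() after every chunk.
async function deleteInChunks(
  storage: DeleteAndSync,
  keys: readonly string[],
  chunkSize = 128,
): Promise<number> {
  let deleted = 0;
  for (const chunk of chunked(keys, chunkSize)) {
    deleted += await storage.delete(chunk, { allowUnconfirmed: true });
    await storage.sync();
  }
  return deleted;
}

// In-memory stand-in for local experiments; it records the order of calls.
class FakeStorage implements DeleteAndSync {
  readonly data = new Set<string>();
  readonly calls: string[] = [];

  async delete(keys: string[]): Promise<number> {
    this.calls.push(`delete:${keys.length}`);
    let removed = 0;
    for (const key of keys) {
      if (this.data.delete(key)) removed++;
    }
    return removed;
  }

  async sync(): Promise<void> {
    this.calls.push("sync");
  }
}
```

Inside a durable object this would be called as deleteInChunks(this.dos, allKeysToDelete). It does not explain why syncing per chunk avoids the error; it only isolates the pattern that worked for us.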