~5% of our Workers requests never return from Cloudflare after the Worker returns

We recently migrated our production API backend onto Workers compute, and now ~5% of requests never get a response. In devtools, for example, the requests remain pending indefinitely (several minutes, until I give up), with no headers or status code received. We see the issue across all browsers and client apps.

When I hit it on my machine, I can find these requests in the Workers observability logs, where they appear as successful responses, i.e. outcome: "ok" and $cloudflare.event.response.status: 200, yet they remain hanging and never return from Cloudflare's edge.

How can the response remain hanging in Cloudflare after the worker returns? What mechanisms could be at play here? Can someone give me tips on how to debug this? Can I give some Ray IDs or worker/account IDs to the team to investigate?

This ~5% failure rate on our API is frustrating our users and customers, so any urgent tips/ideas/support would be greatly appreciated 🙏
19 Replies
Ashkan, 21h ago
Can you share a screenshot of the error section of your Worker's metrics tab? I may be able to give you some tips.
Brett Willis (OP), 21h ago
Here
[attached screenshot]
Brett Willis (OP), 21h ago
However, as I mentioned, the hanging requests do not error. In the worker logs, the worker returns successfully, so these error metrics do not show the requests in question.
Ashkan, 21h ago
@Brett Willis, I can give you a clue, because I ran into the same problem before and was able to fix it; maybe yours is the same. Look at your source code: you possibly have some dangling promises, where you return the response before the body or the data is ready. Double-check your async/await functions and make sure you actually wait for the results.
Brett Willis (OP), 20h ago
We initially thought this might be the case too. However, with dangling promises, the worker would not return. Or, if the dangling promise were in a response body stream, the response headers would be returned but the body would not complete. What I'm saying is that the worker does successfully return, i.e. it ends with return new Response(JSON.stringify(...), { status: 200 }), and the worker event logs show the successful outcome of the request. So something is happening to the requests in the Cloudflare network after the worker returns.
[attached image]
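For readers following along, the dangling-promise pattern being discussed can be sketched roughly as below. This is a minimal illustration, not code from the affected worker: `loadData`, `buildBodyDangling`, and `buildBodyAwaited` are hypothetical names, and `loadData` stands in for any database call or sub-request.

```typescript
// Hypothetical stand-in for a database call or sub-request.
function loadData(): Promise<{ ok: boolean }> {
  return new Promise((resolve) => setTimeout(() => resolve({ ok: true }), 10));
}

// Dangling promise: the body is serialized before loadData() settles,
// so the payload never contains the loaded data.
function buildBodyDangling(): string {
  let result: { ok: boolean } | null = null;
  loadData().then((data) => {
    result = data; // runs only after the body below has been built
  });
  return JSON.stringify({ result });
}

// Awaited: the body is serialized only once the data is available.
async function buildBodyAwaited(): Promise<string> {
  const result = await loadData();
  return JSON.stringify({ result });
}
```

In a real Worker, background work that is not awaited would usually be handed to `ctx.waitUntil(...)` rather than left dangling. Note that neither variant explains a response that is logged as returned with status 200 and still never reaches the client, which is the point Brett is making.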
Ashkan, 20h ago
Hmm 🤔 Based on what you describe, I agree it may be related to the CF network: the response was queued to send but was never sent, or it got lost in the network (which is possible on unstable networks). For your specific case, you need the CF team to confirm. Also, if the 5% figure comes from analytics on your end users, it is possible that some of them have unstable connections or bad networks. I hope this helps. 🙏 Good luck.
Brett Willis (OP), 20h ago
The ~5% number is estimated based on automated client error reporting and health/uptime checks. I experience it several times an hour on a stable internet connection. Thanks for your effort 🙏 🙏 I hope someone from the CF team can help me out.
Iann, 7h ago
5% is high enough that it would have triggered intervention by CF if it were a general issue in the CF network, imo. I am also more inclined to think the source code has something left hanging.
Brett Willis (OP), 6h ago
Can someone explain how code can prevent the response from reaching the client after the worker returns? I'm wondering now if perhaps it has something to do with Smart Placement (Beta), which routes ~1% of requests differently through the network. We're using an undocumented placement hint configuration, which places the worker in the specified region rather than estimating the region.
Brett Willis (OP), 5h ago
It seemingly still does the "1% of requests differently" despite the placement.hint. This is suspiciously similar to the small percentage of requests that inexplicably never return, seemingly at random, across any route in the application. I wonder if it's a bug in the smart placement hint. Can't blame CF when we're using an undocumented part of a beta-level feature. However, disabling the placement hint would also be catastrophic to our application's performance, as you can see from this chart (Smart Placement without the hint does not work at all, because our sub-requests mostly go to the CDN-backed Google APIs).
[attached chart]
Walshy, 5h ago
send me a few details please: 1. account id 2. worker name 3. repro url
Brett Willis (OP), 5h ago
Account ID: ced135ec1b3e0e4976dd00637c438d03
Worker name: api
It seems to happen to any/all routes, but I've confirmed it definitely happens on our health check route, which is relatively cheap to hit: https://api.hhcapp.com/_health. The normal response is 204.
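One way to reproduce against the health check route is a probe with a client-side timeout, so that a hung request is recorded as such instead of staying pending forever. The following is a sketch under assumptions: `probe`, `FetchLike`, `ProbeResult`, and the 10-second default timeout are illustrative choices, not part of the original setup. Only the 204 expected status comes from the thread.

```typescript
// Minimal shape of fetch that the probe relies on (the global fetch fits it).
type FetchLike = (
  url: string,
  init: { signal: AbortSignal },
) => Promise<{ status: number }>;

type ProbeResult =
  | { kind: "ok"; status: number; ms: number }
  | { kind: "unexpected"; status: number; ms: number }
  | { kind: "hung"; ms: number }
  | { kind: "error"; message: string; ms: number };

// Issues one request and classifies the outcome. A request that has not
// produced response headers within timeoutMs is aborted and reported as hung.
async function probe(
  url: string,
  fetchFn: FetchLike,
  timeoutMs = 10_000, // assumed threshold, tune to your normal latency
  expectedStatus = 204, // the thread states the health route returns 204
): Promise<ProbeResult> {
  const controller = new AbortController();
  const started = Date.now();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchFn(url, { signal: controller.signal });
    const ms = Date.now() - started;
    return res.status === expectedStatus
      ? { kind: "ok", status: res.status, ms }
      : { kind: "unexpected", status: res.status, ms };
  } catch (err) {
    const ms = Date.now() - started;
    // If our own timer fired, treat the failure as a hang, not a network error.
    if (controller.signal.aborted) return { kind: "hung", ms };
    return { kind: "error", message: String(err), ms };
  } finally {
    clearTimeout(timer);
  }
}
```

Calling `probe("https://api.hhcapp.com/_health", fetch)` in a loop and recording the timestamp of every `hung` result would give a list of times to look up in the Workers observability logs.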
Walshy, 5h ago
Thanks, looking
Brett Willis (OP), 5h ago
Much appreciated 🙏
Walshy, 5h ago
ok good news it isn't the rollout I expected
Brett Willis (OP), 4h ago
This is our health check latency (last 48 hrs, times in PST); the spikes represent the hanging requests. There appears to be a clearing for the past ~12 hrs, although this only reflects the small percentage chance that the issue hits requests from the health checker (it could still be happening to other requests).
[attached chart]
Brett Willis (OP), 4h ago
Also it's been happening for a week or two so it's not something recent. We just recently managed to capture it.
Walshy, 4h ago
I have enabled tracing for your zone for the next week - when you see this again shoot me a ray id and I can see what is going wrong (note: will need the ray id within 3 days before it falls out of retention)
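Because the hung requests never deliver response headers to the client, the Ray ID has to be captured on the Worker side. A possible approach, sketched here under assumptions, is to have the client send its own request ID and log it next to the `cf-ray` request header inside the Worker. `buildRayLogEntry`, `HeaderReader`, and the `x-client-request-id` header are hypothetical names introduced for illustration; it is also assumed that the `cf-ray` header is present on the incoming request.

```typescript
// Minimal read-only view of request headers (a Workers Request's headers fit it).
interface HeaderReader {
  get(name: string): string | null;
}

interface RayLogEntry {
  rayId: string | null;
  clientRequestId: string | null;
  path: string;
  loggedAt: string;
}

// Builds one log record tying the Cloudflare Ray ID to a client-chosen ID,
// so a request the client saw hang can be looked up by its own ID later.
function buildRayLogEntry(
  headers: HeaderReader,
  url: string,
  now: Date = new Date(),
): RayLogEntry {
  return {
    rayId: headers.get("cf-ray"),
    // Hypothetical header: the client would generate and send this value.
    clientRequestId: headers.get("x-client-request-id"),
    path: new URL(url).pathname,
    loggedAt: now.toISOString(),
  };
}

// Inside a Worker's fetch handler this might be used as:
//   console.log(JSON.stringify(buildRayLogEntry(request.headers, request.url)));
```

When the client's timeout fires, it reports its request ID; searching the Worker logs for that ID then yields the Ray ID to pass along within the retention window.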
Brett Willis (OP), 4h ago
Perfect, thank you so much!!
