~5% of our Worker requests never return from Cloudflare after the worker returns
We recently migrated our production API backend onto Workers compute and now ~5% of requests never get a response: in devtools the requests remain pending forever (i.e. several minutes, until I get bored), with no headers or status code received. We're experiencing the issue across all browsers and client apps.
When I encounter it on my machine, I am able to identify these requests in the Workers observability logs, and they appear as a successful response, i.e.
outcome: "ok"
and $cloudflare.event.response.status: 200
but they remain hanging and never return from Cloudflare's edge.
How can the response remain hanging in Cloudflare after the worker returns?? What mechanisms could be at play here?
Can someone give me tips on how to debug this? Can I give some Ray IDs or worker/account IDs to the team to investigate?
This ~5% failure rate on our API is causing frustration for our users and customers, so any urgent tips/ideas/support would be greatly appreciated 🙏
Can you share a screenshot of your Worker's Metrics tab, error section? Then I may be able to give you some tips.
Here
However, as I mentioned, the hanging requests do not error. In the worker logs the worker returns successfully, so these error metrics are not showing the requests in question.
@Brett Willis, I can give you a clue to fix it, because I have encountered the same problem before and fixed it; maybe your problem is the same as mine.
Go look at your source code: possibly you have some dangling promises, where you return the response before the body or data is ready.
Double-check your async/await functions and make sure you actually await their results.
We initially thought this might be the case too; however, with dangling promises the worker would not return. Or, if the dangling promise was in a response body stream, the response headers would be returned but the body would not complete.
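To illustrate the distinction, here's a minimal sketch (illustrative only, not our actual code) of the streamed-body case, where the client gets the status and headers but the body never finishes:

```javascript
// Illustrative sketch only (not our actual code). Node 18+ and the Workers
// runtime both provide Response, TransformStream, and TextEncoder as globals.

function makeHangingBodyResponse() {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  writer.write(new TextEncoder().encode("partial data"));
  // Bug: writer.close() is never called, so the client receives the
  // 200 status and headers but the body never completes.
  return new Response(readable, { status: 200 });
}

function makeCompleteResponse() {
  // A fully materialized body: once the handler returns this, there is
  // nothing left for the worker to produce.
  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { "content-type": "application/json" },
  });
}
```

In our case the handler ends with the second form, so a dangling body stream shouldn't apply, and even the first form would at least deliver headers.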
However, what I'm saying is that the worker does successfully return, i.e. it ends with
return new Response(JSON.stringify(...), { status: 200 })
and the worker event logs show the successful outcome of the request.
So something is happening to the requests in the Cloudflare network after the worker returns.
Hmm 🤔 regarding your words, I agree with you that it may be related to the CF network: the response was queued to send but was never sent, or got lost in the network (which is possible on unstable networks). For your specific case, you need the CF team to answer you to be sure.
In addition, if these 5% of requests that never get responses come from your end-users' analytics, it is possible some of your users have unstable connections or bad networks.
I hope I could be of some help. 🙏 Good luck.
The ~5% number is estimated based on automated client error reporting and health/uptime checks. I experience it several times an hour on a stable internet connection.
Thanks for your effort 🙏 🙏
I hope someone from the CF team can help me out.
A 5% failure rate would be quite a red flag, enough to trigger intervention by CF if it were a general issue in the CF network, imo.
I am more inclined to suspect the source code has some stuff hanging around, too.
Can someone explain how code could prevent the response from reaching the client after the worker returns?
I'm wondering now if perhaps it's something to do with Smart Placement (Beta).
Which routes ~1% of requests differently through the network.
We're using an undocumented placement hint configuration which places the worker in the specified region rather than estimating the region. It seemingly still routes the "1% of requests differently" despite the placement.hint. This is suspiciously similar to the small percentage of requests that inexplicably never return, seemingly at random, across every route in the application. I wonder if it's a bug in the smart placement hint.
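(For reference, a sketch of the config we mean; the `hint` key is the undocumented part being discussed, and the region value here is just an example, not necessarily ours:)

```toml
# Sketch of a Smart Placement config in wrangler.toml.
# "hint" is the undocumented key discussed here; region value illustrative.
[placement]
mode = "smart"
hint = "wnam"  # example region code (western North America)
```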
Can't blame CF when using an undocumented part of a beta-level feature. However, disabling the placement hint would also be catastrophic to our application's performance, as you can see from this chart (Smart Placement without the hint does not work at all, due to sub-requests mostly being the CDN-backed Google APIs).
Send me a few details please:
1. account id
2. worker name
3. repro url
Account ID:
ced135ec1b3e0e4976dd00637c438d03
Worker name: api
It seems to happen to any/all routes, but we've confirmed it definitely happens on our health check route, which is relatively cheap to hit: https://api.hhcapp.com/_health. The normal response is 204.
Thanks, looking
Much appreciated 🙏
ok good news it isn't the rollout I expected
This is our health check latency (last 48 hrs, times in PST); the spikes represent the hanging requests. There appears to be a clearing for the past ~12 hrs, although there is only a small percentage chance that it hits the health checker's requests (it could still be happening to other requests).
Also it's been happening for a week or two so it's not something recent. We just recently managed to capture it.
I have enabled tracing for your zone for the next week - when you see this again shoot me a ray id and I can see what is going wrong
(note: will need the ray id within 3 days before it falls out of retention)
Perfect, thank you so much!!