~5% of our Worker requests never return from Cloudflare after the worker returns
We recently migrated our production API backend onto Workers compute and now ~5% of requests never get a response: in devtools the requests remain pending forever (i.e. several minutes, until I get bored), with no headers or status code received. We're experiencing the issue across all browsers and client apps.
When I encounter it on my machine, I am able to identify these requests in the Workers observability logs, and they appear as a successful response, i.e.
outcome: "ok"
and $cloudflare.event.response.status: 200
but they remain hanging and never return from Cloudflare's edge.
How can the response remain hanging in Cloudflare after the worker returns?? What mechanisms could be at play here?
Can someone give me tips on how to debug this? Can I give some Ray IDs or worker/account IDs to the team to investigate?
This ~5% failure rate on our API is causing frustration for our users and customers, so any urgent tips/ideas/support would be greatly appreciated 🙏
Can you share a screenshot of your Worker's Metrics tab, error section? Then I may be able to give you some tips.
Here
However, as I mentioned, the hanging requests do not error. In the worker logs the worker returns successfully, so these error metrics are not showing the requests in question.
@Brett Willis, I can give you a clue to fix it, because I have encountered the same problem before and fixed it; maybe your problem is the same as mine.
Go look at your source code: possibly you have some dangling promises, where you return the response before the body or data is ready.
Double-check your async/await functions and make sure you actually await their results.
We initially thought this might be the case too; however, with dangling promises the worker would not return. Or, if the dangling promise was in a response body stream, the response headers would be returned but the body would not complete.
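To illustrate the distinction, here's a minimal sketch (illustrative only, not our actual code) of the streamed-body case, where the client gets the status and headers but the body never finishes:

```javascript
// Illustrative sketch only (not our actual code). Node 18+ and the Workers
// runtime both provide Response, TransformStream, and TextEncoder as globals.

function makeHangingBodyResponse() {
  const { readable, writable } = new TransformStream();
  const writer = writable.getWriter();
  writer.write(new TextEncoder().encode("partial data"));
  // Bug: writer.close() is never called, so the client receives the
  // 200 status and headers but the body never completes.
  return new Response(readable, { status: 200 });
}

function makeCompleteResponse() {
  // A fully materialized body: once the handler returns this, there is
  // nothing left for the worker to produce.
  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { "content-type": "application/json" },
  });
}
```

In our case the handler ends with the second form, so a dangling body stream shouldn't apply, and even the first form would at least deliver headers.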
However, what I'm saying is that the worker does successfully return, i.e. it ends with
return new Response(JSON.stringify(...), { status: 200 })
and the worker event logs show the successful outcome of the request.
So something is happening to the requests in the Cloudflare network after the worker returns.
Hmm 🤔 regarding your words, I agree with you that it may be related to the CF network: the response was queued to send but was never sent, or got lost in the network (which is possible on unstable networks). For your specific case, you need the CF team to answer you to be sure.
In addition, if these 5% of requests that never get responses come from your end-users' analytics, it is possible some of your users have unstable connections or bad networks.
I hope I could be of some help. 🙏 Good luck.
The ~5% number is estimated based on automated client error reporting and health/uptime checks. I experience it several times an hour on a stable internet connection.
Thanks for your effort 🙏 🙏
I hope someone from the CF team can help me out.
A 5% failure rate would be quite a red flag, enough to trigger intervention by CF if it were a general issue in the CF network, imo.
I am more inclined to suspect the source code has some stuff hanging around, too.
Can someone explain how code could prevent the response from reaching the client after the worker returns?
I'm wondering now if perhaps it's something to do with Smart Placement (Beta).
Which routes ~1% of requests differently through the network.
We're using an undocumented placement hint configuration which places the worker in the specified region rather than estimating the region. It seemingly still routes the "1% of requests differently" despite the placement.hint. This is suspiciously similar to the small percentage of requests that inexplicably never return, seemingly at random, across every route in the application. I wonder if it's a bug in the smart placement hint.
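(For reference, a sketch of the config we mean; the `hint` key is the undocumented part being discussed, and the region value here is just an example, not necessarily ours:)

```toml
# Sketch of a Smart Placement config in wrangler.toml.
# "hint" is the undocumented key discussed here; region value illustrative.
[placement]
mode = "smart"
hint = "wnam"  # example region code (western North America)
```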
Can't blame CF when using an undocumented part of a beta-level feature. However, disabling the placement hint would also be catastrophic to our application's performance, as you can see from this chart (Smart Placement without the hint does not work at all, due to sub-requests mostly being the CDN-backed Google APIs).
Send me a few details please:
1. account id
2. worker name
3. repro url
Account ID:
ced135ec1b3e0e4976dd00637c438d03
Worker name: api
It seems to happen to any/all routes, but we've confirmed it definitely happens on our health check route, which is relatively cheap to hit: https://api.hhcapp.com/_health. The normal response is 204.
Thanks, looking
Much appreciated 🙏
ok good news it isn't the rollout I expected
This is our health check latency (last 48 hrs, times in PST); the spikes represent the hanging requests. There appears to be a clearing for the past ~12 hrs, although there is only a small percentage chance that it hits the health checker's requests (it could still be happening to other requests).
Also it's been happening for a week or two so it's not something recent. We just recently managed to capture it.
I have enabled tracing for your zone for the next week - when you see this again shoot me a ray id and I can see what is going wrong
(note: will need the ray id within 3 days before it falls out of retention)
Perfect, thank you so much!!