Intermittent slowness on one specific ISP using Worker as Origin (proxy) / maybe all Cloudflare
Hi – we've got two reports now (myself being one) of our sites intermittently loading surprisingly slowly for users on one specific ISP (https://bgp.tools/as/7303), and no reports yet from any other ISPs/regions. This has been the case for a few weeks now – not a recent event.
So far we've mostly replicated it on our own sites (hosted on Pages; we have a Worker as origin which is a very simple proxy that just passes the request through and adds a header with the hostname for routing purposes).
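For context, a minimal sketch of what that proxy Worker roughly looks like – the `example-project.pages.dev` origin and the `x-original-host` header name are placeholders for illustration, not our exact values:

```ts
// Sketch of the pass-through proxy Worker described above.
// It forwards the incoming request to the Pages origin unchanged and adds one
// header carrying the original hostname so the app can route on it.
// "example-project.pages.dev" and "x-original-host" are placeholders.
export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    const originalHost = url.hostname;

    // Point the request at the Pages deployment, keeping path and query intact.
    url.hostname = "example-project.pages.dev";

    // Re-use the incoming request (method, body, CF-* headers pass through)
    // and tag it with the original hostname for routing purposes.
    const proxied = new Request(url.toString(), request);
    proxied.headers.set("x-original-host", originalHost);

    return fetch(proxied);
  },
};
```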
Here's a video showing more details https://www.loom.com/share/26eaa5343cf54fa2a66b36539e795206.
Things I've tried:
- Using 1.1.1.1 as resolver or not (said WARP but it wasn't enabled) – same result
- Proxying against our own zone vs *.pages.dev – same result (so not zone-specific settings)
- Not related to any Cache HIT/MISS. For example, for media we use the Cache API and send a Server-Timing header (which actually also includes the Worker startup time) – you can see the time wasn't spent on any app logic of ours (rough sketch of that path after this list)
- When things work as expected, most of those requests that took seconds resolve in XX/XXX ms.
- Not seeing any security events when proxying against our zone (we thought we might be triggering some)
- Not seeing anything abnormal doing traceroute / ping to either our site, cloudflare.com or 1.1.1.1 (example in the video)
- Not seeing related errors or traces anywhere (a few `Client Disconnected` on the proxy, which could be related but could also be normal)
- Not seeing anything odd when opening speed.cloudflare.com (e.g. 0% packet loss)
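For the Cache API / Server-Timing point above, this is roughly the shape of that media path – a simplified sketch with illustrative names, not our exact implementation (the real header also includes the Worker startup time):

```ts
// Simplified sketch of the media path: serve from the Cache API when possible
// and report where the time went via a Server-Timing header.
// Names here are illustrative, not our exact implementation.
export default {
  async fetch(request: Request, _env: unknown, ctx: ExecutionContext): Promise<Response> {
    const started = Date.now();
    const cache = caches.default;

    // Try the Worker cache first.
    let response = await cache.match(request);
    let status = "hit";

    if (!response) {
      status = "miss";
      response = await fetch(request); // fall through to the origin
      // Store a copy without blocking the response (only cacheable responses stick).
      ctx.waitUntil(cache.put(request, response.clone()));
    }

    // Make headers mutable and attach the timing, so devtools show whether
    // slow requests are spending their time inside our own logic.
    const out = new Response(response.body, response);
    out.headers.append("Server-Timing", `app;dur=${Date.now() - started};desc="worker ${status}"`);
    return out;
  },
};
```

The relevant bit is that the `dur` value covers everything our code does, so when a request takes seconds but `dur` stays in the tens of ms, the time is being spent somewhere outside our logic.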
What would be recommended ways to debug or understand this issue more deeply? Is there any known security mechanism that could make requests hang, as opposed to returning a 429 / challenge / etc.? We're funneling quite a few requests to the same Worker, but it's still odd to see it happening only with one specific ISP so far.
Has the time come where I'll need to learn how to use Wireshark? 😄
Thanks!
8 Replies
> Is there any known security mechanism that could make requests hang, as opposed to returning a 429 / challenge / etc.?
Connections held open longer are more expensive; it's cheaper to just insta-block and move on. I'd run an MTR for a bit while the issue is happening and look for spikes that start at a specific hop and continue the entire way to the end, or for the path changing/not being consistent. How intermittent is this? I assume it's just Cloudflare and not anything else? If you're in Argentina, are you near Buenos Aires / which CF PoP are you connecting to? The one in Buenos Aires is curiously shown as partially routed – not sure when that started.
hey Chaika, thanks for the response, hope you're doing well!
1. I do connect to EZE (the CF PoP); I've occasionally seen SCL in the past – will keep an eye out for whether the issue is always with EZE
2. I've only been able to see it with our own sites, and it likely extends to all of Cloudflare (at least the CF dashboard feels much slower / gets stuck; it's been hard to 100% confirm this as it could be that my own network is congested at that time). It sometimes feels like we can brute-force it (e.g. refreshing 5-10 times with `pragma: no-cache`), but doing that on another network – or on the troublesome network most of the time – just works as expected
3. Great shout on `mtr` – I'm seeing a few hops with a high % of packet loss even when the site is currently working as expected on the troublesome network (you can see timings of XX ms instead of X seconds like in the video). On my other ISP, I see all IPv6 addresses with a persistent 0% loss.
I'll see how `mtr` looks when the sites are not working as expected – LMK if the current output already seems conclusive.

Packet loss or latency on a specific hop that doesn't continue all the way to the end doesn't matter – it's just the router(s) being silly.
If something were wrong with your connectivity to Cloudflare, I would expect that when it happens, you'd see latency or packet loss spike starting at the problem hop and continue all the way to the end. There's a really helpful guide for reading traceroutes/MTRs here if you're curious to learn more: https://archive.nanog.org/sites/default/files/10_Roisman_Traceroute.pdf
noted! absolutely, cheers for sending that – will take a deep look
managed to replicate it again:
1. `mtr` does not show issues from the glbx.net.ar hop to the last one (at most 1-2% packet loss, latency is as usual)
2. confirming I'm getting EZE (on https://cloudflare-test.judge.sh/ I sometimes get Business/Ent SCL, but the zone is currently Pro)
3. using 1.1.1.1 as DNS or not doesn't change anything – but enabling 1.1.1.1 + WARP does seem to immediately mitigate the problem.
> Connections held open longer are more expensive; it's cheaper to just insta-block and move on.
I wonder if this is it and we're somehow hitting a security mechanism on Cloudflare Workers (our zone) => Cloudflare Pages (either our zone or pages.dev). It's odd, as the issue doesn't only happen when "overusing" the sites (e.g. mad refreshing) – but that is one way that sometimes works to replicate it. There's no security event for country Argentina under Firewall (which either we or the EZE colo where the proxy runs should be triggering). There are `Client Disconnected` errors in the Worker, but nothing of interest appears when viewing the Worker logs. The proxy does re-use the incoming request object, so CF-* headers etc. pass through (I see Pages logs getting the X-* IP headers). In general most of these are immutable assets which even yield CF-Cache-Status: HIT (meaning they go through the Worker but never actually reach Pages – they'd resolve from the CDN cache altogether).

> I wonder if this is it and we're somehow hitting a security mechanism on Cloudflare Workers (our zone) => Cloudflare Pages (either our zone or pages.dev). It's odd, as the issue doesn't only happen when "overusing" the sites (e.g. mad refreshing) – but that is one way that sometimes works to replicate it. There's no security event for country Argentina under Firewall (which either we or the EZE colo where the proxy runs should be triggering).
It's not a security thing, I can almost guarantee you that. It's just not what's done. You'd see straight errors, not slowness. The last time they throttled someone using Workers due to congestion (not security – just an overloaded IX where they needed to shed traffic), it was always-on/persistent, and they wrote a whole blog post saying how they'd improve their processes and require approval & notification of the customer, etc. It just doesn't fit the bill.
> using 1.1.1.1 as DNS or not doesn't change anything – but enabling 1.1.1.1 + WARP does seem to immediately mitigate the problem.
What colo do you get when connected to WARP?
always `colo=EZE`
good context on it not being security (TBH this was my major concern, in terms of something that could potentially affect more users – I don't really mind if it turns out that this specific ISP has issues). It's interesting that the other person who can replicate it, on the same ISP, is not geographically close to me (although we'd both get routed to EZE too). It still sounds like an issue that will be personally insightful to troubleshoot/understand further.

https://blog.cloudflare.com/tcp-resets-timeouts/
https://radar.cloudflare.com/security-and-attacks/as7303#tcp-resets-and-timeouts
(still processing the article, but very interesting timing for this blog to come out haha – it looks like there are spikes of Post-SYN, though they seem to be Argentina-wide rather than just AS7303, although that AS probably makes up a large % of the attacks/traffic)
Although these requests are going over HTTP/3, so they're via QUIC, not TCP