Workers running in a different continent from the requester
I'm in Brazil; my result for https://www.cloudflare.com/cdn-cgi/trace is:
fl=97f585
h=www.cloudflare.com
ip=2804:7f4:REDACTED
ts=1732064404.743
visit_scheme=https
uag=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
colo=GRU
sliver=010-tier1
http=http/3
loc=BR
tls=TLSv1.3
sni=plaintext
warp=off
gateway=off
rbi=off
kex=X25519MLKEM768
And yet most of my workers are running either in EWR or IAD with high latency.
Those workers do some fetches and use Hyperdrive, all against servers located in Brazil, so running in the US is killing response times. Even Smart Placement has no effect.
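(For context, Smart Placement here just means the placement setting in the Worker config; roughly this, a minimal wrangler.toml sketch with an illustrative name:)

name = "my-api-worker"             # illustrative
main = "src/index.ts"
compatibility_date = "2024-11-01"

[placement]
mode = "smart"                     # Smart Placement on; removing this block disables it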
I created a "hello worker" without smart placement just to double check and got the same results:
The IP is local and low latency(13ms) -> Remote Address: [2606:4700:3030::6815:3513]:443
But the actual execution is not(200ms+) -> cf-ray: 8e549030fbc66a4e-EWR / cf-ray:8e54982c1a7e28a0-IAD
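The hello worker itself is essentially just this (a minimal sketch, assuming @cloudflare/workers-types for the cf property; it reports the colo it believes it is executing in):

export default {
  async fetch(request: Request): Promise<Response> {
    // request.cf.colo is the IATA code of the data center actually executing the Worker;
    // the cf-ray header the client sees on the response ends with the same code.
    const colo = request.cf?.colo ?? "unknown";
    return Response.json({ colo, country: request.cf?.country });
  },
};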
There are plenty of Cloudflare POPs that are closer and lower latency, and yet the workers seem to be running in the US for some reason.
Is there anything that can be done to address this?
Is your zone Pro or higher?
I have Workers Standard (Paid), accessing from worker-name.account.workers.dev, so technically there are no zones involved, I think 😅
Yeah, I think workers.dev counts as a Free-tier Zone, so that might be why
Paid-plan zones get access to colos in more expensive regions. I know that South America/Oceania are generally on the pricier side. You can try https://debug.chaika.me/?findColo=true to see where you get placed per plan
Workers just run wherever your request gets routed to on the CDN layer
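If you want to do the same check by hand, something like this works (a rough sketch for Node 18+ ESM; swap in hostnames from zones on the plans you want to compare, any Cloudflare-proxied hostname that serves /cdn-cgi/trace will do, the workers.dev one below is hypothetical):

// sample /cdn-cgi/trace on a few hostnames and print which colo answers
const hosts = [
  "www.cloudflare.com",              // Enterprise zone, for comparison
  "your-worker.account.workers.dev", // hypothetical workers.dev hostname
];

for (const host of hosts) {
  const body = await (await fetch(`https://${host}/cdn-cgi/trace`)).text();
  const colo = body.match(/^colo=(.+)$/m)?.[1] ?? "?";
  console.log(`${host} -> ${colo}`);
}

Since (without Smart Placement) a Worker runs on the machine the request lands on, the colo you see here is also where the Worker would execute.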
If setting up a Pro zone solves the issue, I'll gladly do it 😄
Based on that small sample size, it looks like only Business plan or higher is being routed locally for you. What's your ISP? I haven't heard much about routing issues in South America; it's mostly Indian ISPs demanding money for peering and causing so-so routing
ISP is Vivo
Not sure it's ISP related though, as it was working yesterday... perhaps prefix related? Or cost related, as suggested by @Hard@Work?
Anything I could do? I can definitely do Pro if it will fix it, but Business is out of my price range, unfortunately.
Also, Vivo has 3 ASNs (resulting from acquisitions) that I'm aware of, so that might be it as well. I remember getting IPs from at least two of them on my home fiber at different times.
Routing is always ISP related, at least to a degree. There are never any guarantees about which plans get which colocations or specific routes, except for Enterprise, and even then ISPs can just override things and do whatever they want. You could VPN to a nearby datacenter, and on a datacenter connection you'd probably get routed to the local Cloudflare data center. Otherwise it is probably ISP related and not Cloudflare location related, as Cloudflare does have a good amount of capacity down there, but we'd need more reports/info to know
But I'm getting to the edge on GRU, so it does not seem routing related. I'm 13 ms from the IP that is responding to my request. It is just the worker allocation that seems to be traversing the CF network and running up north.
Just re-ran the test. I'm resolving to Remote Address: [2606:4700:3030::6815:3513]:443 and my latency to that address is under 20 ms, but the cf-ray comes back marked as IAD. Based on that, I would assume it is not a routing issue: I'm entering CF at a POP that is near, perhaps not the nearest, but good enough, and from there I'm internally routed to IAD.
On a separate note, it seems not even the plan tier can guarantee placement, since in this last run Pro was local but Business was remote.
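For reference, this is roughly how I'm checking (a sketch, Node 18+ ESM; the workers.dev hostname is a placeholder): it opens a fresh TLS connection per request, records the connect time to whatever edge IP it lands on, and reads the colo out of the cf-ray response header.

import https from "node:https";

const url = new URL("https://hello-worker.example.workers.dev/"); // placeholder

function probe(): Promise<void> {
  return new Promise((resolve, reject) => {
    const start = performance.now();
    let connectMs = -1;
    // agent: false forces a new connection each time, so connect time is measured per request
    const req = https.get(url, { agent: false }, (res) => {
      const colo = String(res.headers["cf-ray"] ?? "").split("-")[1] ?? "?";
      res.resume(); // discard the body
      res.on("end", () => {
        console.log(
          `edge ip=${res.socket.remoteAddress} ` +
            `tls-connect=${connectMs.toFixed(0)}ms ` +
            `total=${(performance.now() - start).toFixed(0)}ms ` +
            `ray-colo=${colo}`
        );
        resolve();
      });
    });
    req.on("socket", (socket) =>
      socket.once("secureConnect", () => { connectMs = performance.now() - start; })
    );
    req.on("error", reject);
  });
}

for (let i = 0; i < 10; i++) await probe();

In my runs, connect times stay low while ray-colo flips between GRU/GIG and IAD/EWR, which is what makes me think the hop happens inside the CF network.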
Well, I'd appreciate any tips on how to route stuff locally. Running in any in-country DC would probably work latency-wise, but when the worker runs in the US while pulling data from a DB in Brazil, the DB RTT kills it. Before anyone suggests it, D1 is also not an option, as it does not run in South America at all.
Re "But I'm getting to the edge on GRU": it's absolutely routing related; you're not consistently being routed to the closer location. Your note that routing isn't consistent and that you sometimes hit GRU on the Free plan reinforces that. Sometimes issues like these are transient and just get fixed; an ISP trying to optimize routes can accidentally fling traffic far away, etc. Higher plans get access to more paths/routes overall, which helps.

If you don't have Smart Placement on, Workers run on the same edge machine your request hits, the same CDN machine; they're not routed to a Workers cluster or anything like that, so it's a pretty simple deployment.

Routing is a sort of shared responsibility: the provider (Cloudflare) publishes specific routes/paths to reach them, but your ISP has the final say in which one it picks. Generally when stuff goes wonky it's safe to blame the ISP, since they can always manually override things and it's usually in their interest to keep traffic local and cheap. It could be Cloudflare changing stuff too, but we'd need more samples/info to say anything more. A traceroute of the bad route might show something interesting, but probably wouldn't be too helpful.
IP-wise I am being routed consistently to a local IPv6, according to ICMP. The CF-resolved IP address changes, but it is consistently under 20 ms of ICMP response time. What is not being consistently routed is the worker request, internally it appears. I seem to enter the CF network at a local POP, usually either GRU or GIG; ICMP shows that, on both ping and traceroute, and SSL negotiation times do as well. Once I get into the CF network, my workers run either in IAD or EWR, so this is CF routing the HTTP request internally, not at the peering layer but once they already have control of the request.
From what I remember from the time I worked in telco, a route is a route. Even when it's flapping, requests at (virtually) the same time will hit the same path, and BGP is not all that fast (we're talking seconds here, not ms). At the IP level I am hitting local 100% of the time according to ICMP (when pinging the same IP curl is resolving), so it's hard to imagine a situation where my ICMP request would hit local but the HTTP request would hit remote for the same destination IP at the same time, even with anycast. That sort of network problem would mean trouble for the most basic stuff, like TCP or SSL handshakes. That is why I'm assuming I hit local, and after I hit local, the request gets routed/proxied internally, by CF, at the HTTP layer, to a remote location. I don't know why CF would do that, but I imagine limited capacity at the POP would be one reason.
As for workers running on the same edge machine they hit, that is the theory, yes. But in partial rerouting scenarios, load shedding, and limited CPU/memory capacity, I have to imagine it gets more complicated than that, and they would rather proxy the request than fail it altogether. That being the case, I would also assume there are provisions in the infra to trigger that rerouting for other reasons.
I'm guessing here on everything besides the network stuff so take it with a grain of salt.
At the end of the day, I don't really mind being proxied/routed away from my local POP, but I would love to be proxied to one of the dozens of POPs that are at least in the same country. Crossing a continent is what adds the latency and creates the problems for me, so I think I only have two questions:
1) Why go to a different continent instead of a closer POP? (capacity?)
2) Is there anything that can be done about it?
The debug test site uses the virtual /cdn-cgi/trace paths, not Workers, so it'd probably just be a network-level forward for HTTP requests or something simpler. They've posted about their traffic manager before: https://blog.cloudflare.com/meet-traffic-manager; if they run out of capacity for a specific plan they remove anycast routes and predict where traffic is going to land
Considering their posts on the architecture, there is not much network level once a request hits a POP; everything is application level from there on (for a user request at least). What I mean by that is that I assume anycast IPs are frontline only and everything internal will use DC/POP-specific IPs, so that location control is retained. TLS is terminated at ingress and then they implement what they call a "chain of proxies". The only option to move that load before that would be the L4 router, but that is also under CF control, so not related to the ISP. After that, there are a number of places where this "proxy to remote" could happen in that picture.

The way they architect it is pretty interesting, but when they say (in many blog posts) that every server runs every service, I believe we should not take it to mean one request will always be resolved within one server. Locking themselves into that would remove a great deal of flexibility from the infrastructure and mean they would have to run every server with a larger safety margin. Retaining the ability to fence off a request at any point in the chain, or at least at a few key points, would be the way to go, albeit adding complexity to the stack.
On a separate note there is more and more CF stuff running on top of Workers. If they kept working on a successor to Flame and it succeeded, then there is a possibility that even /cdn-cgi/trace could be answered by a worker. https://blog.cloudflare.com/building-cloudflare-on-cloudflare/#can-we-replace-internal-services-with-workers
Yeah, they do have a lot of good blogs about their arch; here's the one about their internal LB: https://blog.cloudflare.com/unimog-cloudflares-edge-load-balancer/. It's just L4 to the server, which handles the entire request. Flame will be interesting, but I don't think it changes much here
It is curious though, I was able to reproduce what you saw. Randomly, requests hitting GRU show as being handled by MIA/IAD/EWR. It looks very network level though: you can make a request to a Biz site with an Ent IP (and a curl override) and not see it, and vice versa, to an Ent site with a Biz IP, and you do see it, with random requests ending up handled out of country. The fact that they're going so far away is weird when there are lots of closer locations; it must be pretty limited in scope, because there are no other forum mentions or anything
We try to serve traffic as close as possible; however, sometimes due to operational reasons (temporary compute or network capacity issues for various reasons, such as subsea cuts, etc.) this is not possible, and our traffic tools will move traffic around to avoid impact
While it's true higher plans have less of a chance of being shifted, it's not a guarantee, as our operational need to avoid impact will outweigh serving traffic from a specific POP if doing so would impact traffic
As you can see here, a certain % of traffic is being shifted out for some plans, which is why the traffic is only served from GRU occasionally
If you can log a support case with what your Workers are doing and the impact your users observe when the shift occurs, we can use it to help prioritise any potential capacity increases in the region, so I highly recommend logging a ticket
Can you please share a traceroute/MTR as well? I'll check which colo in GRU it's hitting
The going far is the part that troubles me... moving from 2-10 ms RTT to 100-150 ms RTT to my database means 3 quick queries go from 15-20 ms to 400-450 ms. Response time increases build up real fast.
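Back-of-the-envelope for what that does to a request running a few sequential queries (just arithmetic on the RTTs above, ignoring query execution time):

// N sequential queries each pay at least one DB round trip
const queries = 3;
const localRttMs: [number, number] = [2, 10];     // Worker in Brazil -> DB in Brazil
const remoteRttMs: [number, number] = [100, 150]; // Worker in IAD/EWR -> DB in Brazil

const range = ([lo, hi]: [number, number]) => `${queries * lo}-${queries * hi}ms`;
console.log(`local:  ${range(localRttMs)}`);   // 6-30ms of pure RTT
console.log(`remote: ${range(remoteRttMs)}`);  // 300-450ms of pure RTT, before query execution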
I'm on Workers paid but CDN free... When I try to go into the ticketing area I get sent here: https://developers.cloudflare.com/support/account-management-billing/cannot-locate-dashboard-account/
My workers connect to a PostgreSQL database, run a few queries and return JSON, really simple stuff. I tried to use D1, only to find it does not run in SA, which means a guaranteed 150-200 ms best-case scenario; that was a bit of a letdown, hence the move to PG.
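The shape of those workers is basically this (a trimmed-down sketch; the binding, driver and table names are illustrative, and it assumes a Hyperdrive binding in the config plus the postgres.js driver):

import postgres from "postgres"; // driver choice is an assumption; any pg-compatible client works

interface Env {
  // Hyperdrive binding (name illustrative); exposes a connection string that
  // points at the pooled path to the Postgres instance in Brazil
  HYPERDRIVE: Hyperdrive;
}

export default {
  async fetch(_request: Request, env: Env): Promise<Response> {
    const sql = postgres(env.HYPERDRIVE.connectionString);
    try {
      // a few small queries; each one pays the Worker->DB round trip,
      // which is why the executing colo matters so much here
      const rows = await sql`SELECT id, name FROM items LIMIT 10`; // table is made up
      return Response.json(rows);
    } finally {
      await sql.end();
    }
  },
};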
Sure thing. I'll run them later today as soon as I get some time.
2606:4700:3033::ac43:cfc6 cycles the cf-ray between GRU/GIG/MIA/SJC/IAD
2606:4700:3030::6815:3513 cycles the cf-ray between GRU/GIG/IAD
BTW, GIG is new; when I first started the thread I hadn't seen GIG a single time. Latencies for both GRU/GIG are good, MIA/SJC/IAD not so much... It seems there are 32 POPs in Brazil, 9 of them with AI inference, so I'm assuming larger facilities, and 68 POPs in Latin America overall. I'll bet at least 40 of them would be better options for me latency-wise. Judging by the fact that my requests are going for a swim and ending up in the US, I'm assuming most of Latin America is capacity constrained right now in one way or another. Would that be a fair assumption?
Also, you mentioned operational needs vs specific POPs... My issue is being too far from my database, and there are at least 40 or 50 POPs that are closer latency-wise. Is there a plan that would keep me close(r)?
I develop stuff on contract, and average latency is usually part of the spec. If there is a plan, I can always offer it to the customers, and if they are fine with the cost, that would work as well.
On a completely different front: having D1 available in a location in Latin America would also solve my data locality problem.