Only one tunnel gets checked by the load balancer monitor

I have a load balancer with two tunnels attached. For some reason, only one tunnel seems to get monitor checks. I think this might also apply to health check traffic, but I need to double check. Traffic is otherwise split roughly equally between the two tunnels, and when there is an outage in that cluster that affects both, the tunnel that doesn't seem to get monitor checks will still register as "healthy". Does anyone have any idea what might be going on here?
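One way to confirm where the monitor probes actually land is to watch cloudflared on each tunnel host and compare what arrives. A minimal sketch, assuming a typical systemd install, a placeholder tunnel name, and whatever address you pass to --metrics (none of these are details from this setup):

```bash
# Tail cloudflared on each tunnel host while the LB monitor probes fire
# (assumes cloudflared runs as a systemd unit named "cloudflared").
journalctl -u cloudflared -f

# Or expose cloudflared's local Prometheus metrics and compare request counts
# between the two hosts. Skip the "run" line if the tunnel is already running
# with metrics enabled; 127.0.0.1:2000 is just an example listen address.
cloudflared tunnel --metrics 127.0.0.1:2000 run <TUNNEL-NAME> &
curl -s 127.0.0.1:2000/metrics | grep -i request
```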
11 Replies
nomaxx117 (OP) · 11mo ago
confirmed that all the health checks also seem to go to one tunnel
Cyb3r-Jak3 · 11mo ago
Random shot in the dark: are the tunnels connected to the same DCs? Tunnel traffic typically goes through the closest location (not officially documented, though). Wondering if the health checks are coming from a DC that only one of the tunnels is connected to.
nomaxx117 (OP) · 11mo ago
They are both connected to the DFW-A PoP, though different colos. I've turned off the health checks for now to reduce the noise a bit and focus on the knobs I have around the monitors.
Cyb3r-Jak3 · 11mo ago
There might be some internals to how the tunnel traffic is being routed, but no clue.
nomaxx117 (OP) · 11mo ago
That's what I'm wondering. I've uninstalled and reinstalled that tunnel to no avail.
Strange - removing the tunnel that gets all the monitor requests still leaves the other tunnel getting none.
Umm, why are they both one connector? How did I get myself into this situation?
Wait a minute - on worker 1 (the one getting all the traffic), cloudflared tunnel info returns the same ID for both tunnels: the ID of worker 1's tunnel. On worker 2, the correct IDs are returned.
Wat. I am bamboozled.
Cyb3r-Jak3 · 11mo ago
That’s uh funky. Like the connector or the tunnel IDs?
nomaxx117 (OP) · 11mo ago
from heavy-worker-1:
➜ ~ cloudflared tunnel info heavy-worker-1
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-2
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-1
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-2
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
from heavy-worker-2:
➜ ~ cloudflared tunnel info heavy-worker-1
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-2
NAME: heavy-worker-2
ID: b7561864-<REST>
CREATED: 2024-01-14 00:49:37.245692 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
d537845f-<REST> 2024-01-14T00:51:00Z linux_arm64 2024.1.2 104.13.171.136 1xdfw06, 1xdfw09, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-1
NAME: heavy-worker-1
ID: aff69054-<REST>
CREATED: 2023-05-19 22:41:36.32161 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
7b612dfc-<REST> 2024-01-14T00:34:27Z linux_arm64 2024.1.2 104.13.171.136 1xdfw01, 1xdfw05, 2xmci01
➜ ~ cloudflared tunnel info heavy-worker-2
NAME: heavy-worker-2
ID: b7561864-<REST>
CREATED: 2024-01-14 00:49:37.245692 +0000 UTC

CONNECTOR ID CREATED ARCHITECTURE VERSION ORIGIN IP EDGE
d537845f-<REST> 2024-01-14T00:51:00Z linux_arm64 2024.1.2 104.13.171.136 1xdfw06, 1xdfw09, 2xmci01
heavy-worker-1 gets all the monitor traffic. How is this possible? I continue to find novel ways of breaking computers lol. @Cyb3r-Jak3 I legitimately have no idea how I did this lmao
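For anyone hitting the same symptom: with locally-managed tunnels, both hosts reporting the same tunnel and connector ID usually points at them running from the same config/credentials file, so they register as two connectors of one tunnel. That is only a guess at what happened here, and the paths below are the usual defaults rather than anything confirmed in the thread:

```bash
# Check which tunnel each host's cloudflared is actually configured to run.
# Locally-managed tunnels read config.yml, whose "tunnel:" and "credentials-file:"
# entries decide which tunnel this connector registers against.
cat /etc/cloudflared/config.yml ~/.cloudflared/config.yml 2>/dev/null
# tunnel: aff69054-...                                    <- the same UUID on both hosts means
# credentials-file: /root/.cloudflared/aff69054-....json     they are two connectors of ONE tunnel

# Compare against the tunnels the account actually has.
cloudflared tunnel list
```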
Cyb3r-Jak3 · 11mo ago
lol, heavy-worker-1 seems cursed
nomaxx117 (OP) · 11mo ago
It really is lol. I'm just gonna delete the tunnel and make a new one.
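If it helps anyone later, the delete-and-recreate flow for a locally-managed tunnel looks roughly like this; the names and paths are placeholders, and the last step depends on how the load balancer pool's origins are addressed:

```bash
# Stop the connector so the tunnel has no active connections, then delete it.
sudo systemctl stop cloudflared
cloudflared tunnel delete heavy-worker-2

# Recreate it; this writes a fresh credentials JSON under ~/.cloudflared/.
cloudflared tunnel create heavy-worker-2

# Point config.yml at the new tunnel ID / credentials file, restart the service,
# and re-add the tunnel's hostname as an origin in the load balancer pool.
sudo systemctl start cloudflared
```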
Cyb3r-Jak3 · 11mo ago
Remote managed tunnels for the win
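Remotely-managed meaning the dashboard-driven flavour: the tunnel is created from the Zero Trust dashboard and the host only needs the connector token. Roughly, with the token value as a placeholder:

```bash
# Install cloudflared as a service using the connector token copied from the dashboard.
sudo cloudflared service install <CONNECTOR-TOKEN>
# Ingress rules and hostname routing are then managed in the dashboard instead of config.yml.
```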
nomaxx117 (OP) · 11mo ago
That did not fix the issue. I am bamboozled. Neither did making the tunnels remote.
How would one do this? Also, I too am curious about how I got myself into this mess 😂 I'll go check these.
Cloudflared tunnel metrics show even RPS to each tunnel. Weird, but this disagrees with what the logs show when I tail journalctl. I'll do more digging - wonder if something is borked on my end. Will be double-checking my logging.

So, I figured this out. My nginx cache was broken on the node seeing elevated traffic, and my metrics were generated in a proxy layered after nginx. I first instrumented cloudflared and saw even RPS. Then I instrumented nginx and saw the same. I then realized that the node with higher traffic saw identical traffic before and after the nginx layer, despite the supposed presence of a cache - so maybe that was the broken node? Tailing the error logs there revealed a classic permissions failure that was causing the cache to be circumvented.

As for why only one node got health check alerts, it appears to be due to the nature of the outages I was looking at: the outages were in things like my Redis cluster, not the front-line workers themselves. The cache was preventing the health checks on the second worker from spotting availability issues. I remediated this by fixing the permissions issue and disabling caching on the health check endpoint I was using on both workers.

Part of what threw me off here was that load was legitimately imbalanced: the lack of a cache meant more RPS from one worker than the other hitting everything behind nginx, so things like CPU and memory usage were higher.
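A rough sketch of the two fixes described above, in case it saves someone the same debugging session; the /healthz path, ports, cache directory, and nginx user are assumptions, not details from the thread:

```bash
# 1) Fix the cache directory permissions so nginx's worker user can write to it,
#    then confirm the "permission denied" errors stop appearing.
sudo chown -R www-data:www-data /var/cache/nginx
sudo systemctl reload nginx
sudo tail -f /var/log/nginx/error.log

# 2) Never serve the health check endpoint from cache, on both workers.
cat <<'EOF' | sudo tee /etc/nginx/conf.d/healthcheck.conf
server {
    listen 8080;                          # placeholder port for the health check vhost
    location /healthz {
        proxy_cache off;                  # bypass the cache so checks see real backend health
        proxy_pass http://127.0.0.1:9000; # upstream app that actually touches Redis etc.
    }
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```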