
So my Hyperdrive connection to MySQL just started throwing errors, or rather it stopped responding: https://screen.bouma.link/fGflgtK5X2vH5wrX98nZ From the Cloudflare Dashboard everything is "Active" (Hyperdrive) and "Healthy" (the Tunnel), and cloudflared is also running without any log output. But workers connecting through it throw:
Connection lost: The server closed the connection.
Any clue where to start debugging this? The MySQL server is doing fine and nothing has changed on my end (I was asleep when the incident started), but I expected there to be some visible fault somewhere 😅 Cloudflare-side issue?
[Screenshot: 2025-04-14 at 03.56.10]
AJR
AJRβ€’6d ago
Need your Hyperdrive ID, and I'll take a look in the morning. I assume direct connections with a different mysql client work correctly?
Alex
AlexOPβ€’6d ago
Yes, in fact I removed the worker route to let it fall back to my origin and everything is working again (my origin talks to the same database). Luckily the worker is basically just an "optimization", so removing it is fine. Thanks for looking into this. I'll be trying to catch some winks too, so I'll be back in a couple of hours 🙂 The ID: e091675e42a94e789ab05718442dce6a
AJR
AJRβ€’6d ago
I also see all your metrics fall to 0 at that time. I see you're going across a tunnel. My first thing to check would be to restart the tunnel with loglevel=debug, to see if you're still successfully authing through there.
Alex
AlexOPβ€’6d ago
So restarting cloudflared did seem to help, which is pretty unfortunate. The problem did come back pretty quickly though, within a few minutes. Now running under debug log level, but so far no output other than the startup.
cloudflared --no-autoupdate tunnel --loglevel debug --log-directory /var/log/cloudflared run --token <REDACTED
is running at the moment. I also opened the connector diagnostics page in the Zero Trust dashboard; no errors that I can find there. Access analytics show no failed logins for the Hyperdrive application, only many successful authentication attempts. The database is still running perfectly and has ~45 connections left, but no Hyperdrive connections are being made to it and the worker is still throwing errors. Restarting cloudflared and re-deploying the worker seem to have no effect. Hopefully y'all will be able to tell me where the pain lies and what broke, because I am at a loss 😅 The worker had been running without issues since Saturday morning and stopped working Sunday evening. I have not deployed the worker or made any changes to the server config for the whole of Sunday.
AJR
AJRβ€’6d ago
Ok, for now this is going into the category of "beta bug that we're still working to RCA". If it ends up being tunnel weirdness we'll figure that out, but I'm working from the assumption that this is a gap somewhere in how we're handling the wire protocol. With that said (feel free to answer in DMs if you're more comfortable with that):
* Can you share as much as possible about your hosting, including the specific MySQL version, on-prem vs. PaaS, etc.?
* Can you share as much as possible about your queries/access patterns: query examples, whether you're using transactions, etc.?
Thank you!
Alex
AlexOPβ€’6d ago
Interesting, okay cool. Let me try to answer as much as possible; luckily it's IMHO a very simple setup, which might help 🙂
- I am using mysqld Ver 8.0.41-0ubuntu0.20.04.1 for Linux on x86_64 on a self-managed VPS, not with a public cloud provider. It has both v4 and v6 internet connectivity. The tunnel is configured to talk to 127.0.0.1:3306. I have configured it with a user that has only SELECT and SHOW VIEW privileges on a single database.
- I run 2 queries in my worker. As far as I can tell the first query already fails (which means it never executes the second). I am using Drizzle with the MySQL2 connector. According to Drizzle's logs it executes:
Query: select `id`, `uuid`, `team_id`, `domain`, `with_links`, `is_default`, `target`, `include_path`, `include_query`, `redirect_default_not_found` from `custom_domains` where `custom_domains`.`domain` = ? limit ? -- params: ["example.com", 1]
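To make that concrete, the Worker side looks roughly like the sketch below (simplified; the HYPERDRIVE binding name and the ./schema import are placeholders, and only the first, failing query is shown):
```ts
// worker.ts – simplified sketch of what the Worker does per request.
// The HYPERDRIVE binding name and the ./schema import are placeholders.
import { drizzle } from "drizzle-orm/mysql2";
import { eq } from "drizzle-orm";
import { createConnection } from "mysql2/promise";
import { customDomains } from "./schema"; // Drizzle table definition for `custom_domains`

export interface Env {
  HYPERDRIVE: Hyperdrive;
}

export default {
  async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
    // Connect inside the fetch handler (not in global scope), via the Hyperdrive binding.
    const connection = await createConnection({
      host: env.HYPERDRIVE.host,
      user: env.HYPERDRIVE.user,
      password: env.HYPERDRIVE.password,
      database: env.HYPERDRIVE.database,
      port: env.HYPERDRIVE.port,
      disableEval: true, // eval() is not available in Workers
    });
    const db = drizzle(connection);

    // First of the two queries; this is the one that already fails.
    const host = new URL(request.url).hostname;
    const domain = await db
      .select()
      .from(customDomains)
      .where(eq(customDomains.domain, host))
      .limit(1);

    // ...second query and response handling omitted...
    return Response.json(domain);
  },
};
```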
I am happy to answer any specific questions and/or run some non-destructive commands, or even give you access to the Hyperdrive if needed (since it's read-only, that's no problem). But of course, if we arrange that, we should move to DMs 😄
Alex
AlexOPβ€’5d ago
[Screenshot: 2025-04-15 at 13.18.08]
Alex
AlexOPβ€’5d ago
"Magically" started working again
AJR
AJRβ€’5d ago
Man. I'm gonna have your ID memorized by May. I can tell.
Alex
AlexOPβ€’5d ago
Not sure that's a good thing... 🫣
AJR
AJRβ€’5d ago
We haven't released any changes since yesterday, to be clear.
Alex
AlexOPβ€’5d ago
Oh... I did do a deployment this morning
AJR
AJRβ€’5d ago
Worker or Hyperdrive?
Alex
AlexOPβ€’5d ago
Worker
AJR
AJRβ€’5d ago
Okay. That shouldn't interact with your Hyperdrive config at all, really. Just for context. I'm going to start with another run-through of the logs for you when I get to my desk this morning. I want to see how that all looks.
Alex
AlexOPβ€’5d ago
I ran yarn upgrade (from a quick glance, mysql2 and other related libs weren't in there) and I also lowered the compat date to 2025-04-02. No actual code changes, in case it matters. Let's see how long it keeps working this time then! I also haven't touched the MySQL server at all. So not even 3 hours, from the looks of it. Looking at my MySQL server's process list, when the first errors started rolling in there were 2 connections, then 1, and now 0. It also took a minute for requests to start consistently failing; I'm guessing some of that was the query cache. But now it's a 100% failure rate again. In the cloudflared logs I see:
{"level":"debug","event":1,"connIndex":0,"originService":"tcp://127.0.0.1:3306","ingressRule":0,"destAddr":"tcp://127.0.0.1:3306","time":"2025-04-15T11:39:48Z","message":"upstream->downstream copy: read tcp 127.0.0.1:42476->127.0.0.1:3306: use of closed network connection"}
{"level":"debug","event":1,"connIndex":0,"originService":"tcp://127.0.0.1:3306","ingressRule":0,"destAddr":"tcp://127.0.0.1:3306","time":"2025-04-15T11:39:48Z","message":"upstream->downstream copy: read tcp 127.0.0.1:42476->127.0.0.1:3306: use of closed network connection"}
I checked the MySQL values and wait_timeout is 8 hours. Not sure if other timeouts could be in play here, which is where my first thought went on seeing this behaviour. I would still expect Hyperdrive to handle this and create a new connection, but maybe it's wrongly detecting a max_connections situation here. But now I'm assuming based on nothing... I'll let you do the actual root-causing!
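For what it's worth, this is the kind of check I mean, run directly against the server rather than through Hyperdrive (rough sketch; the credentials are placeholders):
```ts
// timeout-check.ts – dump timeout-related variables and connection counts,
// connecting directly to MySQL (bypassing Hyperdrive and the tunnel).
import { createConnection } from "mysql2/promise";

const conn = await createConnection({
  host: "127.0.0.1",
  port: 3306,
  user: "readonly_user", // placeholder credentials
  password: "...",
});

const [timeouts] = await conn.query("SHOW GLOBAL VARIABLES LIKE '%timeout%'");
const [limits] = await conn.query("SHOW GLOBAL VARIABLES LIKE 'max_connections'");
const [threads] = await conn.query("SHOW GLOBAL STATUS LIKE 'Threads_connected'");
console.log({ timeouts, limits, threads });

await conn.end();
```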
AJR
AJRβ€’4d ago
Agreed, at least, that Hyperdrive is designed to drop bad connections and spin up new ones. That's a good angle to pursue as well: independent of why things somehow fell out of sync, why isn't it detecting that and doing the obvious thing? I'll keep you posted.
Quick follow-up here: we're adding some additional robustness to the health checks and auto-refresh behavior for MySQL connections. That'll go out in our next release, starting either today or tomorrow and done by Friday/Monday.
Alex
AlexOPβ€’4d ago
Hope to see stable service after that 🀘 Thanks for the update!
AJR
AJRβ€’3d ago
@Alex The release is out, we should be in a better spot for dropping/replacing bad connections for MySQL configs. Please let me know how it goes for you.
Alex
AlexOPβ€’3d ago
Very much going in the right direction! https://screen.bouma.link/TmpHFwkgHQv2KM6CDTsK
[Screenshot: 2025-04-17 at 22.11.25]
Alex
AlexOPβ€’3d ago
Let's see how it holds up over the weekend! Currently seeing ~22 connections to MySQL, which is way more than before (I don't think I've seen more than 2 before). So something is definitely better!
Alex
AlexOPβ€’2d ago
So now we are moving the other way 🤣 I have a 51-connection limit on my database, which should be plenty, but Hyperdrive is keeping 30+ connections idle for long stretches: https://screen.bouma.link/V2zQ00XFj5yCB9j0jYnH
[Screenshot: 2025-04-18 at 10.36.03]
Alex
AlexOPβ€’2d ago
Some connections have been idle for 4+ hours. In addition, it also had ~14 connections active within the last 60s. That broke my app 🙈 And this time not just the worker.
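For reference, this is roughly how I'm looking at idle times, again directly on the server (sketch; credentials are placeholders):
```ts
// idle-connections.ts – list sleeping connections and how long they've been idle,
// connecting directly to MySQL (not through Hyperdrive).
import { createConnection } from "mysql2/promise";

const conn = await createConnection({
  host: "127.0.0.1",
  port: 3306,
  user: "root", // placeholder credentials
  password: "...",
});

// For COMMAND = 'Sleep', TIME is the number of seconds the connection has been idle.
const [rows] = await conn.query(
  `SELECT ID, USER, HOST, COMMAND, TIME
     FROM information_schema.PROCESSLIST
    WHERE COMMAND = 'Sleep'
    ORDER BY TIME DESC`
);
console.log(rows);

await conn.end();
```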
AJR
AJRβ€’2d ago
Well that's not supposed to happen. We drop idle connections after 15 minutes. Generally the way this should work is that it will aggressively open connections whenever all available ones are in use, up to 60. Anything that hasn't had traffic in 15 minutes should be disconnected, though I'm assuming you don't have any middleware in your stack that'll hold things open until it gets an explicit close message?
Alex
AlexOPβ€’2d ago
Since I am using Drizzle, I am not 100% sure what exactly it is doing, of course. And I am not explicitly closing the connection to Hyperdrive either. But I also wouldn't expect a single isolate instance to live 4+ hours without any requests. At the least, I am not doing anything with the connection explicitly; I am even connecting in the fetch handler as opposed to in the global scope.
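To illustrate: there is nothing like the snippet below at the end of my handler (sketch only; it assumes the handler receives the ExecutionContext as ctx and the mysql2 connection is called connection):
```ts
// Sketch of what an explicit close would look like (I'm not currently doing this).
// end() returns a promise, so hand it to waitUntil so it can finish after the response is sent.
ctx.waitUntil(connection.end());
```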
AJR
AJRβ€’2d ago
Hyperdrive exists separately from the isolate; couldn't have warm connections otherwise. But no, it should only live for 15 minutes without traffic. I'm planning to bring this to the team, and we'll dig in starting today.
Alex
AlexOPβ€’2d ago
Happy to provide any details, and I can also share the worker code if that helps.
knickish
knickishβ€’2d ago
I think we've found the root cause of this issue, will let you know here once we've confirmed that and released a fix for it. Thanks for your patience
Alex
AlexOPβ€’2d ago
No worries. Happy to β€œhelp” nail this down by breaking it.
AJR
AJRβ€’2d ago
No quotes needed, every problem you find is one less that everyone has to deal with. We very much appreciate it.
