Cloudflare Developers•2mo ago

@AJR Hmm, I _think_ I _may_ have found a

@AJR Hmm, I think I may have found a pretty bad bug. When you reconnect, are you doing a fresh DNS lookup, or are you maybe accidentally using a cached entry?

17 Replies

AJR•2mo ago

Reconnect in what sense? Like spinning up a new connection, or taking one from the connection pool?

daveOP•2mo ago

I should give context first, my bad. I have two instances in an RDS cluster. One endpoint is meant for RW, the other for RO. When AWS upgrades the cluster, it first reroutes the RO endpoint to point to the primary instance (which previously served only RW). Then it takes down the instance that was previously used for RO. Once the "RO" instance is done upgrading, it boots it back up, switches both DNS entries to point to the "RO" instance, and allows writes to it. It then takes down the "RW" instance. When the RW instance is back up, it reroutes the RW DNS endpoint to the "normal" RW instance and no longer allows the RO instance to be used for RW. At least, this is what I understand. The DNS endpoints look like this:

example-production.cluster-asdfasdf.us-east-1.rds.amazonaws.com
example-production.cluster-ro-asdfasdf.us-east-1.rds.amazonaws.com

and the CNAME for them has a TTL of 5 seconds.

AJR•2mo ago

Hmmm. So we keep connections live in the pools for a very long time, given that a large part of the point of the pooler is to not have to spin up new ones. Given that these changes are all at the DNS level, I suspect they would not sever existing connections, so traffic will still be sent to the original destinations throughout. Does that align with what you're seeing?

Hmm, no. The cluster rebooting should sever the connection. So OK, what behavior are you observing during this process?

daveOP•2mo ago

Yeah, that's what I thought too. The RW connection was completely failing while the primary RW instance was being upgraded, almost as if Hyperdrive did not see the new DNS entry. I do have VPC flow logs; might those be helpful? 🙂

AJR•2mo ago

I'd have said we do fresh DNS lookups on each new connection but clearly there's a gap somewhere. I'll pass this to the team to take a look. Any logs you have would be helpful, yes please.

daveOP•2mo ago

There is also a chance that I don't understand how failover is meant to work with RDS, but that seems wrong, since the AWS event logs seem to imply that they expect things to continue to work. One sec while I prep the logs.

AJR•2mo ago

We'll take a look. This isn't a scenario I, personally, had considered much. It's possible there's some caching in one of our layers; the networking stack is pretty involved, as you might imagine.

daveOP•2mo ago

Still preparing the logs. I'm getting some `PostgresError: Internal error.` now for some reason. Can I DM you?

Yeah, 100% errors now on some of the queries. This is not good.

AJR•2mo ago

Sure, DM me.

daveOP•2mo ago

Ahhh, so I figured it out. It was because AWS moved one of my instances into a private subnet after the upgrade...

AJR•2mo ago

I see. How are things looking now?

daveOP•2mo ago

@AJR tl;dr: there is a "bug", or rather lack of a feature: When the IP changes for a Hyperdrive hostname, it seems like Hyperdrive never uses the new IP (or at least not for a long time). IMO when a new IP is detected, new connections should be made and all new queries should go to the new endpoint.

AJR•2mo ago

Interesting. So we'd want to re-resolve the DNS periodically, and if it resolves to a different IP we need to axe the current pool and spin up a new one.

Okay. I'll get that written up for when we have some cycles to spend on it. I think that's perfectly reasonable.
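
For illustration only, the "re-resolve and swap the pool" idea could be sketched like this. It is not Hyperdrive's implementation; `Resolver`, `Pool`, `PoolFactory`, `sameAddresses`, and `DnsAwarePool` are hypothetical names, and the resolver is injected so the logic does not depend on any particular DNS client.

```typescript
// Hypothetical sketch: none of these names come from Hyperdrive itself.
type Resolver = (hostname: string) => Promise<string[]>;

interface Pool {
  close(): Promise<void>;
}

type PoolFactory = (addresses: string[]) => Pool;

// Order-insensitive comparison of two resolved address lists.
function sameAddresses(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((addr, i) => addr === sortedB[i]);
}

class DnsAwarePool {
  private addresses: string[] = [];
  private pool: Pool | null = null;
  private readonly hostname: string;
  private readonly resolve: Resolver;
  private readonly createPool: PoolFactory;

  constructor(hostname: string, resolve: Resolver, createPool: PoolFactory) {
    this.hostname = hostname;
    this.resolve = resolve;
    this.createPool = createPool;
  }

  // Re-resolve the hostname. If no pool exists yet, or the answer changed,
  // build a new pool and close the old one. Returns true when a pool was
  // created or replaced, false when the existing pool was kept.
  async refresh(): Promise<boolean> {
    const next = await this.resolve(this.hostname);
    if (this.pool !== null && sameAddresses(this.addresses, next)) {
      return false;
    }
    const old = this.pool;
    this.addresses = next;
    this.pool = this.createPool(next);
    if (old !== null) await old.close();
    return true;
  }
}
```

In Node, `dns.promises.resolve4` would fit the `Resolver` shape, and `refresh()` could run on a timer close to the record's TTL (5 seconds for the endpoints in this thread). Whether to drain in-flight queries before closing the old pool is left open here.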

daveOP•2mo ago

Without it, I think there might be issues when one of the RDS instances goes from RW -> RO (if the connections aren't killed, Hyperdrive will be trying to do insert statements in a read-only database).
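
Until something like that exists, one client-side stopgap might be to recognise the read-only failure and react to it. A minimal sketch, assuming the driver exposes the SQLSTATE on a `code` property and that the error reaches the client unchanged; `isReadOnlyError` is a hypothetical helper name. SQLSTATE `25006` is PostgreSQL's `read_only_sql_transaction`.

```typescript
// SQLSTATE 25006 is PostgreSQL's read_only_sql_transaction condition.
const READ_ONLY_SQLSTATE = "25006";

// Returns true when an unknown thrown value looks like a Postgres error
// whose `code` field carries the read-only SQLSTATE.
// Assumption: the driver puts the SQLSTATE on `code`.
function isReadOnlyError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const code = (err as { code?: unknown }).code;
  return code === READ_ONLY_SQLSTATE;
}
```

A caller could wrap writes in `try`/`catch` and, when `isReadOnlyError(e)` is true, alert or retry on a fresh connection instead of treating it as an ordinary query failure.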

AJR•2mo ago

Yeah, that sort of DNS-level sleight of hand, after a pool with connections is already spun up, would go completely unnoticed by Hyperdrive. RDS would need to sever the connections for us to spin up new ones, and if it doesn't then we just wouldn't. Agreed. Not a huge gap but also not the hardest to close. Might as well.

daveOP•2mo ago

Can I update just the hostname via the IP without needing to update the password/username?

AJR•2mo ago

You should be able to with the PATCH API endpoint, yes.
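
As a rough sketch of what such a request might look like: the URL path and the `origin.host` body shape below are assumptions based on Cloudflare's public REST API conventions, not something confirmed in this thread, and the API may require other fields (for example `port`) alongside `host`. `buildHostPatch` and `HostPatchRequest` are hypothetical names; check the Hyperdrive API reference before relying on this.

```typescript
// Hypothetical helper: builds, but does not send, a PATCH request that
// changes only the origin host of a Hyperdrive config.
interface HostPatchRequest {
  url: string;
  method: "PATCH";
  body: string;
}

function buildHostPatch(accountId: string, configId: string, host: string): HostPatchRequest {
  // Assumed base URL and path; verify against the Cloudflare API reference.
  const base = "https://api.cloudflare.com/client/v4";
  const path = `/accounts/${encodeURIComponent(accountId)}/hyperdrive/configs/${encodeURIComponent(configId)}`;
  return {
    url: base + path,
    method: "PATCH",
    // Only the host is included, so username and password are left untouched.
    body: JSON.stringify({ origin: { host } }),
  };
}
```

The result could then be passed to `fetch(req.url, { method: req.method, body: req.body, headers })`, with an `Authorization: Bearer <token>` header and `Content-Type: application/json`.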
