
@AJR Hmm, I think I may have found a pretty bad bug. When you reconnect, are you doing a fresh DNS lookup, or are you maybe accidentally using a cached entry?
17 Replies
AJR
AJR•3w ago
Reconnect in what sense? Like spinning up a new connection, or taking one from the connection pool?
dave
daveOP•3w ago
I should give context first, my bad. I have two instances in an RDS cluster. One endpoint is meant for RW, the other is meant for RO. When AWS upgrades the cluster, it first reroutes the RO endpoint to point to the same primary instance (that was previously just for RW). Then it takes down the instance that was previously used for RO. Once the "RO" instance is done upgrading, it boots it back up. It switches both DNS entries to point to the "RO" instance and allows writes to it. It then takes down the "RW" instance. When the RW instance is back up, it reroutes the RW DNS endpoint to be the "normal" RW instance. It then no longer allows the RO instance to be used for RW. At least this is what I understand. The DNS endpoints look like this:
example-production.cluster-asdfasdf.us-east-1.rds.amazonaws.com
example-production.cluster-ro-asdfasdf.us-east-1.rds.amazonaws.com
and the CNAME for them has a TTL of 5 seconds.
AJR
AJR•3w ago
Hmmm. So we keep connections live in the pools for a very long time, given that a large part of the point of the pooler is to not have to spin up new ones. Given that these changes are all at the DNS level, I suspect they would not sever existing connections, so traffic will still be sent to the original destinations throughout. Does that align with what you're seeing? Hmm, no. The cluster rebooting should sever the connection. So OK, what behavior are you observing during this process?
dave
daveOP•3w ago
Yeah, that's what I thought too. The RW connection was completely failing when the primary RW instance was being upgraded, almost like Hyperdrive did not see the new DNS entry. I do have VPC flow logs, which might be helpful? 🙂
AJR
AJR•3w ago
I'd have said we do fresh DNS lookups on each new connection but clearly there's a gap somewhere. I'll pass this to the team to take a look. Any logs you have would be helpful, yes please.
dave
daveOP•3w ago
There is also a chance that maybe I don't understand how failover is meant to work with RDS, but that seems wrong, since the AWS event logs seem to imply that they expect things to continue to work. One sec while I prep the logs.
AJR
AJR•3w ago
We'll take a look. This isn't a scenario I, personally, had considered much. It's possible there's some caching in one of our layers; the networking stack is pretty involved, as you might imagine.
dave
daveOP•3w ago
Still preparing the logs. I'm getting some "PostgresError: Internal error." now for some reason. Can I DM you? Yeah, 100% errors now on some of the queries; this is not good.
AJR
AJR•3w ago
Sure, DM me.
dave
daveOP•3w ago
Ahhh, so I figured it out. It was because AWS moved one of my instances into a private subnet after the upgrade...
AJR
AJR•3w ago
I see. How are things looking now?
dave
daveOP•3w ago
@AJR tl;dr: there is a "bug", or rather lack of a feature: When the IP changes for a Hyperdrive hostname, it seems like Hyperdrive never uses the new IP (or at least not for a long time). IMO when a new IP is detected, new connections should be made and all new queries should go to the new endpoint.
AJR
AJR•3w ago
Interesting. So we'd want to re-resolve the DNS periodically, and if it resolves to a different IP, we need to axe the current pool and spin up a new one. Okay. I'll get that written up for when we have some cycles to spend on it. I think that's perfectly reasonable.
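To make the idea above concrete, here is a minimal sketch of "re-resolve periodically, and reset the pool if the answer changed". It is not Hyperdrive's implementation; `DnsAwarePool`, `resolve_ips`, and the `connect` factory are hypothetical names invented for illustration, and a real pooler would also need locking, timers, and draining of in-flight queries.

```python
import socket
from typing import Callable, FrozenSet, List


def resolve_ips(host: str, port: int = 5432) -> FrozenSet[str]:
    """Ask the system resolver for the host's current IPs.

    Note: the OS or an upstream resolver may still cache answers.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return frozenset(info[4][0] for info in infos)


class DnsAwarePool:
    """Hypothetical pool that drops its connections when the origin's IPs change."""

    def __init__(self, host: str, connect: Callable, resolve: Callable = resolve_ips):
        self.host = host
        self._connect = connect    # hypothetical factory: ip -> connection object
        self._resolve = resolve    # injectable so the logic can be tested offline
        self._ips = resolve(host)
        self._conns: List = []

    def acquire(self):
        """Return a pooled connection, opening one to a current IP if needed."""
        if not self._conns:
            ip = sorted(self._ips)[0]
            self._conns.append(self._connect(ip))
        return self._conns[0]

    def refresh(self) -> bool:
        """Re-resolve the hostname; reset the pool if the IP set changed.

        Returns True when the pool was reset. Intended to be called on a
        timer, for example at roughly the record's TTL.
        """
        current = self._resolve(self.host)
        if current == self._ips:
            return False
        for conn in self._conns:
            conn.close()
        self._conns = []
        self._ips = current
        return True
```

Comparing sets of addresses rather than a single address avoids resetting the pool just because the resolver returned the same records in a different order.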
dave
daveOP•3w ago
Without it, I think there might be issues when one of the RDS instances goes from RW -> RO (if the connections aren't killed, Hyperdrive will be trying to do insert statements in a read-only database).
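As a rough illustration of guarding against that RW -> RO case, a pool could probe each "writer" connection before reuse. This is only a sketch under assumptions: it takes any Python DB-API style connection (`cursor()`, `execute()`, `fetchone()`, `close()`), the helper names `is_read_only` and `keep_writable` are made up here, and it assumes a demoted instance reports itself via the standard PostgreSQL `pg_is_in_recovery()` function or the `transaction_read_only` setting.

```python
def is_read_only(cursor) -> bool:
    """Return True if the server behind this cursor would reject writes.

    pg_is_in_recovery() is true on a PostgreSQL replica, and
    transaction_read_only is 'on' when the session cannot write.
    """
    cursor.execute(
        "SELECT pg_is_in_recovery(), current_setting('transaction_read_only')"
    )
    in_recovery, read_only = cursor.fetchone()
    return bool(in_recovery) or read_only == "on"


def keep_writable(conns):
    """Close connections that now point at a read-only server; return the rest."""
    writable = []
    for conn in conns:
        if is_read_only(conn.cursor()):
            conn.close()  # stale: this instance no longer accepts writes
        else:
            writable.append(conn)
    return writable
```

Running the probe on every checkout adds a round trip, so a real pooler would more likely run it periodically or after a write fails with a read-only error.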
AJR
AJR•3w ago
Yeah, that sort of DNS-level sleight of hand, after a pool with connections is already spun up, would go completely unnoticed by Hyperdrive. RDS would need to sever the connections for us to spin up new ones, and if it doesn't then we just wouldn't. Agreed. Not a huge gap but also not the hardest to close. Might as well.
dave
daveOP•3w ago
Can I update just the hostname via the API without needing to update the password/username?
AJR
AJR•3w ago
You should be able to with the patch API endpoint, yes
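For reference, a hedged sketch of what such a partial update might look like against the Cloudflare API. The account ID, config ID, token, and hostname are placeholders, `build_host_patch` is a made-up helper, and the assumption that the PATCH body accepts just `origin.host` and `origin.port` should be checked against the current Hyperdrive API reference before relying on it. The function only builds the request; nothing is sent.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_host_patch(account_id, config_id, api_token, host, port=5432):
    """Build (but do not send) a PATCH that changes only the origin host/port.

    Assumption: fields omitted from the body (user, password, database)
    are left unchanged by the API.
    """
    url = f"{API_BASE}/accounts/{account_id}/hyperdrive/configs/{config_id}"
    body = json.dumps({"origin": {"host": host, "port": port}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen(...)` would actually send it.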
