Dangling Execution Issue
Been having a weird prod issue, so I want to see if anyone has some bright ideas.
We run a load-balanced cluster (3/3) for a Spring app, and recently when we switch traffic from A -> B, we get a random spike in DB query times, even though there are no DB-related changes in cluster B's code.
Does Spring have any known issues where a connection isn't closed if the HTTP session/request is suddenly killed, for example by a load balancer switching traffic? Asking because the usual fix is to cycle the servers.
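If the theory is a leaked connection, and assuming the app is Spring Boot with the default HikariCP pool (not confirmed in the thread), leak detection can be switched on with a single property: it logs a stack trace whenever a connection is held out of the pool longer than the threshold. Values below are illustrative:

```properties
# Assumes Spring Boot + HikariCP (the default pool); values are illustrative.
# Log a warning + stack trace if a connection is borrowed for more than 10s.
spring.datasource.hikari.leak-detection-threshold=10000
# Optional: retire pooled connections after 30 minutes regardless.
spring.datasource.hikari.max-lifetime=1800000
```

That would at least show which code path is holding connections across the traffic swap.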
46 Replies
This post has been reserved for your question.
Hey @Crain! Please use /close or the Close Post button above when you're finished. Please remember to follow the help guidelines. This post will be automatically closed after 300 minutes of inactivity.
TIP: Narrow down your issue to simple and precise questions to maximize the chance that others will reply in here.
are you sure this just isn't due to cache misses
on server B
I'm positive.
"a random spike in DB query times": on the DB side or on the Spring side? And which server receives the high DB query times? What do you mean by dangling executions? And did it only happen once, or does it happen regularly?
Query time spike, so I assumed DB execution, but it could be Spring-side.
Restarting one specific server fixed the issue, but the times could have leaked to others.
The dangling execution was mostly "could this have been caused by a connection not being freed, causing a table lock?"
It's happened once a week for the past 2 weeks. Only on traffic swaps.
Do you have any information on where exactly this occurred?
e.g. which endpoint, which SQL statement, which repository method, etc?
Do you have any DB-side metrics showing you response times etc?
No DB side metrics, that I know of at least.
We do have the endpoints/tables though. I'm thinking we just have to optimize those, or toss in some NOLOCK hints :/
Is the problem happening in the processing of an endpoint?
if so, is it always the same endpoint?
Or always one of a few endpoints?
same with SQL queries/tables/repository methods
We're not sure, these endpoints are just the main ones associated with the increased query times.
So not sure if it's the cause or a side-effect, but I'd assume a mixture of both.
Did you try setting up the same load balancer with the same services locally?
It's an Azure load balancer, so I can't set it up locally. But I run the service locally basically every day, since it's the main application I work with.
Is it actually a pure Spring/JPA exception or does it come from (caused by section) something else (e.g. the DB)?
It's not an exception, it's just a massive query time spike.
Although to be fair... I don't know if we've configured a query timeout
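For reference, a global JPA-level query timeout can be configured, assuming Spring Boot with Hibernate as the provider (the property name is `jakarta.persistence.query.timeout` on newer stacks, `javax.persistence.query.timeout` on older ones); the value is in milliseconds and the 5s below is just an example:

```properties
# Illustrative 5s cap per JPA query; tune to your workload.
spring.jpa.properties.jakarta.persistence.query.timeout=5000
```

That wouldn't fix the spike itself, but it would stop a single stuck query from pinning a connection indefinitely.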
with query time, you mean the time per query, right?
Correct
I assume all services go to the same DB?
Correct
and the load balancer is only about the connection between clients and the services and the connection between services and DB is direct?
Only Client -> Server, we directly connect to the DB, correct.
are the query spikes averages?
Yes
Not sure if we can get the individuals, that would be a good idea though
Which server are the query spikes occurring at? The server that is switched from or the server which is switched to?
Switched to
maybe you can get medians and/or quantiles
Which we found out after I made this post, to be clear, so "dangling execution" is incorrect
Oooo, quartiles are a good idea
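To illustrate why averages hide this: a handful of slow outliers can dominate the mean while the median stays flat. A minimal sketch using nearest-rank percentiles over raw per-query durations (the sample numbers are made up):

```java
import java.util.Arrays;

public class QueryTimePercentiles {
    // Nearest-rank percentile: input must be sorted ascending, durations in ms.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Eight fast queries plus two slow outliers.
        long[] times = {12, 14, 15, 13, 900, 16, 11, 14, 15, 850};
        Arrays.sort(times);
        double mean = Arrays.stream(times).average().orElse(0);
        // mean = 186.0, but p50 = 14: the average is dominated by two outliers.
        System.out.println("mean=" + mean);
        System.out.println("p50=" + percentile(times, 50));
        System.out.println("p95=" + percentile(times, 95));
    }
}
```

With metrics like these per server, it would be clear whether every query slowed down after the swap or just a few got stuck.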
How massive are the spikes?
Double the request time or is it more?
Feedback loop style. You can see where we rebooted the bad server.
and where did you switch?
?
where in the graph did you switch the services?
like where exactly did it start receiving more load?
At the spike, it reacted within a minute or two
Like it hit fast; I mostly showed the earlier window to include the Y-axis info*
you switched from service A to service B
where on the x-axis did this happen?
Where the graph spikes
It was literally within a minute of the switch.
ok so at the beginning
Do you have some sort of session affinity configured?
Through the load balancer.
Do you have some sort of test environment?
Several
which is also using that load balancer?
No
Is it viable to create a copy of that load balancer (maybe with an internal IP or whatever) for a test environment?
It's an Azure Gateway, so it's easy enough. We have a copy we use for one of the test environments
idk Azure Gateway
maybe try getting one test environment to be similar to the production environment
and then do the switch that causes problems
and maybe enable more detailed logging or monitoring for the test environment
although first try to reproduce it, then enable logging and reproduce it again
maybe Spring allows you to configure monitoring in a way that shows you what happens to each individual request
which could be viable for a test environment
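On the Spring side, per-statement visibility can often be had without code changes, assuming Hibernate as the JPA provider (the log volume makes this suitable for a test environment only):

```properties
# Log every SQL statement Hibernate issues.
logging.level.org.hibernate.SQL=DEBUG
# Collect per-session statistics (query counts, time spent).
spring.jpa.properties.hibernate.generate_statistics=true
```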
Yeah, we were finally able to isolate the instance, so we'll investigate that.
Thanks for the ideas though, I'll definitely try some of them out
If you are finished with your post, please close it.
If you are not, please ignore this message.
Note that you will not be able to send further messages here after this post has been closed, but you will be able to create new posts.
Post Closed
This post has been closed by <@190262684082503680>.