Dangling Execution Issue

We've been having a weird prod issue, so I want to see if anyone has some bright ideas. We run a load-balanced cluster (3/3) for a Spring app, and recently, whenever we switch traffic from A -> B, we get a random spike in DB query times, even when there are no DB changes in the B cluster's code. Does Spring have any issues where a connection isn't closed if the HTTP session/request is suddenly killed, for example by a load balancer switching traffic? I ask because the usual fix is to cycle the servers.
46 Replies
straightface
straightface2y ago
Are you sure this isn't just due to cache misses on server B?
Crain
CrainOP2y ago
I'm positive.
dan1st
dan1st2y ago
a random spike in DB query times
On the DB side or on the Spring side? And which server sees the high DB query times? What do you mean by dangling exceptions? And did it only happen once, or does it happen regularly?
Crain
CrainOP2y ago
Query time spike. So we assumed DB execution, but it could be on the Spring side. Restarting one specific server fixed the issue, but the times could have leaked to the others. The "dangling execution" part was mostly "Could this have been caused by a connection not being freed, causing a table lock?" It's happened once a week for the past 2 weeks. Only on traffic swaps.
dan1st
dan1st2y ago
Do you have any information on where exactly this occurred? E.g. which endpoint, which SQL statement, which repository method, etc.? Do you have any DB-side metrics showing you response times etc.?
Crain
CrainOP2y ago
No DB side metrics, that I know of at least. We do have the endpoints/tables though. I'm thinking we just have to optimize those, or toss in some No-Locks :/
dan1st
dan1st2y ago
Is the problem happening in the processing of an endpoint? if so, is it always the same endpoint? Or always one of a few endpoints? same with SQL queries/tables/repository methods
Crain
CrainOP2y ago
We're not sure, these endpoints are just the main ones associated with the increased query times. So not sure if it's the cause, or a side-effect. But I'd assumed a mixture of both.
dan1st
dan1st2y ago
Did you try setting up the same load balancer with the same services locally?
Crain
CrainOP2y ago
It's an Azure load balancer, so I can't set it up locally. But I run the service locally basically every day, since it's the main application I work with.
dan1st
dan1st2y ago
Is it actually a pure Spring/JPA exception or does it come from (caused by section) something else (e.g. the DB)?
Crain
CrainOP2y ago
It's not an exception, it's just a massive query time spike. Although to be fair... I don't know if we've configured a query timeout
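On the query-timeout point: a minimal sketch of setting one per statement with plain JDBC, assuming direct access to a `java.sql.Connection`. The class and method names (`QueryTimeoutSketch`, `countRows`) are hypothetical and not taken from the application discussed here. `Statement.setQueryTimeout` takes seconds, and how strictly it is enforced depends on the JDBC driver, so treat it as a safety net rather than a diagnosis.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper: runs a query with an upper bound on execution time.
final class QueryTimeoutSketch {

    // Counts the rows returned by `sql`, asking the driver to give up after
    // `timeoutSeconds`. A value of 0 means "no limit" in JDBC.
    static int countRows(Connection connection, String sql, int timeoutSeconds)
            throws SQLException {
        // try-with-resources closes the statement and result set even when
        // the query fails or times out, so nothing is left dangling here.
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setQueryTimeout(timeoutSeconds);
            try (ResultSet resultSet = statement.executeQuery()) {
                int rows = 0;
                while (resultSet.next()) {
                    rows++;
                }
                return rows;
            }
        }
    }
}
```

With a timeout in place, a query stuck behind a lock would typically fail with an exception after the limit instead of silently inflating the average query time, which makes the affected statement much easier to identify in logs.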
dan1st
dan1st2y ago
with query time, you mean the time per query, right?
Crain
CrainOP2y ago
Correct
dan1st
dan1st2y ago
I assume all services go to the same DB?
Crain
CrainOP2y ago
Correct
dan1st
dan1st2y ago
and the load balancer is only about the connection between clients and the services and the connection between services and DB is direct?
Crain
CrainOP2y ago
Only Client -> Server, we directly connect to the DB, correct.
dan1st
dan1st2y ago
are the query spikes averages?
Crain
CrainOP2y ago
Yes
Not sure if we can get the individual values, that would be a good idea though
dan1st
dan1st2y ago
Which server are the query spikes occurring on? The server that is switched from or the server that is switched to?
Crain
CrainOP2y ago
Switched to
dan1st
dan1st2y ago
maybe you can get medians and/or quantiles
Crain
CrainOP2y ago
Which we found out after I made this post, to be clear, so "dangling execution" is incorrect
Oooo, quartiles are a good idea
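To illustrate why medians and quantiles help here: a minimal sketch, assuming individual query durations in milliseconds can be collected at all. The sample values and the class name `LatencyQuantiles` are made up for illustration, and the nearest-rank method is just one common way to define a percentile.

```java
import java.util.Arrays;

// Hypothetical helper comparing the mean with nearest-rank percentiles.
final class LatencyQuantiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    static double mean(long[] samplesMillis) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long sum = 0;
        for (long sample : samplesMillis) {
            sum += sample;
        }
        return (double) sum / samplesMillis.length;
    }

    public static void main(String[] args) {
        // Nine fast queries and one very slow one (made-up numbers).
        long[] samples = {12, 15, 11, 14, 13, 12, 16, 14, 13, 4000};
        System.out.println("mean = " + mean(samples));           // 412.0
        System.out.println("p50  = " + percentile(samples, 50)); // 13
        System.out.println("p90  = " + percentile(samples, 90)); // 16
        System.out.println("p99  = " + percentile(samples, 99)); // 4000
    }
}
```

In this made-up data set a single slow query drags the mean to 412 ms while the median stays at 13 ms. If the real spike looks like this, a few queries are stuck; if the median rises as well, every query on that server is slow.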
dan1st
dan1st2y ago
How massive are the spikes? Double the request time or is it more?
Crain
CrainOP2y ago
Feedback loop style. You can see where we rebooted the bad server.
[Attached image: graph, not preserved]
dan1st
dan1st2y ago
and where did you switch?
Crain
CrainOP2y ago
?
dan1st
dan1st2y ago
where in the graph did you switch the services? like where exactly did it start receiving more load?
Crain
CrainOP2y ago
At the spike, it reacted within a minute or two
Like it hit fast, I mostly showed the part before it to include the Y axis info
dan1st
dan1st2y ago
You switched from service A to service B
Where on the x-axis did this happen?
Crain
CrainOP2y ago
Where the graph spikes
[Attached image: graph, not preserved]
Crain
CrainOP2y ago
It was literally within a minute of the switch.
dan1st
dan1st2y ago
OK, so at the beginning
Do you have some sort of session affinity configured?
Crain
CrainOP2y ago
Through the load balancer.
dan1st
dan1st2y ago
Do you have some sort of test environment?
Crain
CrainOP2y ago
Several
dan1st
dan1st2y ago
which is also using that load balancer?
Crain
CrainOP2y ago
No
dan1st
dan1st2y ago
Is it viable to create a copy of that load balancer (maybe with an internal IP or whatever) for a test environment?
Crain
CrainOP2y ago
It's an Azure Gateway, so it's easy enough. We have a copy we use for one of the test environments
dan1st
dan1st2y ago
I don't know Azure Gateway
Maybe try getting one test environment to be similar to the production environment, then do the switch that causes problems, and maybe enable more detailed logging or monitoring for the test environment
Although first try to reproduce it, then enable logging and reproduce it again
Maybe Spring allows you to configure monitoring in a way that shows you what happens to each individual request, which could be viable for a test environment
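As a rough idea of what per-request visibility could look like in a test environment: a minimal sketch in plain Java that records the duration of each individual call instead of an average. The names `QueryTimingLog`, `timed`, and `snapshot` are hypothetical. In a real Spring application an existing metrics or tracing tool would likely be a better fit, and this is not how the application in this thread is instrumented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper that keeps one timing sample per call.
final class QueryTimingLog {

    // One recorded call: which operation ran and how long it took.
    static final class Sample {
        final String label;
        final long nanos;

        Sample(String label, long nanos) {
            this.label = label;
            this.nanos = nanos;
        }
    }

    private final List<Sample> samples =
            Collections.synchronizedList(new ArrayList<>());

    // Runs `work`, records its duration under `label`, and returns its result.
    // The sample is recorded even if `work` throws.
    <T> T timed(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            samples.add(new Sample(label, System.nanoTime() - start));
        }
    }

    // Returns a copy of everything recorded so far.
    List<Sample> snapshot() {
        synchronized (samples) {
            return new ArrayList<>(samples);
        }
    }
}
```

Wrapping a repository call as `log.timed("findOrders", () -> repository.findOrders())` (both names invented here) would give one sample per call, which is the raw data needed to tell a few stuck queries apart from a uniformly slow server.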
Crain
CrainOP2y ago
Yeah, we were finally able to isolate the instance, so we'll investigate that. Thanks for the ideas though, I'll definitely try some of them out
