Dangling Execution Issue
Been having a weird prod issue, so I want to see if anyone has some bright ideas.
We run a load-balanced cluster (3/3) for a Spring app, and recently when we switch traffic from A -> B, we get a random spike in DB query times, even though there are no DB-related changes in cluster B's code.
Does Spring have any known issues where a connection isn't closed if the HTTP session/request is suddenly killed, for example by a load balancer switching traffic? Asking because the usual fix is to cycle the servers.
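If the theory is a leaked connection, and assuming the app is Spring Boot with the default HikariCP pool (not confirmed in the thread), leak detection can be switched on with a single property: it logs a stack trace whenever a connection is held out of the pool longer than the threshold. Values below are illustrative:

```properties
# Assumes Spring Boot + HikariCP (the default pool); values are illustrative.
# Log a warning + stack trace if a connection is borrowed for more than 10s.
spring.datasource.hikari.leak-detection-threshold=10000
# Optional: retire pooled connections after 30 minutes regardless.
spring.datasource.hikari.max-lifetime=1800000
```

That would at least show which code path is holding connections across the traffic swap.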
46 Replies
This post has been reserved for your question.
Hey @Crain! Please use /close or the Close Post button above when you're finished. Please remember to follow the help guidelines. This post will be automatically closed after 300 minutes of inactivity.
TIP: Narrow down your issue to simple and precise questions to maximize the chance that others will reply in here.
are you sure this just isn't due to cache misses
on server B
I'm positive.
"a random spike in DB query times": on the DB side or on the Spring side? And which server receives the high DB query times? What do you mean by dangling executions? And did it only happen once, or does it happen regularly?
Query time spike, so I assumed DB execution, but it could be Spring-side.
Restarting one specific server fixed the issue, but the times could have leaked to others.
The dangling execution was mostly "could this have been caused by a connection not being freed, causing a table lock?"
It's happened once a week for the past 2 weeks. Only on traffic swaps.
Do you have any information on where exactly this occurred?
e.g. which endpoint, which SQL statement, which repository method, etc?
Do you have any DB-side metrics showing you response times etc?
No DB side metrics, that I know of at least.
We do have the endpoints/tables though. I'm thinking we just have to optimize those, or toss in some NOLOCK hints :/
Is the problem happening in the processing of an endpoint?
if so, is it always the same endpoint?
Or always one of a few endpoints?
same with SQL queries/tables/repository methods
We're not sure, these endpoints are just the main ones associated with the increased query times.
So not sure if it's the cause or a side-effect, but I'd assume a mixture of both.
Did you try setting up the same load balancer with the same services locally?
It's an Azure load balancer, so I can't set it up locally. But I run the service locally basically every day, since it's the main application I work with.
Is it actually a pure Spring/JPA exception or does it come from (caused by section) something else (e.g. the DB)?
It's not an exception, it's just a massive query time spike.
Although to be fair... I don't know if we've configured a query timeout
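For reference, a global JPA-level query timeout can be configured, assuming Spring Boot with Hibernate as the provider (the property name is `jakarta.persistence.query.timeout` on newer stacks, `javax.persistence.query.timeout` on older ones); the value is in milliseconds and the 5s below is just an example:

```properties
# Illustrative 5s cap per JPA query; tune to your workload.
spring.jpa.properties.jakarta.persistence.query.timeout=5000
```

That wouldn't fix the spike itself, but it would stop a single stuck query from pinning a connection indefinitely.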
with query time, you mean the time per query, right?
Correct
I assume all services go to the same DB?
Correct
and the load balancer is only about the connection between clients and the services and the connection between services and DB is direct?
Only Client -> Server, we directly connect to the DB, correct.
are the query spikes averages?
Yes
Not sure if we can get the individuals, that would be a good idea though
Which server are the query spikes occurring at? The server that is switched from or the server which is switched to?
Switched to
maybe you can get medians and/or quantiles
Which we found out after I made this post, to be clear, so "dangling execution" is incorrect
Oooo, quartiles are a good idea
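To illustrate why averages hide this: a handful of slow outliers can dominate the mean while the median stays flat. A minimal sketch using nearest-rank percentiles over raw per-query durations (the sample numbers are made up):

```java
import java.util.Arrays;

public class QueryTimePercentiles {
    // Nearest-rank percentile: input must be sorted ascending, durations in ms.
    static long percentile(long[] sorted, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        // Eight fast queries plus two slow outliers.
        long[] times = {12, 14, 15, 13, 900, 16, 11, 14, 15, 850};
        Arrays.sort(times);
        double mean = Arrays.stream(times).average().orElse(0);
        // mean = 186.0, but p50 = 14: the average is dominated by two outliers.
        System.out.println("mean=" + mean);
        System.out.println("p50=" + percentile(times, 50));
        System.out.println("p95=" + percentile(times, 95));
    }
}
```

With metrics like these per server, it would be clear whether every query slowed down after the swap or just a few got stuck.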
How massive are the spikes?
Double the request time or is it more?
Feedback loop style. You can see where we rebooted the bad server.
and where did you switch?
?
where in the graph did you switch the services?
like where exactly did it start receiving more load?
At the spike, it reacted within a minute or two
Like it hit fast; I mostly showed the earlier window to include the Y-axis info*
you switched from service A to service B
where on the x-axis did this happen?
Where the graph spikes
It was literally within a minute of the switch.
ok so at the beginning
Do you have some sort of session affinity configured?
Through the load balancer.
Do you have some sort of test environment?
Several
which is also using that load balancer?
No
Is it viable to create a copy of that load balancer (maybe with an internal IP or whatever) for a test environment?
It's an Azure Gateway, so it's easy enough. We have a copy we use for one of the test environments
idk Azure Gateway
maybe try getting one test environment to be similar to the production environment
and then do the switch that causes problems
and maybe enable more detailed logging or monitoring for the test environment
although first try to reproduce it, then enable logging and reproduce it again
maybe Spring allows you to configure monitoring in a way that shows you what happens to each individual request
which could be viable for a test environment
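On the Spring side, per-statement visibility can often be had without code changes, assuming Hibernate as the JPA provider (the log volume makes this suitable for a test environment only):

```properties
# Log every SQL statement Hibernate issues.
logging.level.org.hibernate.SQL=DEBUG
# Collect per-session statistics (query counts, time spent).
spring.jpa.properties.hibernate.generate_statistics=true
```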
Yeah, we were finally able to isolate the instance, so we'll investigate that.
Thanks for the ideas though, I'll definitely try some of them out
If you are finished with your post, please close it.
If you are not, please ignore this message.
Note that you will not be able to send further messages here after this post has been closed, but you will be able to create new posts.
Post Closed
This post has been closed by <@190262684082503680>.