TinkerPop Server OOM
Hi TinkerPop team,
I'm trying to make sense of an OOM that seems to consistently occur in my environment, usually over the course of a couple of hours.
Attached is a screenshot of the JVM GC metrics showing behavior before & after a GC. It's almost like the underlying live memory continues to grow, but I'm not sure why.
Reviewing a heap dump from a different OOM showed that about 8.3GB was consumed by Netty's NioSocketChannels, but drilling deeper it seems like it is instances of org.apache.tinkerpop.gremlin.process.traversal.Bytecode.
Which got me wondering: is there some kind of "close" that clients are supposed to send?
I'm using an unofficial Rust Gremlin driver, and I'm wondering if it's missing some housekeeping that's causing my JanusGraph instance to accumulate unclosed resources until it dies.
The client is sending up bytecode-based traversals using GraphSON V3, and my understanding is that the TinkerPop server is supposed to receive these, execute them, send a response back (if needed), and then "that's it". Based on the heap dump I'm assuming that is congruent with seeing SingleTaskSession instances on the JanusGraph side.
The 43k SingleTaskSessions in the heap dump were unexpected. My client application at the moment should have at most around 12 connections, and the Rust driver library doesn't appear to multiplex connections to allow multiple requests to go out on the same connection concurrently.
OTOH I noticed these seemed to be under a Netty Channel's "CloseFuture". It seems unlikely, but is it possible I'm submitting traversal requests faster than they can be cleaned up? If so, is there a configuration setting to turn that up? I'm aware of gremlinPool and have that turned up. I tried changing threadPoolWorker, but that didn't seem to change things either.
8 Replies
If it's helpful, the traversals are mostly upserts.
I build a list where each element is a map of maps, and each inner map corresponds to something that gets selected later on. The list is injected into the traversal and then unfolded, which puts each map of maps into the traversal. One map is selected for a mergeV step, and the returned vertex is given an alias. I then drop all of the vertex's properties (so the vertex has exactly what was in the injected map). The other maps are then selected out for setting properties as side effects, depending on their needed cardinalities, and lastly a map is selected out that indicates edges to draw to other vertices and properties to set on those edges.
The two utility methods used in the side effects are in the attachments below.
And then a follow-up mutation to the traversal adds the edge-creation portion.
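In case the attachments don't come through, here's a rough sketch of the shape I'm describing. The sub-map keys ("search", "edges") and the two utility method names are placeholders rather than my actual code:

```java
import java.util.List;
import java.util.Map;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;

public class UpsertSketch {

    // One batched upsert: inject the list, unfold each map-of-maps, mergeV on the
    // lookup sub-map, wipe existing properties, then re-apply properties as side effects.
    static void upsert(final GraphTraversalSource g,
                       final List<Map<String, Map<Object, Object>>> batch) {
        g.inject(batch).unfold().as("row")
         .mergeV(__.select("search"))              // upsert keyed on the "search" sub-map
         .as("v")
         .sideEffect(__.properties().drop())       // reset to exactly what was injected
         .sideEffect(setSingleCardinalityProps())  // placeholder for one utility method
         .sideEffect(setOtherCardinalityProps())   // placeholder for the other utility method
         // a follow-up mutation of the traversal selects an "edges" sub-map and adds the edges
         .iterate();
    }

    // Placeholders standing in for the real utility methods that walk a sub-map
    // and set vertex properties with the appropriate cardinality.
    static GraphTraversal<Object, Object> setSingleCardinalityProps() { return __.identity(); }
    static GraphTraversal<Object, Object> setOtherCardinalityProps() { return __.identity(); }
}
```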
So I spent some more time reviewing this, and I noticed something when I started tracing GC roots. It seems like the only thing keeping these around is their close futures.
Looking at what I believe is the relevant code, that does seem to line up: https://github.com/apache/tinkerpop/blob/b7c9ddda16a3d059b2a677f578f131a7124187b6/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/handler/AbstractSession.java#L166-L170
That is, until the channel closes there doesn't seem to be anything that removes the session's listener from the channel's close future listener list. And that listener holds a reference to the session and, transitively, to the Bytecode that seems to be building up and OOMing me.
Which seems to mean that if the channel doesn't close, these should accumulate. The Rust library does maintain a pool of websocket connections, so that lines up with the behavior I'd expect: it just serializes the request, sends it up the channel, receives the response, and then puts the websocket connection back in the pool.
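To illustrate what I mean, here's a simplified stand-in for the pattern (not the actual TinkerPop code, just the retention shape I think I'm seeing):

```java
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.embedded.EmbeddedChannel;

public class CloseFutureRetentionSketch {
    public static void main(String[] args) {
        final EmbeddedChannel channel = new EmbeddedChannel();

        // Stand-in for a SingleTaskSession holding the request's Bytecode.
        final byte[] session = new byte[1024 * 1024];

        // Registered so the session can be cancelled if the client disconnects;
        // note that the lambda captures 'session'.
        final ChannelFutureListener cancelOnClose = future ->
                System.out.println("channel closed, cancel session holding " + session.length + " bytes");
        channel.closeFuture().addListener(cancelOnClose);

        // ... the session's single task runs to completion here ...

        // Without something like the removeListener call below on the happy path,
        // the close future's listener list keeps 'session' reachable until the
        // channel itself closes, which with pooled websocket connections that are
        // rarely closed may be effectively never.
        // channel.closeFuture().removeListener(cancelOnClose);

        // Closing the channel is the only thing that finally runs the listener
        // and lets 'session' become unreachable.
        channel.close();
    }
}
```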
But when I try to reproduce this behavior using a toy Java application & the Java driver, the SingleTaskSessions don't seem to accumulate in the same way.
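For reference, the toy Java reproduction is essentially just this kind of loop against the official driver (host, port, and label are placeholders for my setup):

```java
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class ToyRepro {
    public static void main(String[] args) throws Exception {
        final Cluster cluster = Cluster.build("localhost").port(8182).create();
        final GraphTraversalSource g =
                traversal().withRemote(DriverRemoteConnection.using(cluster, "g"));

        // Fire a large number of bytecode-based requests over the driver's pooled
        // websocket connections, then check the server heap for SingleTaskSessions.
        for (int i = 0; i < 100_000; i++) {
            g.addV("demo").property("i", i).iterate();
        }

        g.close();
        cluster.close();
    }
}
```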
If any TinkerPop people see this thread, I'd be curious whether there are any hypotheses about the 43,536 SingleTaskSessions in the heap dump. I haven't changed maxWorkQueueSize from its default of 8192, which seems like it should have applied backpressure here if I was somehow spawning too many concurrent sessions. 🤔
But in the meantime I'm continuing to try to reproduce this locally, outside my staging environment. So far that's the only place the OOMs seem to be happening, but other than slower writes to the backing datastores there shouldn't be significant differences; it runs within the same container image as I use in my staging environment.
Solution
Sorry for the delayed response. I'll try to take a look at this soon. But for now, I just wanted to point out that SingleTaskSession and the like are part of the UnifiedChannelizer. From what I remember, the UnifiedChannelizer isn't quite production ready, and in fact is being removed in the next major version of TinkerPop. We can certainly still make bug/performance fixes to this part of the code for 3.7.x though.
Gotcha, I was a couple weeks into performance tuning in my Rust app when this started happening.
I'll try it with what I was using previously (WsAndHttpChannelizer) and see if it's different. 👍
I saw it was teed up as the default channelizer and thought maybe it was ready to try out, but it sounds like maybe I jumped the gun 😅
FTR you got it @Kennh. I totally forgot to replicate that locally; I changed the channelizer a long time ago in my deployment via an environment variable 🤦‍♂️.
But fwiw I'm now able to reproduce the OOM locally with both my Rust app and the toy Java app.
I guess the next step is to write up an issue report on the Jira after making an account, @Kennh?
I wouldn't be opposed to taking a stab at fixing it either, assuming my theory is correct that we're missing a listener removal on the happy path to remove the cancellation future.
"I guess the next step is to write up an issue report on the Jira after making an account"
Yea, that would be good for tracking.
"I wouldn't be opposed to taking a stab at fixing it either, assuming my theory is correct that we're missing a listener removal on the happy path"
We're always looking for contributions, so that would be great. That being said, I'm a bit concerned about this particular issue, since the channel holding references to results definitely shouldn't be happening. It might require a bigger change to properly fix, which might not be worth doing at this point.
Makes sense. Opened an issue: https://issues.apache.org/jira/browse/TINKERPOP-3113
"since the channel holding references to results definitely shouldn't be happening"
Fwiw, to be clear, it isn't holding a reference to the result, or at least not the result in the sense of the traversal's output, if I understand correctly. It's holding a reference to the session task that should be "done" and presumably GC'ed. But the CloseListener on the channel is given a reference to it, and that listener is never removed from what I can tell.
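To make that concrete, this is roughly the shape of the change I have in mind; the SessionTask interface and method names here are placeholders, not the actual AbstractSession members:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelFutureListener;
import io.netty.channel.embedded.EmbeddedChannel;

public class RemoveListenerOnHappyPathSketch {

    // Placeholder for the real session task type.
    interface SessionTask { void run(); void cancel(); }

    static void process(final Channel channel, final SessionTask task) {
        // Keep a handle to the listener so it can be removed later.
        final ChannelFutureListener cancelOnClose = future -> task.cancel();
        channel.closeFuture().addListener(cancelOnClose);
        try {
            task.run(); // execute the single task and write the response
        } finally {
            // Happy path: the task is done, so stop pinning it (and its Bytecode)
            // through the channel's close-future listener list.
            channel.closeFuture().removeListener(cancelOnClose);
        }
    }

    public static void main(String[] args) {
        process(new EmbeddedChannel(), new SessionTask() {
            public void run() { System.out.println("task executed"); }
            public void cancel() { System.out.println("task cancelled"); }
        });
    }
}
```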