Tinkerpop Server OOM

Hi Tinkerpop team, I'm trying to make sense of this OOMing that seems to consistently occur in my environment over the course of usually a couple hours. Attached is a screenshot of the JVM GC behavior metrics showing before & after a GC. It's almost like the underlying live memory continues to grow but I'm not sure why. Reviewing a heap dump from a different OOM showed that about 8.3GB was consumed by Netty's NioSocketChannels, but drilled deeper seems like it is instances of org.apache.tinkerpop.gremlin.process.traversal.Bytecode Which got me wondering, is there some kind of "close" clients are supposed to send? I'm using an unofficial Rust gremlin driver and I'm just wondering if it's missing some house keeping that's causing my JanusGraph instance to accumulate unclosed resources until it dies. The client is sending up bytecode based traversals using GraphSON V3 and my understanding is what the Tinkerpop Server is supposed to receive these, execute them, and then send a response back (if needed) and then "that's it". Based on the heap dump I'm assuming that is congruent with seeing SingleTaskSession instances on the JG side. The 43k SingleTaskSessions in the heap dump was unexpected. My client application at the moment should only have at most around 12 connections and the Rust driver library doesn't appear to multiplex the connections to allow multiple requests to go out on the same connection concurrently. OTOH I noticed these seemed to be under a Netty Channel's "CloseFuture". It seems unlikely but is it possible I'm submiting traversal requests faster than they can be cleaned up? If so, is there a configuration setting to turn that up? I'm aware of the gremlinPool and have that turned up. I tried changing threadPoolWorker but that didn't seem to change things either.
No description
No description
Solution:
Sorry for the delayed response. I'll try to take a look at this soon. But for now, I just wanted to point out that SingleTaskSession and the like are part of the UnifiedChannelizer. From what I remember, the UnifiedChannelizer isn't quite production ready, and in fact is being removed in the next major version of TinkerPop. We can certainly still make bug/performance fixes to this part of the code for 3.7.x though.
Jump to solution
8 Replies
criminosis
criminosisOP2mo ago
If it's helpful traversals are mostly upserts. I build a list, where each element is a map of maps, and each map corresponds to being selected later on. The list is then injected into the traversal and then unfolded. That then unfolds the map of maps into the traversal, and then one map is selected for a mergeV step, and then the returned vertex is given an alias. I then drop all the vertex's properties (so I can have on the vertex exactly what was in the injected map). Then the other maps are then selected out for setting properties as side effects depending on their needed cardinalities, and then lastly a map is selected out that indicates edges to draw to other vertices, and properties to set on that edge.
client
.get_traversal()
.inject(upsert_batch)
//Unfold each vertex's payload into the object stream (individually)
.unfold()
.as_("payload")
//Upsert the vertex based on the given lookup map
.merge_v(__.select("lookup"))
.as_("v")
//Drop all properties that were there, we'll want to assign everything that is given
.side_effect(__.properties(()).drop())
//First up are the simple single cardinality properties, just assign each on that is given
.side_effect(build_single_prop_side_effect())
//Next are the set properties. Similar to the single cardinality properties but we have to unfold the nested collection of values
.side_effect(build_set_prop_side_effect())
client
.get_traversal()
.inject(upsert_batch)
//Unfold each vertex's payload into the object stream (individually)
.unfold()
.as_("payload")
//Upsert the vertex based on the given lookup map
.merge_v(__.select("lookup"))
.as_("v")
//Drop all properties that were there, we'll want to assign everything that is given
.side_effect(__.properties(()).drop())
//First up are the simple single cardinality properties, just assign each on that is given
.side_effect(build_single_prop_side_effect())
//Next are the set properties. Similar to the single cardinality properties but we have to unfold the nested collection of values
.side_effect(build_set_prop_side_effect())
The two utility methods in the side effects are below
fn build_single_prop_side_effect() -> TraversalBuilder {
__.select("payload")
.select("single_props")
//This unfolds each of the single cardinality properties into their key value pairs into the object stream
.unfold()
.as_("kv")
.select("v")
.property(
//These pull apart the key value pair in the object stream as parameters to property()
__.select("kv").by(Column::Keys),
__.select("kv").by(Column::Values),
)
}
fn build_single_prop_side_effect() -> TraversalBuilder {
__.select("payload")
.select("single_props")
//This unfolds each of the single cardinality properties into their key value pairs into the object stream
.unfold()
.as_("kv")
.select("v")
.property(
//These pull apart the key value pair in the object stream as parameters to property()
__.select("kv").by(Column::Keys),
__.select("kv").by(Column::Values),
)
}
fn build_set_prop_side_effect() -> TraversalBuilder {
__.select("payload")
.select("set_props")
//This unfolds into the object stream "key":["value",...]
.unfold()
.as_("kvals")
//This separates out the individual values into the object stream (without the key)
.select("kvals")
.by(Column::Values)
.unfold()
.as_("value")
//now select the vertex again back into the object stream and apply these as separate entries
.select("v")
.property_with_cardinality(
Cardinality::Set,
//This pulls the key in from further up the object stream
__.select("kvals").by(Column::Keys),
//This pulls in the value that was before the vertex in the object stream
__.select("value"),
)
}
fn build_set_prop_side_effect() -> TraversalBuilder {
__.select("payload")
.select("set_props")
//This unfolds into the object stream "key":["value",...]
.unfold()
.as_("kvals")
//This separates out the individual values into the object stream (without the key)
.select("kvals")
.by(Column::Values)
.unfold()
.as_("value")
//now select the vertex again back into the object stream and apply these as separate entries
.select("v")
.property_with_cardinality(
Cardinality::Set,
//This pulls the key in from further up the object stream
__.select("kvals").by(Column::Keys),
//This pulls in the value that was before the vertex in the object stream
__.select("value"),
)
}
And then a follow-up mutation to the traversal adds the edge creation portion
traversal.side_effect(
__.select("payload")
.select("in_edges")
.unfold()
.as_("edge_map")
.merge_e(__.select("edge_map"))
.option((Merge::InV, __.select("v")))
.property("last_modified_timestamp", Utc::now()),
)
traversal.side_effect(
__.select("payload")
.select("in_edges")
.unfold()
.as_("edge_map")
.merge_e(__.select("edge_map"))
.option((Merge::InV, __.select("v")))
.property("last_modified_timestamp", Utc::now()),
)
criminosis
criminosisOP2mo ago
So I spent some more time reviewing this and I noticed something when I started tracing GC routes. It seems like the only thing keeping these around is their close futures
No description
criminosis
criminosisOP2mo ago
Looking at, what I believe is the relevant code it does seem like that does line up https://github.com/apache/tinkerpop/blob/b7c9ddda16a3d059b2a677f578f131a7124187b6/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/handler/AbstractSession.java#L166-L170 In that until the channel closes, there doesn't (seem?) to be something that removes the listener for the session from the channel's close future listener list. And that listener holds a reference to the session and transitively, the Bytecode that seems to be building up and OOMing me.
GitHub
tinkerpop/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin...
Apache TinkerPop - a graph computing framework. Contribute to apache/tinkerpop development by creating an account on GitHub.
criminosis
criminosisOP2mo ago
Which seems like if the channel doesn't close, these should accumulate. The Rust library does maintain a pool of websocket connections so that does seem to line up with the behavior I'd expect. It purely does just serialize the request, send it up the channel, receive the response, and then puts the websocket connection back in the pool. But when I try to reproduce this behavior using a toy Java application & the Java driver the SingleTaskSessions don't seem to accumulate in the same way. If any Tinkerpop people see this thread, I'd be curious if there's any hypothesises with regards to the 43,536 single tasks sessions in the heap dump. I haven't changed maxWorkQueueSize from its default of 8192 which seems like it should have applied backpressure here if I was somehow spawning too many concurrent sessions. 🤔 But in the meantime continuing to try to reproduce this locally outside my staging environment. So far that's the only place the OOMs seem to be happening, but other than slower writes to backing datastores there shouldn't be significant differences. It runs within the same container image as I use in my staging environment.
Solution
Kennh
Kennh2mo ago
Sorry for the delayed response. I'll try to take a look at this soon. But for now, I just wanted to point out that SingleTaskSession and the like are part of the UnifiedChannelizer. From what I remember, the UnifiedChannelizer isn't quite production ready, and in fact is being removed in the next major version of TinkerPop. We can certainly still make bug/performance fixes to this part of the code for 3.7.x though.
criminosis
criminosisOP2mo ago
Gotcha, I was a couple weeks into performance tuning in my Rust app when this started happening. I'll try it with what I was using previously (WsAndHttpChannelizer) and see if it's different. 👍 I saw it was teeing up as the default channelizer, thought maybe it was ready to try out, but sounds like maybe I jumped the gun 😅 FTR you got it @Kennh . I totally forgot to replicate that locally. I changed the channelizer a long time ago in my deployment via environment variable 🤦‍♂️. But fwiw I'm able to reproduce the OOM now locally with both my Rust and toy Java app. I guess are next steps to write up an issue report on the Jira after making an account @Kennh? I wouldn't be opposed to taking a stab at trying to fix it either assuming my theory is correct that we're missing a listener removal for the happy path to remove the cancelation future.
Kennh
Kennh2mo ago
I guess are next steps to write up an issue report on the Jira after making an account
Yea, that would be good for tracking.
I wouldn't be opposed to taking a stab at trying to fix it either assuming my theory is correct that we're missing a listener removal for the happy path to remove the cancelation future.
We're always looking for contributions so that would be great. That being said, I'm a bit concerned about this particular issue though since the channel holding references to results definitely shouldn't be happening. Might require a bigger change to properly fix it, which might not be worth doing at this point.
criminosis
criminosisOP2mo ago
Makes sense. Opened an issue: https://issues.apache.org/jira/browse/TINKERPOP-3113
since the channel holding references to results definitely shouldn't be happening.
Fwiw to be clear it isn't holding a reference to the result, or at least result as the output if I understand correctly. It's holding a reference to the session task that should be "done" and presumably GC'ed. But the CloseListener on the channel is given a reference to it and the listener is never removed from what I can tell.
Want results from more Discord servers?
Add your server