[parameterized queries] Increased time in query evaluation when gremlin server starts/restarts
Hi folks,
High latency is observed in query evaluation whenever the janusgraph server starts/restarts, and the degradation lasts for at least 5 minutes.
* I'm using parameterized queries for the janusgraph server requests, so I expect some increased latency whenever the server starts/restarts, but the issue is that the degradation does not go away for at least 5 minutes and the evaluation latency goes from around 300 ms to 5k ms.
* Janusgraph is deployed in a kubernetes cluster with 20 pods, so every time I redeploy the janusgraph cluster this issue arises and results in timeouts on the client side.
Wanted to know if there is some other way to add all the parameterized queries to the cache, so that by the time a started/restarted janusgraph pod is ready to serve requests, all the parameterized queries are already in the cache.
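For context, the kind of parameterized request being described here would look roughly like this with the TinkerPop Java driver (the hostname, traversal, and binding names are placeholders, not the actual production queries):

```java
import java.util.Map;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.ResultSet;

public class ParameterizedRequest {
    public static void main(String[] args) throws Exception {
        // Connect to the JanusGraph/Gremlin Server endpoint (address is a placeholder).
        Cluster cluster = Cluster.build("janusgraph-service").port(8182).create();
        Client client = cluster.connect();

        // The script text stays constant and only the bindings change, so the
        // Groovy script engine can compile the script once and reuse the cached version.
        ResultSet rs = client.submit(
                "g.V().has('name', nameParam).valueMap()",
                Map.of("nameParam", "some-value"));
        System.out.println(rs.all().get());

        cluster.close();
    }
}
```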
The time for query evaluation goes up to 30k ms on pod starts/restarts and remains high for around 5 minutes before settling back to previous levels.
The panel above shows response times for a janusgraph cluster with 20 janusgraph instances.
@KelvinL is there a way this can be avoided? TIA.
i'm not sure that i can think of any other way to do this than the obvious one: your application has to send the queries you wish to cache to the server when it starts up. as there is no shared cache, you will need to be sure that the queries are sent to each node in the cluster. the only other thing i can think of that you could try would be to write this pre-caching function into the init script for the server (which would execute on each node) and do it from there. the one catch is that the init script execution happens before the server is fully ready to accept requests, so you'd have to spawn a thread (or maybe start a separate application??) that tests for when the server is fully started and then sends all your pre-caching requests.
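A rough sketch of that application-side warm-up, assuming each pod can be addressed individually (the hostnames, port, scripts, and bindings below are placeholders):

```java
import java.util.List;
import java.util.Map;

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ScriptCacheWarmer {
    // The parameterized scripts the application uses; sending each one once is
    // enough for the Groovy engine on that node to compile and cache it.
    private static final List<String> SCRIPTS = List.of(
            "g.V().has('name', nameParam).valueMap()",
            "g.V().has('name', nameParam).out('knows').limit(10)");

    public static void main(String[] args) throws Exception {
        // There is no shared cache, so hit every pod, not just the cluster's service address.
        List<String> pods = List.of("janusgraph-0", "janusgraph-1" /* ... one entry per pod */);

        for (String pod : pods) {
            Cluster cluster = Cluster.build(pod).port(8182).create();
            Client client = cluster.connect();
            for (String script : SCRIPTS) {
                // Dummy bindings are fine; the script cache is keyed by the script text.
                client.submit(script, Map.of("nameParam", "warm-up")).all().get();
            }
            cluster.close();
        }
    }
}
```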
@spmallette Thanks for the detailed approach. We thought of doing this at startup, but as the approach looks a bit hacky I was checking if there is any standard way of registering all the parameterized queries before using them.
Also, I observed that whenever a new pod starts up, the thread count increases rapidly from roughly 28 to around 1000. Can that be the reason for the increased evaluation time? The latency we are observing goes up to 5000 ms from a mere 50 ms.
The number of threads shouldn't exceed the size of the gremlinPool setting, unless you're using sessions, which technically doesn't have an upper bound at this point. i don't know if the thread increase is extending your evaluation time. i sense that it's more just query compilation time stacking up on you. switching from scripts to bytecode requests would resolve this caching issue if that is actually the problem. there is no cache for bytecode; it's simply faster (there's a rough sketch of that a little further down). and if you're very adventurous you could test out the GremlinLangScriptEngine instead of the groovy one. tests are showing it to be the fastest way to process Gremlin. unfortunately it does not yet handle parameters (there hasn't been a ton of need for that because there is no caching needed as there is for Groovy).

> unfortunately it does not yet handle parameters

We have designed the architecture around the usage of parameters in the query, so it won't be possible for now to switch away from it. btw I profiled the JVM and it turns out that these ~1000 threads which are getting created belong to StandardIDPool
> We have designed the architecture around the usage of parameters in the query, so it won't be possible for now to switch away from it.

that functionality isn't available yet, but will be soon. there will still be limitations, as it seemed important to restrict Gremlin a bit more than Groovy, but for standard uses like has('name', x) or property('name', x), which i assume is the kind of thing you're doing, it should work. hoping it lands with 3.7.0, though i'm not sure we'll replace it as the default in Gremlin Server for a long time.

> btw I profiled the JVM and it turns out that these ~1000 threads which are getting created belong to StandardIDPool

i don't know if that is normal offhand, as that is not a Gremlin Server thread pool. i believe it is a JanusGraph one. does anyone at @janusgraph know if it is expected to generate that many threads under these conditions?
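For reference, the bytecode-style request mentioned a few messages up would look roughly like this (the host, port, and property values are placeholders). Because the traversal is sent as bytecode rather than a Groovy script, there is no server-side compilation step and nothing to cache:

```java
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class BytecodeExample {
    public static void main(String[] args) throws Exception {
        // Remote traversal source; "g" must match the traversal source alias on the server.
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("janusgraph-service", 8182, "g"));

        // Plain Java values take the place of script parameters; the traversal is
        // serialized as bytecode, so no Groovy compilation happens on the server.
        String name = "some-value";
        System.out.println(g.V().has("name", name).valueMap().toList());

        g.close();
    }
}
```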
No, that does not sound normal. @shivam.choudhary Did you change the ids.num-partitions config to a large number?

@boxuanli I checked it and the value is set as 1024. As the config is fixed for the lifetime of the graph, we wanted to have the graph sufficiently partitioned so that we can create partitioned vertex labels for supernodes.
But as of now we haven't had the requirement to create a partitioned label. Are there issues which we might face due to this down the line?
Also, we use Bigtable as our storage backend; not sure how it helps here as we mostly have 1 or 2 bigtable nodes based on the traffic.
Setting that number to 1024 means you have 1024 threads for StandardIDPool
@spmallette I checked the longRunCompilationCount metric, which gives the count of events where the script compilation time was more than the expectedCompilationTime (I configured it as 100 milliseconds), but it came out to be 0. (Actually it was 1, due to a script which gets evaluated on startup by default and took around 2403 ms.)
This means that query compilation is not taking that much time, but the latency we are observing is still of the order of ~500 ms for a few minutes after startup.

@shivam.choudhary did the size of the StandardIDPool have anything to do with this problem?
I'm still trying to figure out a way to rule out whether StandardIDPool has anything to do with this problem. I have tried several things but nothing conclusive so far.
As the config which sets the size of the StandardIDPool cannot be changed, it is getting a bit challenging.
There is a difference between cluster.max-partitions and ids.num-partitions. On startup, each janusgraph server instance tries to retrieve 1024 blocks of ids (each of the size defined by ids.block-size), one for each of the ids.num-partitions partitions (the parameter name is quite confusing in my opinion). Each block allocation results in a lock on the id allocation table in Janus. So in your case, when you restart your instances, the table takes 20 x 1024 allocation requests and has to lock and verify for each block that it has not been allocated to another instance. What's your value for cluster.max-partitions?
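For anyone following along, the settings being discussed live in janusgraph.properties and look roughly like this (the values below are illustrative only, not the poster's actual configuration):

```properties
# Fixed once the graph has been initialised; controls the number of virtual partitions
# of the key space (relevant for partitioned vertex labels).
cluster.max-partitions = 32

# Number of id partitions; each instance fetches id blocks for each of these
# partitions on demand when it needs to assign ids.
ids.num-partitions = 10

# How many ids are reserved per block; larger blocks mean fewer (but bigger)
# allocation round-trips against the id allocation table.
ids.block-size = 10000
```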
Sorry, I had a mix-up last time: ids.num-partitions is set to 10 and cluster.max-partitions is set to 1024.
Have you tried starting your graph in read-only mode? It would disable the idPool stuff.
Can you give us your janusgraph.properties file?
It really looks like you have a bottleneck on startup with the idBlock retrieval.
Did you change your connection settings to your backend (number of connections, timeout...)?
No, we haven't changed anything related to the backend connection, but we did change ids.block-size when we were initially ingesting the data into the graph, as the data was huge. Please find the janusgraph.properties below:
Currently I'm working on setting up read-only janusgraph instances; I will be able to test it soon with the load we have on the current janusgraph instances.

I cannot remember the difference between ids.num-partitions and cluster.max-partitions, so I might have made a mistake. Sorry about that.
Btw, I don't think setting up a read-only janusgraph instance per se will make a difference. If you don't create any new data, the JanusGraph ID threads won't be created in the first place. So technically it doesn't matter whether you set read-only or not: as long as you don't attempt to write data, the ID pool threads won't show up.