[parameterized queries] Increased time in query evaluation when gremlin server starts/restarts

Hi folks, I'm observing high latency in query evaluation whenever the JanusGraph server starts/restarts, and the degradation persists for at least 5 minutes.
* I'm using parameterized queries for the JanusGraph Server requests, so I expect some increased latency whenever the server starts/restarts, but the issue is that the degradation does not go away for at least 5 minutes and the evaluation latency goes from around 300 ms to 5k ms.
* JanusGraph is deployed in a Kubernetes cluster with 20 pods, so every time I redeploy the JanusGraph cluster this issue arises, which results in timeouts on the client side.
I wanted to know if there is some other way to add all the parameterized queries to the cache, so that by the time a started/restarted JanusGraph pod is ready to serve requests, all the parameterized queries are already in the cache.
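For context, this is roughly what one of our parameterized requests looks like (a minimal sketch using the TinkerPop Java driver; the query, host name and the binding name nameParam are made up for illustration). The script text is what the server compiles and caches, while the bindings change per request:

import java.util.Map;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ParameterizedQueryExample {
    public static void main(String[] args) throws Exception {
        // connect to one Gremlin Server / JanusGraph pod (host and port are placeholders)
        Cluster cluster = Cluster.build("janusgraph-service").port(8182).create();
        Client client = cluster.connect();

        // the script string is the cache key on the server; only the bindings vary per call,
        // so repeated submissions of the same string skip Groovy recompilation
        String script = "g.V().has('name', nameParam).valueMap()";
        Map<String, Object> bindings = Map.of("nameParam", "alice");

        client.submit(script, bindings).all().join()
              .forEach(r -> System.out.println(r.getObject()));

        cluster.close();
    }
}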
19 Replies
shivam.choudhary
shivam.choudharyOP2y ago
The time for query evaluation goes up to 30k ms on pod starts/restarts and remains high for around 5 minutes before settling back to previous levels.
[screenshot: response-time panel for the JanusGraph cluster]
shivam.choudhary
shivam.choudharyOP2y ago
The above panel shows response times for a JanusGraph cluster with 20 JanusGraph instances. @KelvinL is there a way this can be avoided? TIA.
spmallette
spmallette2y ago
i'm not sure that i can think of any other way to do this than the obvious one: your application has to send the queries you wish to cache to the server when it starts up. as there is no shared cache, you will need to be sure that the queries are sent to each node in the cluster. the only other thing i can think of that you could try would be to write this pre-caching function into the init script for the server (which would execute on each node) and do it from there. the one catch is that the init script execution happens before the server is fully ready to accept requests, so you'd have to spawn a thread (or maybe start a separate application??) that tests for when the server is fully started and then sends all your pre-caching requests.
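if it helps, here's a rough sketch of that first option with the Java driver (the pod addresses and the query list are placeholders, not a drop-in implementation). each pod gets every script once so the Groovy compilation happens before real traffic arrives:

import java.util.List;
import java.util.Map;
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class ScriptCacheWarmup {
    // every parameterized script your application uses (placeholders for illustration)
    static final List<String> SCRIPTS = List.of(
            "g.V().has('name', nameParam).valueMap()",
            "g.V().has('name', nameParam).out('knows').limit(10)");

    public static void main(String[] args) throws Exception {
        // there is no shared cache, so each pod must be warmed individually;
        // pod addresses are placeholders (e.g. headless-service DNS entries in k8s)
        for (String host : List.of("janusgraph-0", "janusgraph-1")) {
            Cluster cluster = Cluster.build(host).port(8182).create();
            Client client = cluster.connect();
            for (String script : SCRIPTS) {
                // dummy binding values: the compiled script is cached by its text,
                // so the result of the warm-up query itself does not matter
                client.submit(script, Map.of("nameParam", "warmup")).all().join();
            }
            cluster.close();
        }
    }
}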
shivam.choudhary
shivam.choudharyOP2y ago
@spmallette Thanks for the detailed approach. We did consider doing this at startup, but as the approach looks a bit hacky I was checking if there is any standard way of registering all the parameterized queries before using them. Also, I observed that whenever a new pod starts up the number of active threads increases rapidly from roughly 28 to 1000; can that be the reason for the increased evaluation time? The latency we are observing goes up to 5000 ms from a mere 50 ms.
shivam.choudhary
shivam.choudharyOP2y ago
[screenshot attached]
spmallette
spmallette2y ago
The number of threads shouldn't exceed the size of the gremlinPool setting, unless you're using sessions, which technically don't have an upper bound at this point. i don't know if the thread increase is extending your evaluation time; i sense that it's more just query compilation time stacking up on you. switching from scripts to bytecode requests would resolve this caching issue, if that is actually the problem. there is no cache for bytecode; it's simply faster. and if you're very adventurous you could test out the GremlinLangScriptEngine instead of the groovy one; tests are showing it to be the fastest way to process Gremlin. unfortunately it does not yet handle parameters (there hasn't been a ton of need for that because there is no caching needed as there is for Groovy).
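for illustration, a bytecode-style request looks roughly like this (a minimal sketch; the host and the traversal are placeholders). the "parameter" is just a plain Java value embedded in the traversal, so there is no script compilation or server-side cache involved:

import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import static org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource.traversal;

public class BytecodeRequestExample {
    public static void main(String[] args) throws Exception {
        // remote traversal source "g" must match the one configured on the server
        GraphTraversalSource g = traversal().withRemote(
                DriverRemoteConnection.using("janusgraph-service", 8182, "g"));

        // the parameter is an ordinary Java value carried in the bytecode,
        // so nothing has to be compiled or cached on the server side
        String name = "alice";
        g.V().has("name", name).valueMap().toList()
         .forEach(System.out::println);

        g.close(); // closes the underlying remote connection
    }
}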
shivam.choudhary
shivam.choudharyOP2y ago
unfortunately it does not yet handle parameters
We have designed the architecture around the usage of parameters in the query, so it won't be possible for now to switch away from it. Btw, I profiled the JVM and it turns out that these ~1000 threads which are getting created belong to StandardIDPool.
spmallette
spmallette2y ago
We have designed the architecture around the usage of parameters in the query, so it won't be possible for now to switch away from it.
that functionality isn't available yet, but will be soon. there will still be limitations, as it seemed important to restrict Gremlin a bit more than Groovy, but for standard uses like has('name',x) or property('name',x), which i assume is the kind of thing you're doing, it should work. hoping it lands with 3.7.0, though i'm not sure we'll make it the default in Gremlin Server for a long time.
Btw, I profiled the JVM and it turns out that these ~1000 threads which are getting created belong to StandardIDPool
i don't know offhand if that is normal, as that is not a Gremlin Server thread pool; i believe it is a JanusGraph one. does anyone at @janusgraph know if it is expected to generate that many threads under these conditions?
Bo
Bo2y ago
No, that does not sound normal. @shivam.choudhary Did you change the ids.num-partitions config to a large number?
shivam.choudhary
shivam.choudharyOP2y ago
@boxuanli I checked it and the value is set to 1024. As the config is fixed for the lifetime of the graph, we wanted to have the graph sufficiently partitioned so that we could create partitioned vertex labels for supernodes. As of now we haven't had the requirement to create a partitioned label; are there issues we might face due to this down the line? Also, we use Bigtable as our storage backend; I'm not sure how that factors in here, as we mostly have only 1 or 2 Bigtable nodes depending on the traffic.
Bo
Bo2y ago
Setting that number to 1024 means you have 1024 threads for StandardIDPool
shivam.choudhary
shivam.choudharyOP2y ago
@spmallette I checked the longRunCompilationCount metric, which gives the count of events where the script compilation time was more than the expectedCompilationTime (I configured it as 100 milliseconds), and it came out to be 0 (actually it was 1, due to a script which gets evaluated on startup by default and took around 2403 ms). This means that query compilation is not taking much time, but the latency we are observing is still of the order of ~500 ms for a few minutes after startup.
spmallette
spmallette2y ago
@shivam.choudhary did the size of the StandardIDPool have anything to do with this problem?
shivam.choudhary
shivam.choudharyOP2y ago
I'm still figuring out a way to rule out whether StandardIDPool has anything to do with this problem. I have tried several approaches but nothing conclusive so far. As the config which sets the size of the StandardIDPool cannot be changed, it is getting a bit challenging.
Flynnt
Flynnt2y ago
There is a difference between cluster.max-partitions and ids.num-partitions. On startup, each JanusGraph server instance tries to retrieve a block of ids (of the size defined in ids.block-size) for each of the 1024 ids.num-partitions partitions (the parameter name is quite confusing in my opinion). Each block allocation results in a lock in the "id allocation table" in Janus. So in your case, when you restart your instances, the table takes 20 x 1024 allocation requests and has to lock and verify for each block that it has not been allocated to another instance. What's your value for cluster.max-partitions?
shivam.choudhary
shivam.choudharyOP2y ago
Sorry, I had a mix-up last time: the value of ids.num-partitions is set to 10 and the value of cluster.max-partitions is set to 1024.
Flynnt
Flynnt2y ago
Have you tried starting your graph in read-only mode? It would disable the idPool stuff. Can you give us your janusgraph.properties file? It really looks like you have a bottleneck on startup with the id block retrieval. Did you change your connection settings to your backend (number of connections, timeout...)?
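For reference, the toggle I mean is (as far as I know) the standard storage.read-only option; a minimal sketch of what you would add to janusgraph.properties on the instances you want to test, assuming the rest of your config stays the same:

# read-only instances reserve no ID blocks, so the StandardIDPool threads should not appear;
# note that any write attempted against such an instance will fail
storage.read-only: true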
shivam.choudhary
shivam.choudharyOP2y ago
No, we haven't changed anything related to the backend connection, but we did change ids.block-size when we were initially ingesting the data into the graph, as the data was huge. Please find the janusgraph.properties below:
properties:
storage.backend: hbase
storage.directory: null
storage.hbase.ext.google.bigtable.instance.id: ##########
storage.hbase.ext.google.bigtable.project.id: ##########
storage.hbase.ext.google.bigtable.app_profile.id: ############
storage.hbase.ext.hbase.client.connection.impl: com.google.cloud.bigtable.hbase2_x.BigtableConnection
storage.hbase.short-cf-names: true
storage.hbase.table: ###########
cache.db-cache: false
cache.db-cache-clean-wait: 20
cache.db-cache-time: 180000
cache.db-cache-size: 0.5
cluster.max-partitions: 1024
graph.replace-instance-if-exists: true
metrics.enabled: true
metrics.jmx.enabled: true
ids.block-size: "1000000"
query.batch: true
query.limit-batch-size: true
schema.constraints: true
schema.default: none
storage.batch-loading: false
storage.hbase.scan.parallelism: 10
Currently I'm working on setting up the read-only JanusGraph instances; I will be able to test them soon with the load which we have on the current JanusGraph instances.
Bo
Bo2y ago
I cannot remember the difference between ids.num-partitions and cluster.max-partitions, so I might have made a mistake. Sorry about that. Btw, I don't think setting a read-only JanusGraph instance per se will make a difference. If you don't create any new data, the JanusGraphID threads won't be created in the first place. So technically it doesn't matter whether you set read-only or not: as long as you don't attempt to write data, the JanusGraphID threads won't show up.