Comma-Separated Config Option Values Via Environment Variable?
Hey @Bo & @porunov, I was trying out the retry logic added by my PR (https://github.com/JanusGraph/janusgraph/pull/4409) by setting environment variables in my Docker Compose file like so:
janusgraph.index.some_index_name_here.elasticsearch.retry-error-codes=408,429
Much like was done in the unit test I wrote (https://github.com/JanusGraph/janusgraph/blob/487e10ca276678862fd8fb369d6d17188703ba67/janusgraph-es/src/test/java/org/janusgraph/diskstorage/es/rest/RestClientSetupTest.java#L240). But to my surprise it doesn't seem to be received well during startup:
janusgraph | Caused by: java.lang.NumberFormatException: For input string: "408,429"
janusgraph |     at java.base/java.lang.NumberFormatException.forInputString(Unknown Source) ~[?:?]
.....
janusgraph |     at org.janusgraph.diskstorage.es.rest.RestClientSetup.connect(RestClientSetup.java:83) ~[janusgraph-es-1.1.0-20240427-134121.4bddfb4.jar:?]
What surprised me is that this option uses the same String[] parsing logic as other options, which got me curious whether others fail too. So I tried a more mundane one:
janusgraph.index.some_index_name_here.hostname=elasticsearch,elasticsearch
And it too doesn't seem to get parsed correctly:
janusgraph | Caused by: java.net.UnknownHostException: elasticsearch,elasticsearch
janusgraph |     at java.base/java.net.InetAddress$CachedAddresses.get(Unknown Source) ~[?:?]
....
janusgraph |     at org.elasticsearch.client.RestClient.performRequest(RestClient.java:288) ~[elasticsearch-rest-client-8.10.4.jar:8.10.4]
janusgraph |     at org.janusgraph.diskstorage.es.rest.RestElasticSearchClient.clusterHealthRequest(RestElasticSearchClient.java:171) ~[janusgraph-es-1.1.0-20240427-134121.4bddfb4.jar:?]
So I just wondered: is there a special syntax intended here?
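For context, the failure mode itself is easy to reproduce in plain Java. This is only an illustration (not JanusGraph's actual parsing code, and `parseCodes` is a hypothetical helper): `Integer.parseInt` chokes on the raw comma-separated value, while splitting first then parsing each piece works fine.

```java
import java.util.Arrays;

// Toy illustration (not JanusGraph's actual code) of why a raw
// comma-separated value blows up when handed straight to Integer.parseInt,
// while split-then-parse handles it.
public class CommaParseDemo {
    // Hypothetical helper: split on commas, trim, parse each piece.
    static int[] parseCodes(String raw) {
        return Arrays.stream(raw.split(","))
                .map(String::trim)
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        try {
            Integer.parseInt("408,429"); // raw value reaching parseInt directly
        } catch (NumberFormatException e) {
            // Same message as the startup error above
            System.out.println("NumberFormatException: " + e.getMessage());
        }
        System.out.println(Arrays.toString(parseCodes("408,429"))); // [408, 429]
    }
}
```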
I found the env var parsing in the docker startup script (https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-dist/docker/docker-entrypoint.sh#L41-L49)
but it seems like it should be faithfully passing things through: https://github.com/JanusGraph/janusgraph/blob/487e10ca276678862fd8fb369d6d17188703ba67/janusgraph-dist/docker/docker-entrypoint.sh#L48
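To make the pass-through concrete, here is a rough Java rendition of what that shell loop appears to do, under my reading of the script (the class and method names are illustrative, the real logic lives in docker-entrypoint.sh): every environment entry prefixed with `janusgraph.` gets the prefix stripped and is written out as a properties line, value untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough, illustrative Java rendition of the docker-entrypoint.sh loop:
// take every env entry prefixed with "janusgraph.", strip the prefix, and
// emit a properties entry with the value passed through verbatim.
public class EnvToProperties {
    static Map<String, String> translate(Map<String, String> env) {
        Map<String, String> props = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : env.entrySet()) {
            if (e.getKey().startsWith("janusgraph.")) {
                props.put(e.getKey().substring("janusgraph.".length()), e.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> env = new LinkedHashMap<>();
        env.put("janusgraph.index.search.elasticsearch.retry-error-codes", "408,429");
        env.put("PATH", "/usr/bin"); // unrelated vars are ignored
        // The comma-separated value survives translation intact, so any
        // breakage has to happen later, at config parsing time.
        System.out.println(translate(env));
    }
}
```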
Is there any chance you could try it out without using docker?
But Docker is provided as an official way to deploy JanusGraph, so shouldn't we be able to do this through Docker?
yeah I am just wondering if there's something wrong with our Docker image
I've never seen this bug before - but I've never used Docker to deploy JanusGraph
Solution
Okay, after quite a journey of vetting the parsing behavior from environment variables into property files, through the various Configuration implementations, into ConfigOption, etc., I've found where this went sideways... and like the worst of these situations, it looks like this one was self-inflicted 🤦♂️.
A long, long time ago, when I was starting to work with JanusGraph, I was tired of deployments that failed silently, leaving my container up but dead inside, for reasons that weren't its fault (e.g. starting up before Cassandra was ready and timing out the storage wait).
Trying to find a solution for that, I stumbled upon TinkerPop's CheckedGraphManager (https://github.com/apache/tinkerpop/blob/master/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/util/CheckedGraphManager.java), which automatically terminates the process if the specified graphs did not successfully open.
So I switched to that graph manager like so:
ENV gremlinserver.graphManager=org.apache.tinkerpop.gremlin.server.util.CheckedGraphManager
But this has the unintended consequence of bypassing the configuration parsing that JanusGraphManager provides (https://github.com/JanusGraph/janusgraph/blob/487e10ca276678862fd8fb369d6d17188703ba67/janusgraph-core/src/main/java/org/janusgraph/graphdb/management/JanusGraphManager.java#L73), like handling String array config options 🤦♂️.
Would there be appetite for something like a CheckedJanusGraphManager that extends JanusGraphManager and duplicates the logic that TinkerPop's CheckedGraphManager performs? I could attempt to put that up. It's handled in my k8s environment so the container doesn't just hang around; it's a faster alternative to waiting for probes to fail.
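The fail-fast behavior in question can be sketched without any TinkerPop dependencies. The types and names below are stand-ins, not the real GraphManager API; a hypothetical CheckedJanusGraphManager would extend JanusGraphManager (keeping its config handling) and add roughly this check after attempting to open the configured graphs.

```java
import java.util.Map;

// Dependency-free sketch of the fail-fast behavior discussed above. All
// names here are stand-ins for illustration; the real CheckedGraphManager
// lives in TinkerPop and performs a check like this at construction time.
public class CheckedManagerSketch {
    // After attempting to open all configured graphs, refuse to keep
    // running if none of them opened successfully.
    static void failIfNoGraphsOpened(Map<String, Object> openedGraphs) {
        if (openedGraphs.isEmpty()) {
            // The real implementation terminates the process so the
            // container dies instead of idling with no graphs.
            throw new IllegalStateException("No graphs opened successfully; shutting down");
        }
    }

    public static void main(String[] args) {
        try {
            failIfNoGraphsOpened(Map.of()); // nothing opened: fail fast
        } catch (IllegalStateException e) {
            System.out.println("Would exit: " + e.getMessage());
        }
        failIfNoGraphsOpened(Map.of("graph", new Object())); // at least one graph: fine
        System.out.println("ok");
    }
}
```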
But such a graph manager would also only work for problems during startup, right?
Doesn't our wait-for-storage logic during the startup of the Docker container also handle this problem?
https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-dist/docker/docker-entrypoint.sh#L71
I also wanted the container to die if it failed to successfully apply the schema I had defined for it to load on start.
I see, that's of course a different requirement
Doesn't it take quite some time to load the schema every time a container starts?
Most of the time the schema should already be fully loaded, no?
Of course, no argument there. But it's what drew me to the checked graph manager, albeit at the cost of unwittingly opting out of the JG add-ons with its graph manager.
Hence asking if there would be appetite for a PR that'd add a "Checked"JanusGraphManager
(Just trying to better understand your use case here)
The scripts check whether the schema was already loaded and skip doing it again.
Each script is versioned and writes a "completed version X" indicator into the graph
Loosely inspired by Liquibase
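That versioned-script approach can be sketched generically; everything below is illustrative rather than the poster's actual tooling, with a plain Map standing in for the "completed version X" marker stored in the graph.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of version-gated schema scripts (loosely
// Liquibase-style): each script records a "completed version X" marker,
// and reruns are skipped once the marker exists. A Map stands in for the
// marker that would really be written into the graph.
public class SchemaVersioner {
    private final Map<Integer, Boolean> completed = new LinkedHashMap<>();

    // Runs the script only if its version hasn't completed yet; returns
    // whether it actually ran.
    boolean applyIfNeeded(int version, Runnable script) {
        if (Boolean.TRUE.equals(completed.get(version))) {
            return false; // already applied on a previous startup
        }
        script.run();
        completed.put(version, true); // write the "completed version X" marker
        return true;
    }

    public static void main(String[] args) {
        SchemaVersioner v = new SchemaVersioner();
        System.out.println(v.applyIfNeeded(1, () -> System.out.println("applying v1")));
        System.out.println(v.applyIfNeeded(1, () -> System.out.println("applying v1")));
    }
}
```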
A problem I'm seeing with that approach is that schema changes require collaboration of all running JanusGraph instances which is hard to guarantee if you're automatically applying them every time a container starts.
At my company we are only applying schema changes manually and stop all other JanusGraph instances and also all clients so only a single instance is up and running where we then execute the schema script.
But in general, schema management is really a hard topic with JanusGraph which leads to a lot of headaches and it's definitely an area where JanusGraph could be improved a lot
Well, to be clear I'm not doing it in the containers that are running JG for purposes of hosting, it's done in a secondary loader container that runs just once at the deployment start
Ah, I see, then it even sounds similar to what we're doing
But now that I'm retracing the steps (I automated this a long time ago, so it's gotten a little dusty in my memory), I may have conflated an issue here
The reason I did it wasn't for schema
I know that problem 😅
Yeah, I ended up doing something similar with our old HBase system. It feels fairly commonplace to rebuild these tools 😅
The storage wait script does watch for Cassandra to be up, but IIRC it would just time out and not kill the container? So the graph would attempt to open, fail, and then JG would just hang around without any graphs open.
I guess if such a graph manager is useful for you, then it will probably also be useful for others. As a first step you could create an issue for this where you describe what that graph manager does differently and what benefits it brings: https://github.com/JanusGraph/janusgraph/issues/new?assignees=&labels=&projects=&template=feature-request.md
You could then wait a bit to see whether it gets any reactions and then provide a PR
Vs what I wanted was the container to die in that case. Which the checked graph manager would do.
With storage wait script you mean the docker-entrypoint.sh logic I mentioned earlier? If I'm not completely mistaken, that should make the startup fail before even starting JanusGraph Server, which means the container won't get up and running at all
It first tries to open the graph locally in the Gremlin Console via JanusGraphFactory.open() with your config, and only if that works does it start JanusGraph Server with the same config.
I'll have to reproduce the scenario before I create the feature request.
TBH this was a change I made over a year ago, so the details have fallen into the memory aether 😅
We also sharpened other k8s infra with liveness probes since then, so we may have addressed this issue, at least for ourselves, via other means.
Looking at my commit message for the change, I described it as: "Kill JG if the graph fails to open instead of idly hosting no graphs". So it seems I'm remembering the symptom correctly, but having defined reproduction steps would be beneficial for the feature request.