Setting index in gremlin-python

I was trying to create the vote graph from the tutorial on loading data in gremlin-python, and afaik you can't simply add an index from non-JVM languages because, for example, there is no TinkerGraph that you could .open(). I don't know how much better performance is with an index on 'userId', but my code simply takes too long to go through the queries from the vote file. I tried using the client functionality
ws_url = 'ws://localhost:8182/gremlin'

# Create index on userId
client = Client(ws_url, 'g')
client.submit('graph = TinkerGraph.open()')
client.submit("graph.createIndex('userId', Vertex.class)")
client.close()

conn = DriverRemoteConnection(ws_url, 'g')
g = traversal().with_remote(conn)
to do it from a string query, and I'm not sure whether with_remote(conn) uses the previously assigned graph. Let me know how to do it correctly. I'm also not sure how to assign to g from client.submit(...). Additionally: how does one speed up those queries if setting an index won't do it? In my implementation
def idToNode(g: GraphTraversalSource, id: str):
    return g.V().has('user', 'userId', id) \
        .fold() \
        .coalesce(__.unfold(),
                  __.add_v('user').property('userId', id)) \
        .next()

def loadVotes():
    with open("/tmp/wiki-Vote.txt", "r") as file:
        for _ in range(4):
            next(file)

        for line in file:
            ids = line.split('\t')
            from_node = idToNode(g, ids[0])
            to_node = idToNode(g, ids[1])
            g.add_e('votesFor').from_(from_node).to(to_node).iterate()
the call to idToNode for each line takes too long.
spmallette•6mo ago
TinkerPop doesn't generalize APIs for indices. You have to rely on the APIs provided by the graph itself to do that. As there is no general API for indices, programming language support for Gremlin like Python, .NET, JavaScript and others doesn't offer any such access. With TinkerGraph and its indexing functions it so happens that because it is a graph written in Java, programming languages on the JVM, like Java itself, Groovy, Scala, etc. will have access to creating indices in the way you've seen in the tutorial. You would find similar issues with other graphs as well, like Neo4j or JanusGraph.
You are left with a couple of options when it comes to working with non-JVM languages:
1. Create the TinkerGraph in Java, save it to GraphML or similar output, then configure your TinkerGraph in Gremlin Server to create the index in the server startup script and then load the file from disk.
2. Send Gremlin scripts to Gremlin Server from Python to create the index and load the data. Gremlin scripts are currently processed using Groovy, so you are basically just using Java APIs remotely from Python this way.
Generally speaking, you typically use Gremlin Console for approach 1. Folks tend to use it for administrative tasks like loading data, establishing schemas, creating indices, etc. You can read more about how to send scripts in Python for approach 2 here: https://tinkerpop.apache.org/docs/current/reference/#gremlin-python-scripts
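A minimal sketch of approach 2 from Python, under some assumptions: the server is reachable at ws://localhost:8182/gremlin, its yaml exposes the graph under the name graph, and script execution is enabled. The helper name create_user_index is hypothetical. Unlike the attempt in the question, the script does not call TinkerGraph.open(); it relies on the graph binding that the server configuration already provides.

```python
# Script text sent to Gremlin Server; "graph" is assumed to be the key
# configured under "graphs" in the server yaml file.
INDEX_SCRIPT = "graph.createIndex('userId', Vertex.class)"


def create_user_index(client, script=INDEX_SCRIPT):
    """Submit the index-creation script and wait for the server to answer."""
    # Client.submit() returns a ResultSet; all().result() blocks until the
    # script has finished (or raises if the server reports an error).
    return client.submit(script).all().result()


def create_index_on_local_server(url='ws://localhost:8182/gremlin'):
    """Hypothetical convenience wrapper; the URL is an assumption."""
    # Imported lazily so the helper above can be tried without a server.
    from gremlin_python.driver.client import Client

    client = Client(url, 'g')
    try:
        return create_user_index(client)
    finally:
        client.close()
```

Since TinkerGraph is in-memory, this would have to be run again after every server restart, before any data is loaded.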
qfelOP•6mo ago
The second approach was my goal, but i am not sure if client.submit(...) from the first code fragment worked correctly, i.e. created an index that is used later in the Python traversal. Do i need to assign g the same way i assigned TinkerGraph to graph? in other words: how do i expose a graph that was assigned remotely to Python?
spmallette•6mo ago
graph is exposed as part of the "graphs" configuration in the server yaml file: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server.yaml#L23
spmallette•6mo ago
graph could be anything you name it in that file. it's just the unique key connected to that specified configuration
qfelOP•6mo ago
Hmm but afaik Python can only connect via .with_remote(), so i can't call with_graph() directly from Python - is with_remote(conn) aware of index on graph?
spmallette•6mo ago
creating an index needs to be done just once with TinkerGraph, then you load your data and it will use the index for any future Gremlin queries you run on it (where it can, of course). Since TinkerGraph is an in-memory graph, the index will be destroyed once the process exits. You would want to create the index each time you spin TinkerGraph back up and load data back into it.
let me try to explain again. i'll be more specific (i don't think i did a good job with my two options in retrospect)
spmallette•6mo ago
just do this...in your Gremlin Server yaml file you have a pointer to a startup script: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-modern.yaml#L29
spmallette•6mo ago
note that the startup script is just groovy and "onStartUp" it loads the "modern" data into the graph: https://github.com/apache/tinkerpop/blob/master/gremlin-server/scripts/generate-modern.groovy#L28
Solution
spmallette•6mo ago
if you simply edit that line of code to create your index and load your vote data, every time you start Gremlin Server it will have that all setup and ready to go.
spmallette•6mo ago
you will further note that a little further down in that startup script you see the creation of g, which is what your python code will reference when it connects to send Gremlin queries (see tinkerpop/gremlin-server/scripts/generate-modern.groovy at master on GitHub).
qfelOP•6mo ago
Okay i created conf/gremlin-server-votes.yaml
...
org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/sample-votes.groovy]}}}}
...
and scripts/sample-votes.groovy
...
globals << [hook : [
  onStartUp: { ctx ->
    ctx.logger.info("Loading 'VOTES' graph data.")
    org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
    graph.createIndex('userId', Vertex.class)
  }
] as LifeCycleHook]
...
I call it using gremlin-server conf/gremlin-server-votes.yaml and now - how do i connect to such server from Gremlin Console? Is it
gremlin> :remote connect tinkerpop.server conf/remote.yaml
and then
graph = TinkerGraph.open()
? graph.getIndexedKeys(Vertex) doesn't return anything D:
it is confusing that such script does globals << [g : traversal().withEmbedded(graph)] yet client (console or other connection) still needs to assign g
also i can't locate remote.yaml
spmallette•6mo ago
this part of the script:
org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
graph.createIndex('userId', Vertex.class)
is loading TinkerGraph with the "modern" graph, i assume you don't want that. i think you just want to do:
graph.createIndex('userId',Vertex.class)

// load the wiki vote data here
then start up the server, connect with gremlin-python and query your graph.
if for some reason you want to connect with Gremlin Console using:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
you can do that too. that will let you send Gremlin scripts to that graph as well. that approach goes back to my earlier "approach 1"
"it is confusing that such script does globals << [g : traversal().withEmbedded(graph)] yet client (console or other connection) still needs to assign g"
well, that script is local to the server so i think it makes sense in that context 🤔
qfelOP•6mo ago
it is only for testing purposes, ultimately i want to load wiki votes in Python and not in a Groovy script.
Yet it doesn't find the index when opening the graph
spmallette•6mo ago
"also i can't locate remote.yaml"
that should be right in the conf/ directory of your Gremlin Console home.
sorry, not sure what you mean by "find the index" - could you clarify a bit please?
qfelOP•6mo ago
I think there must be another way to open the graph that i prepared in the server script - connecting to the server and doing TinkerGraph.open() opens a new graph, where there is neither the modern graph nor any indexes (TinkerGraph.open().getIndexedKeys(Vertex) doesn't return anything)
spmallette•6mo ago
oh, maybe i know the confusion.
so when you do this:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
all that does is create a connection to the running Gremlin Server that is hosting your graph.
you get output like this:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
what's the next thing you type at the prompt after that? if it's this:
gremlin> TinkerGraph.open().getIndexedKeys(Vertex)
then what you've done is created a new TinkerGraph locally in the Gremlin Console and would thus have no indices
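The same distinction applies from Python: the index lives on the graph hosted by the server, so any check has to be sent to the server as a script. A rough sketch, assuming the server exposes the graph as graph and accepts scripts; indexed_vertex_keys is a hypothetical helper name, and because the exact shape of the returned results may depend on the serializer in use, the helper flattens either form.

```python
def indexed_vertex_keys(client):
    """Ask the server-hosted graph which vertex property keys are indexed."""
    # The script runs on the server, against the graph configured there,
    # not against a fresh TinkerGraph.open() instance.
    results = client.submit("graph.getIndexedKeys(Vertex.class)").all().result()

    keys = set()
    for item in results:
        if isinstance(item, str):
            keys.add(item)      # results arrived as individual strings
        else:
            keys.update(item)   # results arrived as a collection of strings
    return keys
```

With a gremlin_python.driver.client.Client connected to the server, indexed_vertex_keys(client) would be expected to contain 'userId' once the startup script has created the index.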
qfelOP•6mo ago
and there is no graph or g exported directly from the server. how do i get them?
spmallette•6mo ago
could you copy/paste your console session text and what errors/output you're getting?
qfelOP•6mo ago
         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.gephi
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> graph
No such property: graph for class: groovysh_evaluate
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin> g
No such property: g for class: groovysh_evaluate
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin>
spmallette•6mo ago
yep. so, Gremlin Console is a REPL and its execution is local. that first command creates a connection, but the next two commands just execute locally, and there is no graph or g defined there. but if you :submit the command to the server over the current :remote, it won't execute locally but will instead send that script to the server. you can do that like:
gremlin> :submit graph
// or more commonly
gremlin> :> graph
or even better, what most folks do, is flip the console into "remote mode" by immediately setting it up like this:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[0aff5c18-335c-4a64-80f6-cf5301353bb8] - type ':remote console' to return to local mode
gremlin> graph
qfelOP•6mo ago
oohh okay :goodgremlin:
spmallette•6mo ago
sorry - it's not as intuitive as it could be.
remoting came later for Gremlin Console. it was a local REPL first and we built in remoting after the fact. it works but sometimes folks don't see how it works immediately
qfelOP•6mo ago
okay so i think that index works fine and it's loaded in python but it takes long to insert those vertices D: how long should it take normally?
qfelOP•6mo ago
thanks for your help :goodgremlin:
spmallette•6mo ago
glad it's working. it shouldn't take too long to load wiki vote. i haven't done it in a while though. how long is it taking?
qfelOP•6mo ago
~70 seconds
spmallette•6mo ago
that could be about right for 7,115 vertices and 103,689 edges and remote operations. the code example isn't meant to be extremely efficient in how it loads the data. it's more meant as an intro to get started.
qfelOP•6mo ago
is there a demo that focuses on loading data efficiently?
spmallette•6mo ago
the thing about efficient loading is that it is highly graph specific. it's not really an issue with TinkerPop because each graph has its own way to optimize loading. with Amazon Neptune you might go with the CSV loader. For JanusGraph you might use OLAP/Spark. So in that way it's hard to offer general guidelines.
For TinkerGraph and loading from python by sending a script with client.submit() explicitly, i suppose the first thing i'd do is parameterize the scripts to take advantage of the cache described here: https://tinkerpop.apache.org/docs/current/reference/#parameterized-scripts
You might also try batching requests. the example for the wiki vote load basically makes three requests per edge if it's executed as it is. you could reduce that in a variety of ways. an easy one is to load all the vertices first, since there's only about 7,000 of them, and hold their references in a cache client-side. then reference them as needed as you iterate through the 100,000 edges rather than doing a "get or create" request for each in/out vertex on every edge.
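The client-side cache idea could look roughly like the sketch below. It assumes the graph starts out empty (so a plain add_v replaces the "get or create" lookup), that g is a gremlin-python traversal source already connected as in the question, and that the file has four header lines followed by tab-separated id pairs. The names parse_votes and load_votes are hypothetical. Each line is stripped before splitting so that a trailing newline does not end up inside the second userId.

```python
def parse_votes(lines, header_lines=4):
    """Yield (from_id, to_id) pairs from wiki-Vote.txt style lines."""
    for number, line in enumerate(lines):
        if number < header_lines:
            continue  # skip the comment header at the top of the file
        parts = line.strip().split('\t')
        if len(parts) == 2:
            yield parts[0], parts[1]


def load_votes(g, lines):
    """Create every vertex once, cache it, then add the edges."""
    edges = list(parse_votes(lines))

    # Phase 1: one request per distinct user; keep the returned vertex
    # reference in a client-side dict keyed by userId.
    cache = {}
    for edge in edges:
        for user_id in edge:
            if user_id not in cache:
                cache[user_id] = (
                    g.add_v('user').property('userId', user_id).next()
                )

    # Phase 2: one request per edge, reusing the cached vertex references.
    for from_id, to_id in edges:
        g.add_e('votesFor').from_(cache[from_id]).to(cache[to_id]).iterate()

    return len(cache), len(edges)
```

Usage would be along the lines of opening /tmp/wiki-Vote.txt and calling load_votes(g, file). For the numbers mentioned above, that is about one request per vertex plus one per edge instead of three per edge.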