Setting index in gremlin-python
I was trying to create the vote graph from the tutorial on loading data, using gremlin-python, and afaik you can't simply add an index from non-JVM languages because, for example, there is no TinkerGraph that you could .open(). I don't know how much better performance is with an index on 'userId', but my code simply takes too long to go through the queries from the vote file.
I tried using the client functionality to do it from a string query, and i'm not sure if with_remote(conn) uses the previously assigned graph. Let me know how to do it correctly. I'm not sure how to assign to g from client.submit(...).
Additionally: how does one speed up those queries, if setting an index won't do it?
In my implementation, the call to idToNode for each line takes too long.
TinkerPop doesn't generalize APIs for indices. You have to rely on the APIs provided by the graph itself to do that. As there is no general API for indices, the Gremlin language variants for Python, .NET, JavaScript and others don't offer any such access.
With TinkerGraph and its indexing functions, it so happens that because it is a graph written in Java, programming languages on the JVM, like Java itself, Groovy, Scala, etc., will have access to creating indices in the way you've seen in the tutorial. You would find similar issues with other graphs as well, like Neo4j or JanusGraph.
You are left with a couple of options when it comes to working with non-JVM languages:
1. Create the TinkerGraph in Java, save it to GraphML or similar output, then configure your TinkerGraph in Gremlin Server to create the index in the server startup script and then load the file from disk.
2. Send Gremlin scripts to Gremlin Server from Python to create the index and load the data. Gremlin scripts are currently processed using Groovy so you basically are just using Java APIs remotely from python this way.
Generally speaking, you typically use Gremlin Console for approach 1. Folks tend to use it for administrative tasks like loading data, establishing schemas, creating indices, etc. You can read more about how to send scripts in Python for approach 2 here: https://tinkerpop.apache.org/docs/current/reference/#gremlin-python-scripts
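A minimal sketch of approach 2 from Python, under some assumptions: the server yaml binds the TinkerGraph to the name graph, Vertex resolves inside server-side scripts, and create_index and index_key are names made up for this example. The script string is Groovy that runs on the server; Python only ships it over with client.submit().

```python
def create_index(gremlin_client, key):
    """Ask the server to index vertices on `key`; return what the server reports."""
    # This string is Groovy and runs inside Gremlin Server, where `graph` is
    # assumed to be the name bound to the TinkerGraph in the server yaml file.
    script = (
        "graph.createIndex(index_key, Vertex.class); "
        "graph.getIndexedKeys(Vertex.class)"
    )
    # The binding is called `index_key` because Gremlin Server reserves some
    # names (such as `key`, `id`, `label`) and rejects them as bindings.
    return gremlin_client.submit(script, {"index_key": key}).all().result()


# Usage against a real server (assumes gremlin-python is installed and
# Gremlin Server is listening on localhost:8182):
#   from gremlin_python.driver import client
#   c = client.Client('ws://localhost:8182/gremlin', 'g')
#   print(create_index(c, 'userId'))
#   c.close()
```

Since TinkerGraph is in-memory, an index created this way only lasts as long as the server process does.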
Second approach was my goal, but i am not sure if client.submit(...) from the first code fragment worked correctly, creating an index that is used later in Python's traversal. Do i need to assign g the same way i assigned TinkerGraph to graph?
in other words: how do i expose such a graph, remotely assigned, to Python?
graph is exposed as part of the "graphs" configuration in the server yaml file: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server.yaml#L23
graph could be anything you name it in that file. it's just the unique key connected to that specified configuration.
Hmm, but afaik Python can only connect via .with_remote(), so i can't call with_graph() directly from Python. Is with_remote(conn) aware of the index on graph?
creating an index needs to be done just once with TinkerGraph, then you load your data and it will use the index for any future Gremlin queries you run on it (where it can, of course). Since TinkerGraph is an in-memory graph, the index will be destroyed once the process exits. You would want to create the index each time you spin TinkerGraph back up and load data back into it.
let me try to explain again. i'll be more specific
(i don't think i did a good job with my two options in retrospect)
just do this...in your Gremlin Server yaml file you have a pointer to a startup script: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-modern.yaml#L29
note that the startup script is just groovy and "onStartUp" it loads the "modern" data into the graph: https://github.com/apache/tinkerpop/blob/master/gremlin-server/scripts/generate-modern.groovy#L28
Solution
if you simply edit that line of code to create your index and load your vote data, every time you start Gremlin Server it will have that all setup and ready to go.
you will further note that a little way further down in that startup script (generate-modern.groovy) you see the creation of g, which is what your python code will reference when it connects to send Gremlin queries.
Okay, i created conf/gremlin-server-votes.yaml and scripts/sample-votes.groovy. I call it using gremlin-server conf/gremlin-server-votes.yaml
and now - how do i connect to such server from Gremlin Console? Is it
and then
?
graph.getIndexedKeys(Vertex) doesn't return anything D:
it is confusing that such script does globals << [g : traversal().withEmbedded(graph)] yet client (console or other connection) still needs to assign g
also i can't locate remote.yaml
this part of the script:
org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
is loading TinkerGraph with the "modern" graph, i assume you don't want that. i think you just want to do:
graph.createIndex('userId', Vertex.class)
then startup the server and start connecting with gremlin-python and query your graph. if for some reason you want to connect with Gremlin Console using:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
you can do that too. that will let you send Gremlin scripts to that graph as well. that approach goes back to my earlier "approach 1"
"it is confusing that such script does globals << [g : traversal().withEmbedded(graph)] yet client (console or other connection) still needs to assign g"
well, that script is local to the server so i think it makes sense in that context 🤔
it is only for testing purposes, ultimately i want to load wiki votes in Python and not in a Groovy script
Yet it doesn't find the index when opening the graph
"also i can't locate remote.yaml"
that should be right in the conf/ directory of your Gremlin Console home
sorry, not sure what you mean by "find the index" - could you clarify a bit please?
I think that there is other way to open graph that i prepared in server script - connecting to server and doing TinkerGraph.open() opens new graph, where there is no modern graph nor indexes (TinkerGraph.open().getIndexedKeys(Vertex) doesn't return anything)
oh, maybe i know the confusion
so when you do this:
gremlin> :remote connect tinkerpop.server conf/remote.yaml
all that does is create a connection to the running Gremlin Server that is hosting your graph. you get output like this:
what's the next thing you type at the prompt after that? if it's this:
then what you've done is created a new TinkerGraph locally in the Gremlin Console and would thus have no indices
and there is no graph or g exported directly from server
how to get them?
could you copy/paste your console session text and what errors/output you're getting?
yep
so, Gremlin Console is a REPL and its execution is local. that first command creates a connection, but the next two commands just execute locally. there is no graph or g defined there
but, if you :submit the command to the server over the current :remote it won't execute locally but instead send that script to the server. you can do that like:
or even better, what most folks do, is flip the console into "remote mode" by immediately setting it up like this:
oohh okay :goodgremlin:
sorry - it's not as intuitive as it could be
remoting came later for Gremlin Console. it was a local REPL first and we built in remoting later after the fact. it works but sometimes folks don't see how it works immediately
okay so i think that index works fine and it's loaded in python but it takes long to insert those vertices D:
how long should it take normally?
thanks for your help :goodgremlin:
glad it's working. it shouldn't take too long to load wiki vote. i haven't done it in a while though. how long is it taking?
~70 seconds
that could be about right for 7,115 vertices and 103,689 edges and remote operations. the code example isn't meant to be extremely efficient in how it loads the data. it's more meant as an intro to get started.
is there demo that focuses on loading efficiently some data?
the thing about efficient loading is that it is highly graph specific. it's not really an issue with TinkerPop because each graph has its own way to optimize loading. with Amazon Neptune you might go with the CSV loader. For JanusGraph you might use OLAP/Spark. So in that way it's hard to offer general guidelines.
For TinkerGraph and loading from python by sending a script with client.submit() explicitly, i suppose the first thing i'd do is parameterize the scripts to take advantage of the cache described here: https://tinkerpop.apache.org/docs/current/reference/#parameterized-scripts
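A rough sketch of what parameterizing could look like, assuming a 'user' vertex label and a traversal source named g on the server. add_user, ADD_USER and user_id are names invented for this example. The point is that the script text stays identical on every request and only the bindings change, whereas formatting the id into the string would produce a new script each time.

```python
# One constant script string. Only the bindings differ between requests, so
# the server can reuse the compiled script instead of compiling it again.
# The 'user' label and the traversal source name `g` are assumptions.
ADD_USER = "g.addV('user').property('userId', user_id).id()"


def add_user(gremlin_client, user_id):
    """Create one vertex and return the id the graph assigned to it."""
    results = gremlin_client.submit(ADD_USER, {"user_id": user_id}).all().result()
    return results[0]


# Usage against a real server (assumes gremlin-python is installed and
# Gremlin Server is listening on localhost:8182):
#   from gremlin_python.driver import client
#   c = client.Client('ws://localhost:8182/gremlin', 'g')
#   vertex_id = add_user(c, 30)
#   c.close()
```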
You might also try batching requests. the example for the wiki vote load basically makes three requests per edge if it's executed as it is. you could reduce that in a variety of ways. an easy one is to load all the vertices first, since there are only about 7000 of them, and hold their references in a cache client-side. then reference them as needed as you iterate through the 100,000 edges rather than doing a "get or create" request for each in/out vertex on every edge.
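The vertex-cache idea could be sketched roughly as follows. The assumptions: the vote file holds one whitespace-separated pair of ids per line with '#' comment lines, the labels are 'user' and 'votesFor', and load_votes, ADD_USER and ADD_VOTE are names invented for this example. With 7,115 vertices and 103,689 edges, this comes to 7,115 + 103,689 = 110,804 requests instead of roughly 3 × 103,689 = 311,067.

```python
# Constant, parameterized scripts (labels and binding names are assumptions).
ADD_USER = "g.addV('user').property('userId', user_id).id()"
ADD_VOTE = "g.V(src).addE('votesFor').to(__.V(dst)).iterate()"


def load_votes(gremlin_client, lines):
    """Load vote lines, creating each vertex exactly once.

    Returns (number of vertices created, number of edges created).
    """
    # Parse "src dst" pairs, skipping blank lines and '#' comments.
    pairs = [tuple(int(tok) for tok in line.split())
             for line in lines
             if line.strip() and not line.startswith("#")]

    # Pass 1: one request per distinct user; cache the graph-assigned id.
    vertex_ids = {}
    for user_id in sorted({u for pair in pairs for u in pair}):
        result = gremlin_client.submit(ADD_USER, {"user_id": user_id}).all().result()
        vertex_ids[user_id] = result[0]

    # Pass 2: one request per edge; both endpoints come from the local cache,
    # so no "get or create" lookups are sent to the server.
    for src, dst in pairs:
        bindings = {"src": vertex_ids[src], "dst": vertex_ids[dst]}
        gremlin_client.submit(ADD_VOTE, bindings).all().result()

    return len(vertex_ids), len(pairs)
```

This still sends one request per edge. Sending several edges per request would cut the count further, at the cost of a more involved script.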