Apache TinkerPop•10mo ago

Setting index in gremlin-python

I was trying to create the vote graph from the tutorial on loading data, using gremlin-python, and afaik you can't simply add an index from non-JVM languages because, for example, there is no TinkerGraph that you could .open(). I don't know how much better performance is with an index on 'userId', but my code simply takes too long to go through the queries from the vote file. I tried using the client functionality

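from gremlin_python.driver.client import Client
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

ws_url = 'ws://localhost:8182/gremlin'

# Create index on userId
client = Client(ws_url, 'g')
client.submit('graph = TinkerGraph.open()')
client.submit("graph.createIndex('userId', Vertex.class)")
client.close()

conn = DriverRemoteConnection(ws_url, 'g')
g = traversal().with_remote(conn)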

to do it from a string query, but I'm not sure whether with_remote(conn) uses the previously assigned graph. Let me know how to do it correctly. I'm not sure how to assign to g from client.submit(...). Additionally: how does one speed up those queries if setting an index won't do it? In my implementation

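from gremlin_python.process.graph_traversal import GraphTraversalSource, __

def idToNode(g: GraphTraversalSource, id: str):
    # Get-or-create: return the existing 'user' vertex or add a new one.
    return g.V().has('user', 'userId', id) \
            .fold() \
            .coalesce(__.unfold(),
                      __.add_v('user').property('userId', id)) \
            .next()

def loadVotes():
    with open("/tmp/wiki-Vote.txt", "r") as file:
        # Skip the four header lines.
        for _ in range(4):
            next(file)

        for line in file:
            # strip() drops the trailing newline that would otherwise stay in ids[1].
            ids = line.strip().split('\t')
            from_node = idToNode(g, ids[0])
            to_node = idToNode(g, ids[1])
            g.add_e('votesFor').from_(from_node).to(to_node).iterate()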

the call to idToNode for each line takes too long.

spmallette•10mo ago

TinkerPop doesn't generalize APIs for indices. You have to rely on the APIs provided by the graph itself to do that. As there is no general API for indices, programming language support for Gremlin like python, .NET, javascript and others doesn't offer any such access. TinkerGraph happens to be a graph written in Java, so programming languages on the JVM, like Java itself, Groovy, Scala, etc. have access to its indexing functions and can create indices in the way you've seen in the tutorial. You would find similar issues with other graphs as well, like Neo4j or JanusGraph.

You are left with a couple of options when it comes to working with non-JVM languages:

1. Create the TinkerGraph in Java, save it to GraphML or similar output, then configure your TinkerGraph in Gremlin Server to create the index in the server startup script and then load the file from disk.
2. Send Gremlin scripts to Gremlin Server from Python to create the index and load the data. Gremlin scripts are currently processed using Groovy, so you basically are just using Java APIs remotely from python this way.

Generally speaking, you typically use Gremlin Console for approach 1. Folks tend to use it for administrative tasks like loading data, establishing schemas, creating indices, etc. You can read more about how to send scripts in Python for approach 2 here: https://tinkerpop.apache.org/docs/current/reference/#gremlin-python-scripts
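
A minimal sketch of approach 2 from Python. It assumes the server yaml exposes the graph under the name graph (as the default configuration does) and that client behaves like gremlin_python.driver.client.Client. The helper names create_index and indexed_keys are made up for illustration. The scripts refer to the server's existing graph variable instead of calling TinkerGraph.open(), so the index should land on the graph the server already hosts rather than on a fresh, empty one:

```python
# Sketch only: `client` is expected to offer submit(script, bindings),
# like gremlin_python.driver.client.Client does.

def create_index(client, key):
    """Ask the server to index `key` on the vertices of its hosted graph."""
    # `graph` is resolved on the server. It is assumed to match a key under
    # "graphs" in the server yaml; adjust the name if yours differs.
    script = "graph.createIndex(key, Vertex.class)"
    # The key travels as a binding, so the script text itself stays constant.
    client.submit(script, {"key": key}).all().result()


def indexed_keys(client):
    """Return whatever vertex index keys the server reports, as a list."""
    script = "graph.getIndexedKeys(Vertex.class)"
    return list(client.submit(script).all().result())
```

With a running server this might be called as create_index(Client('ws://localhost:8182/gremlin', 'g'), 'userId'), followed by indexed_keys(...) to check that 'userId' is reported.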

qfelOP•10mo ago

The second approach was my goal, but I am not sure whether client.submit(...) from the first code fragment worked correctly, creating an index that is used later in Python's traversal. Do I need to assign g the same way I assigned TinkerGraph to graph? In other words: how do I expose such a remotely assigned graph to Python?

spmallette•10mo ago

graph is exposed as part of the "graphs" configuration in the server yaml file: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server.yaml#L23

spmallette•10mo ago

graph could be anything you name it in that file. it's just the unique key connected to that specified configuration

qfelOP•10mo ago

Hmm, but afaik Python can only connect via .with_remote(), so I can't call with_graph() directly from Python. Is with_remote(conn) aware of the index on the graph?

spmallette•10mo ago

creating an index needs to be done just once with TinkerGraph, then you load your data and it will use the index for any future Gremlin queries you run on it (where it can, of course). Since TinkerGraph is an in-memory graph, the index will be destroyed once the process exits. You would want to create the index each time you spin TinkerGraph back up and load data back into it.

let me try to explain again. i'll be more specific (i don't think i did a good job with my two options in retrospect)

spmallette•10mo ago

just do this...in your Gremlin Server yaml file you have a pointer to a startup script: https://github.com/apache/tinkerpop/blob/master/gremlin-server/conf/gremlin-server-modern.yaml#L29

spmallette•10mo ago

note that the startup script is just groovy and "onStartUp" it loads the "modern" data into the graph: https://github.com/apache/tinkerpop/blob/master/gremlin-server/scripts/generate-modern.groovy#L28

Solution

spmallette•10mo ago

if you simply edit that line of code to create your index and load your vote data, every time you start Gremlin Server it will have all of that set up and ready to go.

spmallette•10mo ago

you will further note that a little way further down in that startup script you see the creation of g, which is what your python code will reference when it connects to send Gremlin queries:

spmallette•10mo ago

https://github.com/apache/tinkerpop/blob/master/gremlin-server/scripts/generate-modern.groovy#L33

qfelOP•10mo ago

Okay i created conf/gremlin-server-votes.yaml

...
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/sample-votes.groovy]}}}}
...

and scripts/sample-votes.groovy

...
globals << [hook : [
  onStartUp: { ctx ->
    ctx.logger.info("Loading 'VOTES' graph data.")
    org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
    graph.createIndex('userId', Vertex.class)
  }
] as LifeCycleHook]
...

I run it using gremlin-server conf/gremlin-server-votes.yaml. Now, how do I connect to such a server from Gremlin Console? Is it

gremlin> :remote connect tinkerpop.server conf/remote.yaml

and then

graph = TinkerGraph.open()

? graph.getIndexedKeys(Vertex) doesn't return anything D: It is confusing that such a script does globals << [g : traversal().withEmbedded(graph)], yet the client (console or other connection) still needs to assign g. Also, I can't locate remote.yaml

spmallette•10mo ago

this part of the script:

org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory.generateModern(graph)
graph.createIndex('userId', Vertex.class)

is loading TinkerGraph with the "modern" graph, i assume you don't want that. i think you just want to do:

graph.createIndex('userId',Vertex.class)

// load the wiki vote data here

then start up the server, connect with gremlin-python and query your graph. if for some reason you want to connect with Gremlin Console using:

gremlin> :remote connect tinkerpop.server conf/remote.yaml

you can do that too. that will let you send Gremlin scripts to that graph as well. that approach goes back to my earlier "approach 1"

it is confusing that such script does globals << [g : traversal().withEmbedded(graph)] yet client (console or other connection) still needs to assign g

well, that script is local to the server so i think it makes sense in that context 🤔

qfelOP•10mo ago

it is only for testing purposes; ultimately I want to load wiki votes in Python and not in a Groovy script. Yet it doesn't find the index when opening the graph

spmallette•10mo ago

also i can't locate remote.yaml

that should be right in the conf/ directory of your Gremlin Console home.

sorry, not sure what you mean by "find the index" - could you clarify a bit please?

qfelOP•10mo ago

I think there must be another way to open the graph that I prepared in the server script. Connecting to the server and doing TinkerGraph.open() opens a new graph, where there is neither the modern graph nor any indexes (TinkerGraph.open().getIndexedKeys(Vertex) doesn't return anything)

spmallette•10mo ago

oh, maybe i know the confusion. so when you do this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml

all that does is create a connection to the running Gremlin Server that is hosting your graph. you get output like this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182

what's the next thing you type at the prompt after that? if it's this:

gremlin> TinkerGraph.open().getIndexedKeys(Vertex)

then what you've done is created a new TinkerGraph locally in the Gremlin Console and would thus have no indices

qfelOP•10mo ago

and there is no graph or g exported directly from the server. how do I get them?

spmallette•10mo ago

could you copy/paste your console session text and what errors/output you're getting?

qfelOP•10mo ago

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.gephi
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> graph
No such property: graph for class: groovysh_evaluate
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin> g
No such property: g for class: groovysh_evaluate
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin>

spmallette•10mo ago

yep. so, Gremlin Console is a REPL and its execution is local. that first command creates a connection, but the next two commands just execute locally, and there is no graph or g defined there. but if you :submit the command to the server over the current :remote, it won't execute locally; the script is sent to the server instead. you can do that like:

gremlin> :submit graph
// or more commonly
gremlin> :> graph

or even better, what most folks do, is flip the console into "remote mode" by immediately setting it up like this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[0aff5c18-335c-4a64-80f6-cf5301353bb8] - type ':remote console' to return to local mode
gremlin> graph

qfelOP•10mo ago

oohh okay :goodgremlin:

spmallette•10mo ago

sorry - it's not as intuitive as it could be. remoting came later for Gremlin Console: it was a local REPL first and we built in remoting after the fact. it works, but sometimes folks don't see how it works immediately

qfelOP•10mo ago

okay so i think that index works fine and it's loaded in python but it takes long to insert those vertices D: how long should it take normally?

qfelOP•10mo ago

thanks for your help :goodgremlin:

spmallette•10mo ago

glad it's working. it shouldn't take too long to load wiki vote. i haven't done it in a while though. how long is it taking?

qfelOP•10mo ago

~70 seconds

spmallette•10mo ago

that could be about right for 7,115 vertices and 103,689 edges and remote operations. the code example isn't meant to be extremely efficient in how it loads the data. it's more meant as an intro to get started.

qfelOP•10mo ago

is there a demo that focuses on loading data efficiently?

spmallette•10mo ago

the thing about efficient loading is that it is highly graph specific. it's not really an issue with TinkerPop because each graph has its own way to optimize loading. with Amazon Neptune you might go with the CSV loader. For JanusGraph you might use OLAP/Spark. So in that way it's hard to offer general guidelines.

For TinkerGraph and loading from python by sending a script with client.submit() explicitly, i suppose the first thing i'd do is parameterize the scripts to take advantage of the cache described here: https://tinkerpop.apache.org/docs/current/reference/#parameterized-scripts

You might also try batching requests. the example for the wiki vote load basically makes three requests per edge if it's executed as it is. you could reduce that in a variety of ways. an easy one is to load all the vertices first, since there's only about 7000 of them, and hold their references in a cache client-side. then reference them as needed as you iterate through the 100,000 edges rather than doing a "get or create" request for each in/out vertex on every edge.
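
One way these suggestions might be combined is sketched below; it is an illustration under stated assumptions, not a tuned loader. The names parse_votes, batches, load_votes, ADD_USERS and ADD_VOTES are hypothetical, and submit stands for any callable that sends a script plus bindings to the server, for example a thin wrapper around client.submit(). All vertices are created first, then edges are sent in batches. As a variation on the client-side cache, each edge looks its two vertices up on the server by userId, which is where the index is meant to help. The two Groovy scripts are constant strings with the data passed as bindings, in the spirit of the parameterized-scripts advice:

```python
from itertools import islice

# Constant script text; only the bindings change between requests.
ADD_USERS = "ids.each { g.addV('user').property('userId', it).iterate() }"
ADD_VOTES = (
    "pairs.each { p -> "
    "g.V().has('user', 'userId', p[0]).as('a')"
    ".V().has('user', 'userId', p[1])"
    ".addE('votesFor').from('a').iterate() }"
)


def parse_votes(lines, header_lines=4):
    """Return (unique user ids in first-seen order, [from_id, to_id] pairs)."""
    ids, pairs = {}, []
    for line in islice(lines, header_lines, None):
        fields = line.split()  # splits on tabs and drops the trailing newline
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        ids[fields[0]] = None
        ids[fields[1]] = None
        pairs.append(fields)
    return list(ids), pairs


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def load_votes(submit, lines, batch_size=500):
    """Send all vertices, then all edges, in batches; return the request count."""
    ids, pairs = parse_votes(lines)
    requests = 0
    for chunk in batches(ids, batch_size):
        submit(ADD_USERS, {"ids": chunk})
        requests += 1
    for chunk in batches(pairs, batch_size):
        submit(ADD_VOTES, {"pairs": chunk})
        requests += 1
    return requests
```

With gremlin-python, submit could be lambda script, bindings: client.submit(script, bindings).all().result(), and lines could be the open wiki-Vote.txt file. For the 7,115 vertices and 103,689 edges mentioned above, a batch size of 500 would mean 15 + 208 = 223 requests instead of roughly 311,067 (three per edge). The batch size is a guess and would need adjusting to the server's timeout and message size limits.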
