Apache TinkerPop


Tinkerpop Server OOM

Hi Tinkerpop team, I'm trying to make sense of this OOMing that seems to consistently occur in my environment over the course of usually a couple hours. Attached is a screenshot of the JVM GC behavior metrics showing before & after a GC. It's almost like the underlying live memory continues to grow but I'm not sure why....
Solution:
Sorry for the delayed response. I'll try to take a look at this soon. But for now, I just wanted to point out that SingleTaskSession and the like are part of the UnifiedChannelizer. From what I remember, the UnifiedChannelizer isn't quite production ready, and in fact is being removed in the next major version of TinkerPop. We can certainly still make bug/performance fixes to this part of the code for 3.7.x though.

Good CLI REPL allowing unlabeled edges?

Is there another tool like Gremlin with a REPL but perhaps overall simpler? I’m mainly looking for the ability to make labeled nodes and unlabeled directed binary edges (arrows) between nodes. (On the other hand, I can use a generic label for every level in my Gremlin graph, I guess.)...
Solution:
i think the recommendation would be to do as you suggested at the end of your question and just use default labels and ignore them in Gremlin, like g.V().out() as opposed to g.V().out('default'). speaking more to your questions, i'm not sure what other graph frameworks you might use. i could be wrong, but i think NetworkX lets you create label-less graph elements: https://networkx.org/
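For what it's worth, a minimal sketch of that generic-label workaround (the 'node' and 'edge' labels here are just placeholders):

```
// give every vertex and edge the same default label and simply never filter on it
g.addV('node').property('name', 'a').as('a').
  addV('node').property('name', 'b').as('b').
  addE('edge').from('a').to('b').iterate()

g.V().has('name', 'a').out()   // follows the outgoing edge without ever mentioning a label
```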

Best practices for local development with Neptune.

I would like to use a local Gremlin server with TinkerGraph for local development, and then deploy changes to Neptune later. However, there are several differences between TinkerGraph and Neptune that impact the portability of the code. The most important one is probably the fact that in TinkerGraph vertex and edge ids are numeric, but they are strings in Neptune. Also, I think there are some differences in how properties are handled if the cardinality is a list. What is the recommended workflow to minimize discrepancies between my local environment and Neptune?...
Solution:
There's a blog post here that contains some of the details on what properties you can change in TinkerGraph to get close: https://aws.amazon.com/blogs/database/automated-testing-of-amazon-neptune-data-access-with-apache-tinkerpop-gremlin/ It's unlikely that you'll find anything that emulates things like the result cache, lookup cache, full-text-search, features, etc.
I would be curious to hear what the needs are for local dev....
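For reference, a configuration along the lines of what that blog post covers might look like this in Gremlin Console (the property names are TinkerGraph's documented settings; treat this as a starting point rather than full Neptune parity):

```
conf = new BaseConfiguration()
conf.setProperty('gremlin.tinkergraph.vertexIdManager', 'ANY')                   // accept string ids
conf.setProperty('gremlin.tinkergraph.edgeIdManager', 'ANY')
conf.setProperty('gremlin.tinkergraph.defaultVertexPropertyCardinality', 'set')  // Neptune defaults to set cardinality
graph = TinkerGraph.open(conf)
g = graph.traversal()
g.addV('person').property(T.id, 'vertex1')   // string ids now behave more like they do in Neptune
```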

Sequential edge creation between streamed vertices

I would like to create an edge between vertices as they are streamed in sequence from a traversal. I want to connect each vertex to the next one in the stream, like a linear chain of vertices with a next edge. For example, g.V().hasLabel("person").values("name") produces: ``` ==> josh...
Solution:
i think that's about as good as most approaches. we're missing a step that could simplify this code though. i've long wanted a partition() step so that you could change all that code to just:
```
g.V().hasLabel('person').
  partition(2).
  addE('next').from.......
```
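Until a step like that exists, one workable client-side sketch from Gremlin Console (assuming the 'person' label and a 'next' edge label from the question) is to pull the ordered stream into a list and link neighbours pairwise:

```
// pull the stream into an ordered list, then chain consecutive vertices with 'next' edges
people = g.V().hasLabel('person').toList()
(0..<people.size() - 1).each { i ->
  g.addE('next').from(people[i]).to(people[i + 1]).iterate()
}
```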

[Bug?] gremlinpython hangs or does not recover connections after a connection error occurs

Hello, TinkerPop team. I am struggling to avoid problems after a connection error occurs, and I now suspect it might be caused by a bug in gremlinpython...
Solution:
What you're noticing here kind of boils down to how connection pooling works in gremlin-python. The pool is really just a queue that the connection adds itself back to after either an error or a success, but it's missing some handling for the scenarios you pointed out. One of the main issues is that the pool itself can't determine if a connection is healthy or if it is unhealthy and should be removed from the pool. I think you should go ahead and make a Jira for this. If it's easier for you, I can help you make one that references this post. I think the only workaround right now is to occasionally open a new Client to create a new pool of connections when you notice some of those exceptions....

Vertex hashmaps

Hi, I'm looking to copy subgraphs; if there are better practices for this in general, please let me know. I'm currently looking at emitting a subtree, then creating new vertices, storing a mapping of the original to the copy, and reusing this mapping to build out the relationships for the copied vertices. I'm not sure how I should be doing this; currently I'm trying to use the aggregate step to store the original/copy pairs, but I'm not sure how to select nodes from this in future steps....
Solution:
since you tagged this question with javascript i think that aggregate() is probably your best approach. in java, you would probably prefer subgraph() because it gives you a Graph representation which you could in turn run Gremlin on and as a result is quite convenient. we hope to see better support for subgraph() in javascript (and other language variants) in future releases.
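For readers coming from Java/Groovy, a sketch of that subgraph() approach (the 'person'/'knows' labels, the 'marko' starting point, and the depth are placeholders):

```
// extract a connected neighbourhood into an in-memory TinkerGraph, then traverse the copy
sub = g.V().has('person', 'name', 'marko').
        repeat(__.bothE('knows').subgraph('sg').otherV()).times(2).
        cap('sg').next()
sg = sub.traversal()
sg.V().count()   // any Gremlin can now run against the copied subgraph
```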

Benchmarking

Hi everyone, how do I benchmark with Gremlin?
Solution:
that's a fairly broad question, so i'll give a broad answer. one of the nice things about TinkerPop is that it lets you connect to a lot of different graph databases with the same code, so it does allow you to compare performance of different graph databases. that said, doing a good benchmark is still a bit hard as it's not enough to just use Gremlin to generate a random graph and issue a few queries. among other things, a critical step is to gain a decent understanding of the workings of the gr...
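To make that a bit more concrete, a small sketch of the kind of measurement you can do right from Gremlin Console (labels are placeholders): profile() shows where time goes inside a single traversal, while the console's clock() helper averages wall-clock time over several runs.

```
g.V().hasLabel('person').out('knows').count().profile()
clock(100) { g.V().hasLabel('person').out('knows').count().next() }
```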

How to improve Performance using MergeV and MergeE?

I made an implementation similar to this: g.mergeV([(id): 'vertex1']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).mergeV([(id): 'vertex2']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']).mergeV([(id): 'vertex3']).option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).option(onMatch, ['updated_at': 'value2']) So I'm sending 2 requests to Neptune: the first one with 11 vertices and the second with 10 edges, and doing a performance test against Neptune. The duration of the process for this amount of content is around 200ms-500ms. Is there a way to improve this query to be faster? For the connection I'm using gremlin = client.Client(neptune_url, 'g', transport_factory=lambda: AiohttpTransport(call_from_event_loop=True), message_serializer=serializer.GraphSONMessageSerializer()) and I send the query with gremlin.submit(query)...
Solution:
In general, the method to get the best write performance/throughput on Neptune is to both batch multiple writes into a single request and then do multiple batched writes in parallel. Neptune stores each atomic component of the graph as separate records (node, edge, and property). For example, if you have a node with 4 properties, that turns into 5 records in Neptune. A batched write query with around 100-200 records is a sweet spot that we've found in testing. So issuing queries with that many records and running those in parallel should provide better throughput. Conditional writes will slow things down, as additional locks are being taken to ensure data consistency. So writes that use straight addV(), addE(), property() steps will be faster than using mergeV() or mergeE(). The latter can also incur more deadlocks (exposed in Neptune as ConcurrentModificationExceptions). So it is also good practice to implement exponential backoff and retries whenever doing parallel writes into Neptune....
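As a hedged illustration of the unconditional, batched style described above (ids and property values are placeholders): each vertex below is 1 node record plus 2 property records, so roughly 40-60 such vertices per request lands in the 100-200 record sweet spot.

```
g.addV('Person').property(T.id, 'vertex1').property('property1', 'value1').property('updated_at', 'value2').
  addV('Person').property(T.id, 'vertex2').property('property1', 'value1').property('updated_at', 'value2').
  addV('Person').property(T.id, 'vertex3').property('property1', 'value1').property('updated_at', 'value2').
  iterate()
```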

Why is T.label immutable and do we have to create a new node to change a label?

We cannot do g.V('some label').property(T.label, 'new label').iterate() ? Is this correct? Thank you
Solution:
you have a few questions here @Julius Hamilton
Why is T.label immutable
i'm not sure there's a particular reason except to say that many graphs have not allowed that functionality so TinkerPop hasn't offered a way to do it.
and do we have to create a new node to change a label?...
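Since the label itself can't be changed, the usual pattern is to clone the vertex under the new label. A minimal console sketch (placeholder id and label; edge migration is left as a comment):

```
old  = g.V('v1').next()
copy = g.addV('newLabel').next()
// copy each property from the original onto the new vertex
g.V(old).properties().toList().each { p ->
  g.V(copy).property(p.key(), p.value()).iterate()
}
// edges still need to be re-created against 'copy' before dropping the original:
// g.V(old).drop().iterate()
```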

Simple question about printing vertex labels

I am creating a graph in the Gremlin CLI by doing graph = TinkerGraph.open(), g = graph.traversal(), g.addV("somelabel"). I can confirm a vertex was created. I can do g.V().valueMap(true) and it shows ==>[id:0,label:documents]. But I so far do not know how to print information about a vertex via its index. I have tried g.V(0) but it doesn't print anything.
Solution:
By default, IDs are stored as longs. You likely need to use g.V(0L) in Gremlin Console to return the vertex that you created.

Defining Hypergraphs

I want to create a software system where a person can create labeled nodes, and then define labeled edges between the nodes. However, edges also count as nodes, which means you can have edges between edges and edges, edges and nodes, edges-between-edges-and-nodes and edges-between-nodes-and-nodes, and so on. This type of hypergraph is described well here in this Wikipedia article:
One possible generalization of a hypergraph is to allow edges to point at other edges. There are two variations of this generalization. In one, the edges consist not only of a set of vertices, but may also contain subsets of vertices, subsets of subsets of vertices and so on ad infinitum. In essence, every edge is just an internal node of a tree or directed acyclic graph, and vertices are the leaf nodes. A hypergraph is then just a collection of trees with common, shared nodes (that is, a given internal node or leaf may occur in several different trees). Conversely, every collection of trees can be understood as this generalized hypergraph....
Solution:
I always understood that back in the day Marko and crew decided that hypergraphs can be modeled by a property graph. You have to stick a vertex in the middle that represents the hyper edge. This leaves the query language without any first-class constructs for navigating hyper edges, but everything is reachable. Another problem would be performance: the more abstraction away from the implementation on disk, the slower the graph becomes....
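A tiny sketch of that "vertex in the middle" modeling (labels are placeholders): the hyperedge is reified as a vertex, so an edge "about" another edge is just an ordinary edge pointing at that hyperedge vertex.

```
g.addV('node').property('name', 'a').as('a').
  addV('node').property('name', 'b').as('b').
  addV('hyperedge').property('name', 'ab').as('e1').   // the reified hyperedge
  addE('member').from('e1').to('a').
  addE('member').from('e1').to('b').iterate()
```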

JanusGraph AdjacentVertex Optimization

Hiya, I'm wondering if anyone has any advice on how to inspect the provider-side optimizations being applied to my Gremlin code by JanusGraph. Currently when I call explain() I get the following output. ``` Original Traversal [GraphStep(vertex,[]), HasStep([plabel.eq(Person)])@[a], VertexStep(OUT,vert...
Solution:
TinkerPop applies all optimization strategies to all queries (including JanusGraph internal optimizations). However, JanusGraph skips some of the optimizations as it sees necessary. We don't currently store information on whether an optimization strategy modified any part of the query or was simply skipped (potential feature request). Thus, the way I would test whether the optimization strategy actually makes any changes is to debug the query with a breakpoint placed in the relevant optimization strategy. I.e. in your case I would place a breakpoint here: https://github.com/JanusGraph/janusgraph/blob/c9576890b5e9dc48676ccc16a58552b8a665e5f0/janusgraph-core/src/main/java/org/janusgraph/graphdb/tinkerpop/optimize/strategy/AdjacentVertexOptimizerStrategy.java#L58C13-L58C28 If this part is triggered during your query execution, the optimization is being applied in this case....

Efficient degree computation for traversals of big graphs

Hello, We're trying to use Neptune with Gremlin for a fairly big (XX m nodes) graph and our queries usually have to filter out low-degree vertices at some point in the query, for both efficiency and product reasons. According to the profiler, this operation takes the brunt of our computation time. At the same time, the degree is something that could be pre-computed on the database (we don't even need it to be 100% accurate; a "recent" snapshot computation would be good enough), which would significantly speed up the query. Is there anyone here who has some trick up their sleeve for degree computation that would work well, either as an ad-hoc snippet for a query or as a nice way to precompute it on the graph? ...
Solution:
I would first mention that the profile() step in Gremlin is different than the Neptune Profile API. The latter is going to provide a great deal more info, including whether or not the entire query is being optimized by Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html If you have 10s of millions of nodes, you could use Neptune Analytics to do the degree calculations. Then extract the degree properties from NA, delete the NA graph, and bulk load those values back into NDB. We're working to make this round-trip process more seamless. But it isn't too hard to automate in the current form....
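As a rough illustration of the precompute route (the label, property name, and naive loop are placeholders; at tens of millions of vertices you would batch and parallelize this, or use the Neptune Analytics round trip described above):

```
// materialize an approximate degree as a property, then filter on it at query time
g.V().hasLabel('person').id().toList().each { vid ->
  deg = g.V(vid).both().count().next()
  g.V(vid).property('degree', deg).iterate()
}
g.V().hasLabel('person').has('degree', gt(100)).limit(10)   // cheap filter on the snapshot value
```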

Gremlin query to order vertices with some locked to specific positions

I'm working with a product catalog in a graph database using Gremlin. The graph structure includes: 1. Product vertices...
Solution:
you can play tricks like this to move the 0 index to last place, but that still leaves the 7 one row off ``` gremlin> g.V().hasLabel("Category"). ......1> inE("belongsTo").as('a').outV(). ......2> path()....

Is it possible to configure SSL with PEM certificate types?

Hi all, I'm new to this group and currently working on getting an implementation of Gremlin (Aerospike Graph) to listen over SSL. The certificates we get from our provider's API are only served in PEM format. It appears, according to the documentation, that the keyStoreType and trustStoreType must be either JKS or PKCS12 format: https://tinkerpop.apache.org/javadocs/current/full/org/apache/tinkerpop/gremlin/server/Settings.SslSettings.html Is this true? Is there any way for us to configure SSL with PEM format certificates?...
Solution:
Hi @joshb, am I correct in assuming you are using the Java driver to connect to Aerospike? The Java driver uses the JSSE keyStore and trustStore, which as far as I understand does not support the PEM format. You may be able to use a 3rd party tool such as openssl to convert from PEM to PKCS12 (https://docs.openssl.org/1.1.1/man1/pkcs12/). Perhaps @aerospike folks may have more direct recommendations for driver configuration....

Query works when executed in console but not in javascript

``` const combinedQuery = gremlin.V(profileId) .project('following', 'follows') .by( __.inE('FOLLOWS').outV().dedup().id().fold()...

Very slow regex query (AWS Neptune)

We have a query that searches a data set of about ~400,000 vertices, matching properties using a case insensitive TextP.regex() expression. We are observing very bad query performance; even after several other optimizations, it still takes 20-45 seconds, often timing out. Simplified query: ``` g.V()...
Solution:
When Neptune stores data it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which is optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact-match lookups, so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.
To get a performant query for partial text matches, the suggestion is to use the Full Text Search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html), which integrates with OpenSearch to provide robust full-text searching capabilities within a Gremlin query...
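Roughly what that integration looks like per the linked docs (the endpoint and property values are placeholders; check the docs for the exact query hints):

```
g.withSideEffect('Neptune#fts.endpoint', 'https://your-opensearch-endpoint').
  V().has('name', 'Neptune#fts mark*')   // the partial match is delegated to OpenSearch
```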

How can we extract values only

"Latitude", { "@type": "g:Double", "@value": 45.2613104 },...
Solution:
If I understand this correctly, you are first trying to take the result of a Gremlin query that returns Latitude and Longitude (like in the initial post you made), and use those values in the math() step that calculates the Haversine formula (your latest post). If that is the case, you have two options.
1. You should combine this into one Gremlin query. You can save the results of the Latitude and Longitude to variables or use them in a by() modulator to the math() step. Assuming that those values are properties on a vertex called 'lat' and 'lon', it would look something like g.V().project('Latitude', 'Longitude').by('lat').by('lon').math(...). You would replace the ... in the math() step with the Haversine formula (a simplified sketch follows below).
2. If you want to keep these as two separate queries, then you should use one of the Gremlin Language Variants (GLVs), which are essentially drivers that will automatically deserialize the result into the appropriate type so you don't have to deal with the GraphSON (which is what your initial post shows). Read triggan's answer above for more details about that....
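A simplified sketch of option 1, mirroring the by()-modulated math() pattern from the TinkerPop docs (the ids and the 'lat' key are placeholders, and the expression is not the full Haversine formula):

```
g.V('city1').as('a').V('city2').as('b').
  math('a - b').by('lat')   // 'a' and 'b' resolve to each vertex's 'lat' value via by()
```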

How to speed up gremlin query

Hi, I am working with JanusGraph and my query is taking a while to execute (around 2.8 seconds), but I would like it to be faster. I read that I should create a composite index to improve speed and performance or something of that sort, but I am unfamiliar with how to do that in Python. Here is my query: g.V().has("person", "name", "Bob").outE("knows").has("weight", P.gte(0.5)).inV().values("name").toList() What my query does is find all the nodes that Bob has a "knows" relationship with, as long as the weight of the edge to those nodes is >= 0.5. Bob is connected to around 600 nodes with the "knows" relationship. It's fairly slow and takes 2.5-2.8 secs to complete....
Solution:
I read that I should create a composite index to improve speed and performance or something of that sort, but I am unfamiliar with how to do that in Python.
as a point of clarification around indices, you wouldn't likely do that step with python. you typically use gremlinpython just to write Gremlin. For index management, you need to use JanusGraph's APIs directly. often those are just commands you would execute directly in Gremlin Console against a JanusGraph instance....
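For reference, a hedged sketch of the JanusGraph management API you would run from Gremlin Console against an open JanusGraph instance (index and key names are placeholders; existing data may also need a reindex, per the JanusGraph docs):

```
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
person = mgmt.getVertexLabel('person')
mgmt.buildIndex('personByName', Vertex.class).
     addKey(name).indexOnly(person).buildCompositeIndex()
// a vertex-centric index on 'weight' for 'knows' edges can also help the gte(0.5) filter:
// mgmt.buildEdgeIndex(mgmt.getEdgeLabel('knows'), 'knowsByWeight', Direction.BOTH, Order.desc, mgmt.getPropertyKey('weight'))
mgmt.commit()
```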

logging and alerting inside a gremlin step

I am trying to add a log statement inside a step but I am getting this error:
ResponseError: Server error: null (599)
This is the step in question:
// Log the creation of a shell flight
.sideEffect(() => logger.info(`Shell flight created for paxKey ${paxKey} with flightId ${flightLegID}`))
Solution:
Since you are getting a "Server error" I assume you are using Gremlin in a remote context. I'd further assume you are not sending a script to the server, but are using bytecode or are using a graph like Amazon Neptune. Those assumptions all point to the fact that you can't use a lambda that way in remote contexts. The approach you are using to write your lambda is for embedded use cases only (i.e. where the query is executed in the same JVM where it was created). If you want to send a lambda remotely you would need to have a server that supports them (e.g. Neptune does not, but Gremlin Server with Groovy ScriptEngine processing does) and then follow these instructions: https://tinkerpop.apache.org/docs/current/reference/#gremlin-java-lambda The other thing to consider is that the lambda will be executed remotely, so it might not know what "logger" is and the log message will appear on the server, not the client. ...
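A hedged sketch of the script-based lambda approach from the linked docs, for servers that do evaluate Groovy lambdas (e.g. Gremlin Server with the Groovy ScriptEngine; Neptune does not). The script body executes on the server, so any logging it does ends up in the server's log, not the client's:

```
g.V().hasLabel('person').
  map(Lambda.function("it.get().value('name')"))   // the string is compiled and run server-side
```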