Apache TinkerPop

Apache TinkerPop is an open source graph computing framework and the home of the Gremlin graph query language.

[Bug?] gremlinpython hangs or does not recover connections after a connection error

Hello, TinkerPop team. I am struggling to avoid problems after a connection error occurs, and I now suspect the cause may be a bug in gremlinpython... ...
Solution:
What you're noticing here boils down to how connection pooling works in gremlin-python. The pool is really just a queue that a connection adds itself back to after either an error or a success, but it's missing some handling for the scenarios you pointed out. One of the main issues is that the pool itself can't determine whether a connection is healthy or whether it is unhealthy and should be removed from the pool. I think you should go ahead and make a Jira for this. If it's easier for you, I can help you make one that references this post. I think the only workaround right now is to occasionally open a new Client to create a new pool of connections when you notice some of those exceptions...
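The workaround described above can be sketched as a small wrapper that throws away the client (and therefore its whole connection pool) when a connection error surfaces, then retries on a fresh one. This is a hedged sketch, not gremlin-python API: `make_client` is a placeholder factory (with gremlinpython you would pass one that returns `gremlin_python.driver.client.Client`) and `ConnectionError` stands in for the driver's actual connection exceptions.

```python
# Sketch of the suggested workaround: recreate the client (and its pool)
# when a connection error occurs, then retry the query once.
# `make_client` and `ConnectionError` are stand-ins, not gremlinpython API.

class PoolRecreatingClient:
    def __init__(self, make_client):
        self._make_client = make_client
        self._client = make_client()

    def submit(self, query):
        try:
            return self._client.submit(query)
        except ConnectionError:
            # Drop the possibly poisoned pool and retry once on a new one.
            try:
                self._client.close()
            except Exception:
                pass
            self._client = self._make_client()
            return self._client.submit(query)
```

In a real service you would also bound how often recreation happens, so a flapping server does not turn every request into a reconnect.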

Vertex hashmaps

Hi, I'm looking to copy subgraphs; if there are better practices for this in general, please let me know. I'm currently looking at emitting a subtree, then creating new vertices, storing a mapping of the original to the copy, and reusing this mapping to build out the relationships for the copied vertices. I'm not sure how I should be doing this; currently I'm trying to use the aggregate step to store the original/copy pairs, but I'm not sure how to select nodes from this in future steps...
Solution:
since you tagged this question with javascript i think that aggregate() is probably your best approach. in java, you would probably prefer subgraph() because it gives you a Graph representation which you could in turn run Gremlin on and as a result is quite convenient. we hope to see better support for subgraph() in javascript (and other language variants) in future releases.
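The two-pass pattern the question describes (clone vertices while recording an original-to-copy mapping, then rewire edges through that mapping) is what `aggregate()` would hold server-side. Here it is sketched over plain Python dicts and edge triples rather than a live graph, purely to show the shape of the algorithm:

```python
# Two-pass subgraph copy, sketched with plain data structures:
# pass 1 clones each vertex and records original-id -> copy-id;
# pass 2 re-creates edges through that mapping.
def copy_subgraph(vertices, edges, start_id=100):
    mapping = {vid: start_id + i for i, vid in enumerate(vertices)}
    copied_vertices = {mapping[v]: label for v, label in vertices.items()}
    copied_edges = [(mapping[u], lbl, mapping[v]) for (u, lbl, v) in edges]
    return copied_vertices, copied_edges
```

With Gremlin the same idea applies: create the copies first, keep the old/new pairs in a side-effect, then add edges by looking each endpoint up in that side-effect.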

Benchmarking

Hi everyone, how do you benchmark with Gremlin?
Solution:
that's a fairly broad question, so i'll give a broad answer. one of the nice things about TinkerPop is that it lets you connect to a lot of different graph databases with the same code, so it does allow you to compare performance of different graph databases. that said, doing a good benchmark is still a bit hard as it's not enough to just use Gremlin to generate a random graph and issue a few queries. among other things, a critical step is to gain a decent understanding of the workings of the gr...

How to improve Performance using MergeV and MergeE?

I made an implementation similar to this:
```
g.mergeV([(id): 'vertex1']).
    option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).
    option(onMatch, ['updated_at': 'value2']).
  mergeV([(id): 'vertex2']).
    option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).
    option(onMatch, ['updated_at': 'value2']).
  mergeV([(id): 'vertex3']).
    option(onCreate, [(label): 'Person', 'property1': 'value1', 'updated_at': 'value2']).
    option(onMatch, ['updated_at': 'value2'])
```
So I'm sending 2 requests to Neptune: the first with 11 vertices and the second with 10 edges, and I'm running a performance test against Neptune. The duration of the process for this amount of content is around 200ms-500ms. Is there a way to make this query faster? For the connection I'm using
```
gremlin = client.Client(neptune_url, 'g',
    transport_factory=lambda: AiohttpTransport(call_from_event_loop=True),
    message_serializer=serializer.GraphSONMessageSerializer())
```
and I send the query with gremlin.submit(query)...
Solution:
In general, the way to get the best write performance/throughput on Neptune is to batch multiple writes into a single request and then issue multiple batched writes in parallel. Neptune stores each atomic component of the graph (node, edge, and property) as a separate record. For example, a node with 4 properties turns into 5 records in Neptune. A batched write query with around 100-200 records is the sweet spot we've found in testing, so issuing queries with that many records and running them in parallel should provide better throughput. Conditional writes will slow things down, as additional locks are taken to ensure data consistency, so writes that use plain addV(), addE(), and property() steps will be faster than mergeV() or mergeE(). The latter can also incur more deadlocks (exposed in Neptune as ConcurrentModificationExceptions), so it is also good practice to implement exponential backoff and retries whenever doing parallel writes into Neptune...
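The batching-plus-retry pattern described above can be sketched in a few lines. This is a hedged sketch, not Neptune-specific code: `submit` is a placeholder for whatever sends one batched write request, and the 150-record chunk size follows the 100-200 sweet spot mentioned in the answer.

```python
# Chunk records into write batches of ~150, and retry a failed batch with
# exponential backoff (e.g. on a ConcurrentModificationException).
# `submit` is a placeholder for the function issuing one batched write.
import time

def chunk(records, size=150):
    return [records[i:i + size] for i in range(0, len(records), size)]

def submit_with_backoff(submit, batch, retries=5, base_delay=0.1):
    for attempt in range(retries):
        try:
            return submit(batch)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

The batches themselves can then be dispatched in parallel (threads or asyncio), with each worker calling `submit_with_backoff` on its own batch.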

Why is T.label immutable and do we have to create a new node to change a label?

We cannot do g.V('some label').property(T.label, 'new label').iterate()? Is this correct? Thank you
Solution:
you have a few questions here @Julius Hamilton
Why is T.label immutable
i'm not sure there's a particular reason except to say that many graphs have not allowed that functionality so TinkerPop hasn't offered a way to do it.
and do we have to create a new node to change a label?...

Simple question about printing vertex labels

I am creating a graph in the Gremlin Console by doing graph = TinkerGraph.open(), g = graph.traversal(), g.addV("somelabel"). I can confirm a vertex was created, and I can do g.V().valueMap(true) and it shows ==>[id:0,label:documents]. But so far I do not know how to print information about a vertex via its id. I have tried g.V(0) but it doesn't print anything.
Solution:
By default, IDs are stored as longs. You likely need to use g.V(0L) in Gremlin Console to return the vertex that you created.

Defining Hypergraphs

I want to create a software system where a person can create labeled nodes, and then define labeled edges between the nodes. However, edges also count as nodes, which means you can have edges between edges and edges, edges and nodes, edges-between-edges-and-nodes and edges-between-nodes-and-nodes, and so on. This type of hypergraph is described well here in this Wikipedia article:
One possible generalization of a hypergraph is to allow edges to point at other edges. There are two variations of this generalization. In one, the edges consist not only of a set of vertices, but may also contain subsets of vertices, subsets of subsets of vertices and so on ad infinitum. In essence, every edge is just an internal node of a tree or directed acyclic graph, and vertices are the leaf nodes. A hypergraph is then just a collection of trees with common, shared nodes (that is, a given internal node or leaf may occur in several different trees). Conversely, every collection of trees can be understood as this generalized hypergraph....
Solution:
I always understood that back in the day Marko and crew decided that hypergraphs can be modeled by a property graph: you stick a vertex in the middle that represents the hyperedge. This leaves the query language without any first-class constructs for navigating hyperedges, but everything is reachable. Another problem would be performance: the more abstraction away from the on-disk implementation, the slower the graph becomes...
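The vertex-in-the-middle trick above can be sketched concretely. In this hedged sketch (plain dicts standing in for a property graph), a hyperedge becomes an ordinary vertex connected by "member" edges to whatever it joins, and since a hyperedge is itself a vertex, another hyperedge can include it, which is exactly the edges-pointing-at-edges case from the question:

```python
# Model a hyperedge as a vertex with "member" edges to its participants.
# Because the hyperedge is a vertex, other hyperedges can include it.
def add_hyperedge(graph, edge_id, label, members):
    graph["vertices"][edge_id] = label  # the hyperedge is just a vertex
    for m in members:
        graph["edges"].append((edge_id, "member", m))
    return edge_id

g = {"vertices": {"a": "node", "b": "node", "c": "node"}, "edges": []}
e1 = add_hyperedge(g, "e1", "relates", ["a", "b"])
# an "edge between an edge and a node":
add_hyperedge(g, "e2", "annotates", [e1, "c"])
```

In Gremlin terms, each `add_hyperedge` call corresponds to one addV() for the hyperedge vertex plus one addE("member") per participant.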

JanusGraph AdjacentVertex Optimization

Hiya, I'm wondering if anyone has any advice on how to inspect the provider-side optimizations being applied to my Gremlin code by JanusGraph. Currently when I call explain() I get the following output. ``` Original Traversal [GraphStep(vertex,[]), HasStep([plabel.eq(Person)])@[a], VertexStep(OUT,vert...
Solution:
TinkerPop applies all optimization strategies to all queries (including JanusGraph's internal optimizations), but JanusGraph skips some of the optimizations as it sees necessary. We don't currently store information on whether an optimization strategy modified any part of the query or was simply skipped (a potential feature request). Thus, the way I would test whether an optimization strategy actually makes any changes is to debug the query with a breakpoint placed in the relevant optimization strategy. In your case I would place a breakpoint here: https://github.com/JanusGraph/janusgraph/blob/c9576890b5e9dc48676ccc16a58552b8a665e5f0/janusgraph-core/src/main/java/org/janusgraph/graphdb/tinkerpop/optimize/strategy/AdjacentVertexOptimizerStrategy.java#L58C13-L58C28 If this part is triggered during your query execution, the optimization is working in this case...

Efficient degree computation for traversals of big graphs

Hello, we're trying to use Neptune with Gremlin for a fairly big (XX m nodes) graph, and our queries usually have to filter out low-degree vertices at some point in the query for both efficiency and product reasons. According to the profiler, this operation takes the brunt of our computation time. At the same time, the degree computation is something that could be pre-computed on the database (we don't even need it to be 100% accurate; a "recent" snapshot computation would be good enough), which would significantly speed up the query. Does anyone have a trick up their sleeve for degree computation that would work well, either as an ad-hoc snippet in a query or as a nice way to precompute it on the graph? ...
Solution:
I would first mention that the profile() step in Gremlin is different than the Neptune Profile API. The latter is going to provide a great deal more info, including whether or not the entire query is being optimized by Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html If you have 10s of millions of nodes, you could use Neptune Analytics to do the degree calculations. Then extract the degree properties from NA, delete the NA graph, and bulk load those values back into NDB. We're working to make this round-trip process more seamless. But it isn't too hard to automate in the current form....

Gremlin query to order vertices with some locked to specific positions

I'm working with a product catalog in a graph database using Gremlin. The graph structure includes: 1. Product vertices...
Solution:
you can play tricks like this to move the 0 index to last place, but that still leaves the 7 one row off ``` gremlin> g.V().hasLabel("Category"). ......1> inE("belongsTo").as('a').outV(). ......2> path()....

Is it possible to configure SSL with PEM certificate types?

Hi all, I'm new to this group and currently working getting an implementation of Gremlin (Aerospike Graph) to listen over SSL. The certificates we get from our provider's API are only served in PEM format. It appears, according to the documentation that the keyStoreType and trustStoreType either JKS or PKCS12 format: https://tinkerpop.apache.org/javadocs/current/full/org/apache/tinkerpop/gremlin/server/Settings.SslSettings.html Is this true? Is there any way for us to configure SSL with PEM format certificates?...
Solution:
Hi @joshb, am I correct in assuming you are using the Java driver to connect to Aerospike? The java driver uses the JSSE keyStore and trustStore, which as far as I understand does not support the PEM format. You may be able to use a 3rd party tool such as openssl to convert from PEM to PKCS12 (https://docs.openssl.org/1.1.1/man1/pkcs12/). Perhaps @aerospike folks may have more direct recommendations for driver configuration....

Query works when executed in console but not in javascript

``` const combinedQuery = gremlin.V(profileId) .project('following', 'follows') .by( __.inE('FOLLOWS').outV().dedup().id().fold()...

Very slow regex query (AWS Neptune)

We have a query that searches a data set of about ~400,000 vertices, matching properties using a case insensitive TextP.regex() expression. We are observing very bad query performance; even after several other optimizations, it still takes 20-45 seconds, often timing out. Simplified query: ``` g.V()...
Solution:
When Neptune stores data, it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which is optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact-match lookups, so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.
To get a performant query for partial text matches, the suggestion is to use the full-text search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html), which integrates with OpenSearch to provide robust full-text searching capabilities within a Gremlin query...

How can we extract values only

"Latitude", { "@type": "g:Double", "@value": 45.2613104 },...
Solution:
If I understand this correctly, you are first trying to take the result of a Gremlin query that returns Latitude and Longitude (as in your initial post), and use those values in the math() step that calculates the Haversine formula (your latest post). If that is the case, you have two options. 1. You could combine this into one Gremlin query: save the results of the Latitude and Longitude to variables, or use them in a by() modulator to the math() step. Assuming those values are properties on a vertex called 'lat' and 'lon', it would look something like g.V().project('Latitude', 'Longitude').by('lat').by('lon').math(...). You would replace the ... in the math() step with the Haversine formula. 2. If you want to keep these as two separate queries, then you should use one of the Gremlin Language Variants (GLVs), which are essentially drivers that automatically deserialize the result into the appropriate type so you don't have to deal with the GraphSON (which is what your initial post shows). Read triggan's answer above for more details about that...
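For reference, the Haversine formula mentioned above in plain Python, useful as a cross-check for whatever expression you end up building inside math(). Inputs are in degrees; the 6371 km mean Earth radius gives results in kilometers:

```python
# Great-circle distance between two (lat, lon) points via the Haversine
# formula. Inputs in degrees, result in kilometers (mean Earth radius 6371 km).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r=6371.0):
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r * asin(sqrt(a))
```

One degree of longitude at the equator is about 111.2 km, which is a handy sanity check for the math() version too.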

How to speed up gremlin query

Hi, I am working with JanusGraph and my query is taking a while to execute (around 2.8 seconds), but I would like it to be faster. I read that I should create a composite index to improve speed and performance, or something of that sort, but I am unfamiliar with how to do that in Python. Here is my query: g.V().has("person", "name", "Bob").outE("knows").has("weight", P.gte(0.5)).inV().values("name").toList() What my query does is find all the nodes that Bob has the "knows" relationship with, as long as the weight of the edge to those nodes is >=0.5. Bob is connected to around ~600 nodes with the "knows" relationship. It's fairly slow and takes 2.5-2.8 secs to complete...
Solution:
I read that I should create a composite index to improve speed and performance or something of that sort, but I am unfamiliar with how to do that in Python.
as a point of clarification around indices, you wouldn't likely do that step with python. you typically use gremlinpython just to write Gremlin. For index management, you need to use JanusGraph's APIs directly. often those are just commands you would execute directly in Gremlin Console against a JanusGraph instance....
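Since JanusGraph's index management is a Groovy API, the usual route from Python is to build the management commands as a script string and submit it through the driver (e.g. gremlinpython's `Client.submit`). A hedged sketch; the property key ("name") and index name below are illustrative assumptions, and JanusGraph may additionally require the index to be enabled/reindexed before it serves queries:

```python
# Build a JanusGraph composite-index creation script (Groovy) to submit as a
# string from Python. Property key and index name are example values.
def composite_index_script(prop="name", index="byNameComposite"):
    return (
        "mgmt = graph.openManagement()\n"
        f"key = mgmt.getPropertyKey('{prop}') ?: "
        f"mgmt.makePropertyKey('{prop}').dataType(String.class).make()\n"
        f"mgmt.buildIndex('{index}', Vertex.class)."
        "addKey(key).buildCompositeIndex()\n"
        "mgmt.commit()"
    )

script = composite_index_script()
# then, with a gremlinpython Client connected to the server:
# client.submit(script).all().result()
```

Running the same commands interactively in Gremlin Console against the JanusGraph instance, as the answer suggests, is equivalent.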

logging and alerting inside a gremlin step

I am trying to add a log statement inside a step but I am getting an error: ResponseError: Server error: null (599). This is the code:
```
// Log the creation of a shell flight
.sideEffect(() => logger.info(`Shell flight created for paxKey ${paxKey} with flightId ${flightLegID}`))
```
Solution:
Since you are getting a "Server error" I assume you are using Gremlin in a remote context. I'd further assume you are not sending a script to the server, but are using bytecode or a graph like Amazon Neptune. Those assumptions all point to the fact that you can't use a lambda that way in remote contexts. The approach you are using to write your lambda is for embedded use cases only (i.e. where the query is executed in the same JVM where it was created). If you want to send a lambda remotely, you would need a server that supports them (e.g. Neptune does not, but Gremlin Server with Groovy ScriptEngine processing does) and then follow these instructions: https://tinkerpop.apache.org/docs/current/reference/#gremlin-java-lambda The other thing to consider is that the lambda will be executed remotely, so it might not know what "logger" is, and the log message will appear on the server, not the client. ...

Optimizing connection between Python API (FastAPI) and Neptune

Hi guys. I've been working with gremlin-python in my company for the past 4 years, using Neptune as the database. We are running a FastAPI server, where Neptune has been the main database since the beginning. We have always struggled to get good performance out of the API, but recently it has become a more pressing pain, with endpoints taking more than 10s to respond. We took some actions to improve this performance, such as updating the cluster to the latest engine version, and the same for the FastAPI and gremlin-python dependencies. ...
Solution:
There's a lot to unpack here...
1. We state in our docs that t4g.medium instances are really not great for production workloads. We support them for initial development so users can keep costs down, but the amount of resources available, and the fact that they are burstable instances, really constrains their usability. Once you've used up your CPU credits, you're going to get throttled.
2. Neptune's concurrency model is based on instance size and the number of vCPUs per instance. For each vCPU there are two query-execution threads. So on a t4g.medium or an r6g.large instance, there are 2 vCPUs, which means that instance can only be computing 4 concurrent requests at a time. If you need more concurrency, you should look to scale to a larger instance with more vCPUs. If your workload varies over time, you may want to investigate Neptune Serverless, which can automatically scale vertically to meet the needs of the application. There's a good presentation from last year's re:Invent that discusses when Serverless works best and when not to use it: https://youtu.be/xAdWa0Ahiok?si=OeSe-_L3ErcYH-XU...
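The sizing rule above (two query-execution threads per vCPU) lends itself to a quick back-of-the-envelope helper. A trivial sketch; the vCPU counts in the comments are the examples from the answer, not a full instance catalog:

```python
# Neptune concurrency rule of thumb: 2 query-execution threads per vCPU.
import math

def max_concurrent_queries(vcpus):
    # e.g. t4g.medium / r6g.large have 2 vCPUs -> 4 concurrent queries
    return 2 * vcpus

def vcpus_needed(target_concurrency):
    # smallest vCPU count whose thread pool covers the target concurrency
    return math.ceil(target_concurrency / 2)
```

So an API that routinely has 10 in-flight Gremlin requests needs an instance (or reader fleet) providing at least 5 vCPUs, or it will queue.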

Breadth-First Traversal Fact Check

Using Neptune query profiling, I have found that Gremlin queries seem to use a depth-first strategy to search things, and as a result it tends to be both time and resource intensive, especially when what I am looking for is a node just 1 or 2 levels below. To do a breadth-first traversal the following approach has been suggested, but I am not sure if it really does the trick. If my goal is to find the nearest nodes quickly, what would be efficient approaches?...
Solution:
Neptune uses BFS as the default traversal strategy. You can change the method in which a repeat() is executed via the query hint as noted here: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-query-hints-repeatMode.html

How to find the edges of a node that have a weight of x or greater?

I have a graph with nodes that are connected to one another, and weights on the edges that connect these nodes. The query g.V().has("person", "name", "A").out("is friends with").values("name").to_list() returns the names of people with whom person "A" has the "is friends with" relation, but I would like to filter by the weight value. Person A has a weight of 0.9 with person B, and a weight of 0.7 with person C. I would like to only get back the people person A shares an edge with whose weight is greater than 0.8; how can I do that? I am using gremlin-python. I have tried g.V().has("person", "name", "A").out("is friends with").has("weight", gte(0.8)) but I get an error saying NameError: name "gte" is not defined. Did you mean: 'g'?...
Solution:
g.V().has("person", "name", "A").out(... returns vertices; to get at the edges you need to use outE(), something like a_friends = g.V().has("person", "name", "A").outE("is friends with").has("weight", P.gt(0.75)).inV().to_list()...
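Note also that the NameError in the question comes from `gte` not being imported: in gremlin-python the predicates live on `P` (`from gremlin_python.process.traversal import P`), used as `.has("weight", P.gte(0.8))`. To show the filter's semantics without a running server, here is the same edge filter mirrored over plain `(person, friend, weight)` triples standing in for the outgoing "is friends with" edges (a local sketch, not driver code):

```python
# Local mirror of: outE("is friends with").has("weight", P.gte(t)).inV()
# over (person, friend, weight) triples standing in for the edges.
def friends_over(edges, threshold=0.8):
    return [friend for (_person, friend, w) in edges if w >= threshold]

edges = [("A", "B", 0.9), ("A", "C", 0.7)]
```

With the threshold at 0.8, only B survives, matching the behavior the question is after.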

op_traversal P98 Spikes

Hi TinkerPop team! I'm observing these abrupt spikes in my gremlin server P98 metrics in my JanusGraph environment. I've been looking at the TraversalOpProcessor (https://github.com/apache/tinkerpop/blob/master/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/op/traversal/TraversalOpProcessor.java) code over the last couple days for some ideas of what could be causing it but I'm not seeing an obvious smoking gun so figured I'd ask around. The traversal in question is being submitted from Rust via Bytecode using gremlin-rs. Specifically with some additions I've made that added mergeV & mergeE among other things. It's in a PR awaiting the maintainer's review, but that's not really relevant to my question, but over here if you want to see it: https://github.com/wolf4ood/gremlin-rs/pull/214. The GraphSONV3 serializer is what's being used by the library....
Solution:
For anyone else who finds this thread, the things I ended up finding to be issues:
- Cassandra's disk I/O throughput (EBS gp3 is 125MB/s by default; at least for my use case I was periodically maxing that out, and increasing to 250MB/s resolved that apparent bottleneck). So when long sustained writing occurred, 125MB/s was not sufficient.
- Optimizing traversals to use mergeE/mergeV where they had been either older Groovy-script-based evaluations I was submitting or older fold().coalesce(unfold(), ...) style "get or create" vertex mutations...