triggan
ATApache TinkerPop
Created by Wolfgang Fahl on 10/16/2024 in #questions
pymogwai
Yes, referring to porting the Java implementation of TinkerGraph to other runtimes. Not totally sure of the issues involved in doing this. More likely an issue of prioritization. But having TinkerGraph native in a runtime would open the doors for a few things that you can only do with TinkerGraph in Java. For example, the use of subgraph() in Gremlin-Java returns a TinkerGraph object that you can then issue queries against. It's a common pattern when you want to return a subgraph locally (as a cache) and run queries against that locally cached subgraph. Today, if you use subgraph() in a non-Java client, you get back different representations of the subgraph via the different serializers, usually GraphSON or some form of map/JSON.
6 replies
ATApache TinkerPop
Created by Wolfgang Fahl on 10/16/2024 in #questions
pymogwai
I would agree that this is interesting. There have been a number of situations where having TinkerGraph available in other runtimes would be useful. I'm curious why you chose to implement something different versus looking to add TinkerGraph to gremlinpython.
6 replies
ATApache TinkerPop
Created by Alex on 10/10/2024 in #questions
Neptune Cluster Balancing Configuration
Yes, so that is using websockets (although, if you're using Neptune, the connection string should start with wss as Neptune is SSL/TLS only).
5 replies
ATApache TinkerPop
Created by Alex on 10/10/2024 in #questions
Neptune Cluster Balancing Configuration
What are you using to send queries to Neptune? Are you using the gremlin-python client and connecting via websockets? If so, each websocket connection is going to act like a "sticky session": it will connect to the same instance for the life of the connection. The reader endpoint is a DNS endpoint that is configured to resolve to a different read replica approximately every 5 seconds. So depending on when you establish your websocket connections, or if you're just sending http requests, those could all go to the same instance if sent in quick succession. Customers have solved this in a number of ways. Some will create load balancers in front of Neptune read replicas that can more precisely "load balance" requests across the instances. We also created a version of the Gremlin Java client that establishes connection pools across multiple reader instances: https://github.com/aws/neptune-gremlin-client Doing this in Python with the Gremlin Python client is not as straightforward as with the Java client. The Java client has the concept of a "cluster" whereas the Python client does not. So you may need to build a routing mechanism that creates connections to each reader instance directly (each instance has its own instance endpoint) and iterate across those connections in a round-robin fashion if you want even distribution. Totally understand this is a pain and we've been discussing ways to address this.
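If it helps, a rough sketch of that round-robin approach with gremlin-python might look like the following (the reader instance endpoints and the run_read_query() helper are placeholders for illustration, not a tested implementation):
import itertools
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder instance endpoints - each Neptune instance exposes its own endpoint.
reader_endpoints = [
    'wss://reader-instance-1.xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin',
    'wss://reader-instance-2.xxxxxxxx.us-east-1.neptune.amazonaws.com:8182/gremlin',
]

# One long-lived websocket connection (and traversal source) per reader instance.
sources = [traversal().with_remote(DriverRemoteConnection(ep, 'g')) for ep in reader_endpoints]
round_robin = itertools.cycle(sources)

def run_read_query():
    g = next(round_robin)            # rotate to the next reader instance
    return g.V().limit(10).toList()  # any read traversal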
5 replies
ATApache TinkerPop
Created by Max on 9/28/2024 in #questions
Best practices for local development with Neptune.
Many of the differences there are a little opinionated on the part of the person who wrote that blog post. When I create a Gremlin Server docker container to emulate Neptune (as closely as possible), I typically just use something like the following Dockerfile; the comments explain the changes:
FROM tinkerpop/gremlin-server:latest

# Add support for both websockets and http requests
RUN sed -i "s|^channelizer:.*|channelizer: org.apache.tinkerpop.gremlin.server.channel.WsAndHttpChannelizer|" ./conf/gremlin-server.yaml

# Allow for string based IDs
RUN sed -i "s|^gremlin.tinkergraph.vertexIdManager=.*|gremlin.tinkergraph.vertexIdManager=ANY|" ./conf/tinkergraph-empty.properties

# Remove ReferenceElementStrategy for use with graph-explorer - return all properties
RUN sed -i "s|^globals << \[g.*|globals << [g : traversal().withEmbedded(graph).withStrategies()]|" ./scripts/empty-sample.groovy

# Increase thread stack to 2m
ENV JAVA_OPTIONS="-Xss2m -Xms512m -Xmx4096m"
11 replies
ATApache TinkerPop
Created by Max on 10/2/2024 in #questions
Confusing behavior of `select()`.
No description
11 replies
ATApache TinkerPop
Created by Max on 10/2/2024 in #questions
Confusing behavior of `select()`.
This might be where you need to use a where()-by()-by() pattern.
g.withSideEffect("map", [3: "foo", 4: "bar"]).
inject("a", "b", "c", "d").
aggregate(local, "x").
map(select("x").count(local)).as('cnt').
select('map').unfold().select(values).
where('map',eq('cnt')).by(unfold().select(keys)).by()
g.withSideEffect("map", [3: "foo", 4: "bar"]).
inject("a", "b", "c", "d").
aggregate(local, "x").
map(select("x").count(local)).as('cnt').
select('map').unfold().select(values).
where('map',eq('cnt')).by(unfold().select(keys)).by()
==> foo
==> bar
==> foo
==> bar
I don't recall what the issue is with select(select(somekey))
11 replies
ATApache TinkerPop
Created by Max on 9/28/2024 in #questions
Best practices for local development with Neptune.
There's a blog post here that contains some of the details on what properties you can change in TinkerGraph to get close: https://aws.amazon.com/blogs/database/automated-testing-of-amazon-neptune-data-access-with-apache-tinkerpop-gremlin/ It's unlikely that you'll find anything that emulates things like the result cache, lookup cache, full-text-search, features, etc.
I would be curious to hear what the needs are for local dev.
11 replies
ATApache TinkerPop
Created by Julius Hamilton on 9/24/2024 in #questions
Why is T.label immutable and do we have to create a new node to change a label?
"Ok" is of opinion. It's entirely possible and a tradeoff in supporting multiple frameworks and query languages.
20 replies
ATApache TinkerPop
Created by Julius Hamilton on 9/24/2024 in #questions
Why is T.label immutable and do we have to create a new node to change a label?
Neptune has very few constraints. Primarily only that each vertex and edge must have a unique ID (and those IDs must be Strings), each vertex and edge must have a label (though if one isn't defined at creation, a default label of vertex or edge is used), and every edge must have a vertex at each end. Outside of that, there are no constraints on what properties a vertex/edge must have, data types, etc.
20 replies
ATApache TinkerPop
Created by Julius Hamilton on 9/24/2024 in #questions
Why is T.label immutable and do we have to create a new node to change a label?
Also to Dave's point, while Gremlin doesn't allow you to change a label on a node, openCypher does. There are a lot of interoperability aspects that have driven how Neptune supports labels.
Using multiple labels in Gremlin on Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-differences.html#feature-gremlin-differences-labels You can change a label on a vertex/node in openCypher via:
MATCH (n)
WHERE id(n) = '96c90f92-9aef-6f62-14c6-b1882b66188c'
SET n:newuser
REMOVE n:user
RETURN n
(where the old label is user and the new label is newuser) This also brings up the point that openCypher supports the ability to do this, while Gremlin does not. So this is likely a gap that Gremlin needs to address even without addressing it from a schema/constraint standpoint.
20 replies
ATApache TinkerPop
Created by Alex on 9/24/2024 in #questions
How to improve Performance using MergeV and MergeE?
In general, the way to get the best write performance/throughput on Neptune is to batch multiple writes into a single request and then issue multiple batched writes in parallel. Neptune stores each atomic component of the graph (node, edge, and property) as a separate record. For example, if you have a node with 4 properties, that turns into 5 records in Neptune. A batched write query with around 100-200 records is the sweet spot that we've found in testing. So issuing queries with that many records and running those in parallel should provide better throughput. Conditional writes will slow things down, as additional locks are being taken to ensure data consistency. So writes that use straight addV(), addE(), and property() steps will be faster than those using mergeV() or mergeE(). The latter can also incur more deadlocks (exposed in Neptune as ConcurrentModificationExceptions). So it is also good practice to implement exponential backoff and retries whenever doing parallel writes into Neptune.
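As a rough illustration only (not a drop-in implementation), a batched write with exponential backoff and retries in gremlin-python could look something like this; the endpoint, row structure, and batch size are assumptions for the sketch:
import random
import time
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.protocol import GremlinServerError

# Placeholder endpoint
g = traversal().with_remote(
    DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g'))

def write_batch(rows, max_retries=5):
    # Chain the whole batch (aim for ~100-200 records) into a single traversal/request.
    for attempt in range(max_retries):
        try:
            t = g
            for row in rows:
                t = t.addV(row['label']).property('name', row['name'])
            t.iterate()  # one request carries the entire batch
            return
        except GremlinServerError as e:
            # Retry on ConcurrentModificationException with exponential backoff + jitter.
            if 'ConcurrentModificationException' in str(e) and attempt < max_retries - 1:
                time.sleep((2 ** attempt) + random.random())
            else:
                raise

# Running several write_batch() calls in parallel (threads or processes) is what
# provides the throughput gain described above.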
4 replies
ATApache TinkerPop
Created by Julius Hamilton on 9/22/2024 in #questions
Simple question about printing vertex labels
By default, IDs are stored as longs. You likely need to use g.V(0L) in Gremlin Console to return the vertex that you created.
5 replies
ATApache TinkerPop
Created by Julius Hamilton on 9/20/2024 in #questions
Defining Hypergraphs
TBH, this may be easier to do in RDF than in LPG. 🙂 Triples can define relationships between entities (vertices). And you can use named graphs in RDF to wrap a triple into another entity/id. Then those can be entities themselves that can have other relationships. From your example diagram above, I could have:
<S> <P> <O> <G>
<vertex:1> <rdf:type> <type:vertex> <edge:1>
<edge:2> <rdf:type> <type:hyperedge> <edge:1>
<vertex:2> <rdf:type> <type:vertex> <edge:2>
<vertex:3> <rdf:type> <type:vertex> <edge:2>
<vertex:3> <rdf:type> <type:vertex> <edge:3>
<vertex:5> <rdf:type> <type:vertex> <edge:3>
<vertex:6> <rdf:type> <type:vertex> <edge:3>
<vertex:4> <rdf:type> <type:vertex> <edge:4>
<vertex:7> <rdf:type> <type:vertex> <~> #default graph
<edge:1> <rdf:type> <type:hyperedge> <~>
<edge:4> <rdf:type> <type:hyperedge> <~>
<edge:3> <rdf:type> <type:hyperedge> <~>
Good article here that talks about this more, but without using named graphs: https://ontologist.substack.com/p/hypergraphs-and-rdf You can sort of back into LPG from this, but it's just easier (for me, at least) to think of things in terms of triples (vertex-edge-vertex) and a triple being an entity that can also have its own relationships.
6 replies
ATApache TinkerPop
Created by emi on 9/10/2024 in #questions
Efficient degree computation for traversals of big graphs
6 replies
ATApache TinkerPop
Created by emi on 9/10/2024 in #questions
Efficient degree computation for traversals of big graphs
I would first mention that the profile() step in Gremlin is different from the Neptune Profile API. The latter is going to provide a great deal more info, including whether or not the entire query is being optimized by Neptune: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html If you have 10s of millions of nodes, you could use Neptune Analytics to do the degree calculations, then extract the degree properties from NA, delete the NA graph, and bulk load those values back into NDB. We're working to make this round-trip process more seamless, but it isn't too hard to automate in its current form.
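For reference, a minimal sketch of calling the Neptune Profile API through the boto3 neptunedata client (the endpoint and query below are placeholders, and the response shape may vary by engine version):
import boto3

# Placeholder endpoint
client = boto3.client('neptunedata', endpoint_url='https://your-neptune-endpoint:8182')

response = client.execute_gremlin_profile_query(
    gremlinQuery="g.V().hasLabel('airport').out('route').count()"
)
print(response)  # the profile report includes whether the full query was optimized by Neptune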
6 replies
ATApache TinkerPop
Created by Andys1814 on 8/26/2024 in #questions
Very slow regex query (AWS Neptune)
To Dave's point... there are some data modeling strategies that you can use to eliminate the need for OpenSearch and use exact match references in your cluster:
1) If case is an issue, create an all lower- or upper-case version of the property value and use that to match.
2) If searching for a specific term (and when you know you might be looking for that term again in the future), just do a one-time search + add for that term. Properties in Neptune have a default cardinality of set, so each property can contain multiple values.
A good example of this is if I wanted to do a lookup of all movies in a database that contained "Star Wars: Episode" in the title.
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).values('title')
This query takes ~5.5s to run across all movies in my dataset. I can add Star Wars: Episode as an additional title property value to each of those via:
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).property('title','Star Wars: Episode')
Then I can just use a query like:
g.V().hasLabel('movie').has('title','Star Wars: Episode').values('title')
which now only takes 1.3ms to run to find the results. You could also store the exact-match term as a separate property value; I'm just storing it back into title for simplicity's sake here. This pattern is only beneficial when you know you're going to need to find things multiple times. For ad hoc searches, using OpenSearch is really the only solution. Neptune has no means, at present, to build an internal full-text-search index.
7 replies
ATApache TinkerPop
Created by Balan on 8/23/2024 in #questions
How can we extract values only
What client are you using? Are you using one of the Gremlin Language Variants (i.e. gremlin-python, Gremlin-Javascript, etc.)? If so, which one? Or are you using the AWS SDK and the NeptuneData execute_gremlin_query API? Each of these has its own method to specify the serialization format. You can choose a serialization format for GraphSON that does not return data types: https://tinkerpop.apache.org/docs/current/reference/#_graphson, as @Kennh mentioned. For example, with the gremlin-python client you would use something like:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
import gremlin_python.driver.serializer

g = traversal().with_remote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g',
        message_serializer=gremlin_python.driver.serializer.GraphSONUntypedMessageSerializerV1()))
With the NeptuneData API call in boto3:
import boto3

client = boto3.client('neptunedata', endpoint_url='https://your-neptune-endpoint:8182')  # placeholder endpoint

response = client.execute_gremlin_query(
    gremlinQuery='g.V().valueMap()',
    serializer='GraphSONUntypedMessageSerializerV1'
)
9 replies
ATApache TinkerPop
Created by Vitor Martins on 8/8/2024 in #questions
Optimizing connection between Python API (FastAPI) and Neptune
Sounds good. We're more than happy to help. We can keep the thread open. Also happy to jump on a call and talk things through, as needed.
11 replies
ATApache TinkerPop
Created by Vitor Martins on 8/8/2024 in #questions
Optimizing connection between Python API (FastAPI) and Neptune
3. Websockets are great if you have a workload that is constantly sending requests or if you're taking advantage of streaming data back to a client. However, they do come with the overhead of needing to maintain these connections and handle reconnects when they die. Generally these connections are long-lived in Neptune, except when they are idle (>20-25 minutes) or in the event that you're using IAM Authentication (we terminate any connection older than 10 days, idle or not). If your workload isn't taking advantage of websockets, you may want to consider moving to basic http requests instead. Neptune now has a NeptuneData API as part of the boto3 SDK that you can use to send requests to a Neptune endpoint as an http request without the need of a live websocket connection. https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/neptunedata.html

 4. Count queries in Neptune are full scans. So any time you need to do a groupCount().by(label) for all vertices, this is going to scan the entire database. If you just want a count of nodes or nodes with a certain property, there’s also the Summary API that you can use for this: https://docs.aws.amazon.com/neptune/latest/userguide/neptune-graph-summary.html
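For example, a minimal sketch using the boto3 neptunedata client (the endpoint is a placeholder and the response fields may differ by engine version):
import boto3

# Placeholder endpoint
client = boto3.client('neptunedata', endpoint_url='https://your-neptune-endpoint:8182')

# Instead of g.V().groupCount().by(label) (a full scan), read counts from the graph summary.
summary = client.get_propertygraph_summary(mode='detailed')
print(summary['payload']['graphSummary'])  # node/edge counts, labels, etc., without a scan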
11 replies