triggan
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
You've stumbled upon a common gap in Gremlin: has() steps cannot currently take a traversal as an argument. It's listed as a roadmap item for a future TinkerPop 4.x release: https://github.com/apache/tinkerpop/blob/087b3070914123055d3e4ededc2550f12715a0b4/docs/src/dev/future/index.asciidoc#has-traversal
11 replies
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
The data modeling looks a bit odd here. I'm not sure I would try to use any component of an edge ID as a filter. At that point, you're sort of attempting to use the edges to model some form of entity. This can be a bit of an anti-pattern. Edges are meant to represent relationships (actions, verbs) in a graph where nodes/vertices are meant to represent entities (nouns, things). If this is a common query pattern, you may want to look at further de-normalizing the data model and creating a labeled node of Tenant. Executing a query of
g.V(<client_id>).repeat(both(<list_of_edge_labels>).simplePath()).times(2).path()
should perform better than what you currently have.
11 replies
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
> just by calling g.V().limit(1) with concurrent calls on an r6g.2xlarge machine, the average time is 250ms

How many concurrent calls? An r6g.2xlarge instance has 8 vCPUs (and 16 available query execution threads). If you're issuing more than 16 requests in parallel, any additional concurrent requests will queue (an instance can queue up to 8,000 requests). You can see this with the /gremlin/status API (or the %gremlin_status Jupyter magic), which reports the number of executing queries and the number of "accepted" (queued) queries. If you need more concurrency, then you'll need to add more vCPUs (either by scaling up or by scaling out with read replicas).
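The sizing rule above can be sketched as a small calculation (a hedged illustration; the vCPU table below is a partial, assumed mapping, not an exhaustive or authoritative list):

```python
# Hedged sketch: Neptune sizes its query execution thread pool at 2x vCPUs,
# and an instance queues up to 8,000 additional requests beyond that.
VCPUS_BY_INSTANCE = {  # illustrative subset of instance sizes
    "r6g.xlarge": 4,
    "r6g.2xlarge": 8,
    "r6g.12xlarge": 48,
}
QUEUE_CAPACITY = 8000

def execution_threads(instance_type):
    """Query execution threads available on an instance (2x vCPUs)."""
    return VCPUS_BY_INSTANCE[instance_type] * 2

def queued_requests(instance_type, concurrent_requests):
    """How many of `concurrent_requests` wait in the queue rather than execute."""
    return max(0, concurrent_requests - execution_threads(instance_type))
```

So on an r6g.2xlarge, 100 concurrent requests means 16 executing and 84 queued; the queued ones show up as "accepted" in /gremlin/status.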
> But in the query mentioned, the bottleneck starts at the stage where it calls the last otherV() before path(). g.V().has(T.id, "client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").has(T.id, containing("tenant-id-uuid")).otherV().path().unfold().dedup().elementMap().toList()

Makes sense, as you're using a text predicate here (containing()). Neptune does not maintain a Full-Text Search index, so any use of text predicates such as containing(), startingWith(), endingWith(), etc. will incur some form of range scan and also require dictionary materialization (we lose all of the benefits of data compression here, as each value must be fetched from the dictionary to compare with the predicate value you've provided).
11 replies
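If the tenant id can be written as its own property on the edge at creation time (a modeling assumption; `tenant_id` is a hypothetical property key, not from the original schema), the text predicate drops out and the filter becomes an exact match against Neptune's native indexes:

```groovy
g.V('client-id-uuid').
  bothE('has_profile', 'has_affiliated', 'has_controlling').
  has('tenant_id', 'tenant-id-uuid').   // exact match: no range scan,
                                        // no dictionary materialization
  otherV().
  path().unfold().dedup().elementMap()
```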
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
This might be one of your issues: Neptune does not have a Full-Text Search index. Using any of the text predicates (i.e. containing(), startingWith(), endingWith(), etc.) will require de-dictionarifying the values for each of the solutions up to that portion of the query. If that is a common pattern, I might suggest using a different property so you can do just a has(key, value) filter.
11 replies
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
If you have some more details on the query, I might be able to help determine the best way to (re)write that to take advantage of the current indexes. Or potentially rework your data model to better fit within the existing indexes.
11 replies
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
Neptune's indexing structure is explained here: https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
11 replies
Apache TinkerPop
•Created by Alex on 12/11/2024 in #questions
How to create indexes by Label?
> the AWS Neptune experts suggested that I create some indexes

Which experts were these? Neptune doesn't support the creation of indexes beyond the three native indexes that are created by default. There's a fourth, optional index, but it's only needed for very specific use cases: https://docs.aws.amazon.com/neptune/latest/userguide/features-lab-mode.html#features-lab-mode-features-osgp-index
11 replies
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
The merge steps were released in 3.7.x.
11 replies
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
The merge steps are fairly new, so it's hard to say that this isn't something people "usually do"; we're still deriving patterns for how best to use those steps.
11 replies
Apache TinkerPop
•Created by Alex on 11/28/2024 in #questions
How to Work with Transactions with Gremlin Python
You may see write throughput exceed 120,000 objects per second in some cases. There are a number of dependencies that drive that, but that's the safe number to use when estimating load speed/rates.
10 replies
Apache TinkerPop
•Created by Alex on 11/28/2024 in #questions
How to Work with Transactions with Gremlin Python
If you're looking to optimize for write throughput on Neptune, you want to consider the following:
- For each write request, attempt to batch 100-200 "objects" into a single write request/query. An "object" is any combination of a vertex, edge, or subsequent vertex/edge properties (a vertex with 4 properties == 5 "objects").
- Use parallel write requests. If using Python, consider using multiprocessing to create separate processes. They can share a connection pool to Neptune if you so choose. The number of parallel processes should equal the number of query execution threads available on your Neptune writer instance (which is equal to 2x the number of vCPUs on whatever instance size you're using).
If you follow those guidelines, you should get performance similar to what you would see with Neptune's bulk loader. Note that conditional writes will have overhead: if using mergeV(), you're unlikely to see the same write throughput as Neptune's bulk loader, as the bulk loader does not do conditional writes.
Neptune's "top speed" for write throughput is going to be about 120,000 "objects" per second when writing vertices and vertex properties, and about half of that when writing edges (due to vertex reference checks when creating an edge). These numbers can only be attained with an x.12xlarge writer instance or larger. Smaller instances will scale linearly in terms of throughput.
10 replies
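The batching guideline above can be sketched in Python. This is a hedged illustration: the batch size, vCPU count, and the addV()-chain query builder are assumptions for the sketch, and actual submission (e.g. via gremlin-python over a shared connection pool, one multiprocessing worker per writer execution thread) is left out:

```python
from itertools import islice

BATCH_SIZE = 150          # aim for 100-200 "objects" per write query
WRITER_VCPUS = 8          # hypothetical writer instance size
NUM_WORKERS = WRITER_VCPUS * 2   # one worker per query execution thread

def chunk(items, size=BATCH_SIZE):
    """Yield lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def batch_to_query(vertices):
    """Fold one batch of (id, label) pairs into a single addV() chain,
    so the whole batch is one write request."""
    steps = "".join(
        f".addV('{label}').property(id, '{vid}')" for vid, label in vertices
    )
    return "g" + steps

# Usage sketch: each query string would be submitted on its own connection,
# e.g. via multiprocessing.Pool(NUM_WORKERS) mapping over the batches.
vertices = [(f"v{i}", "person") for i in range(400)]
batches = list(chunk(vertices))
queries = [batch_to_query(b) for b in batches]
```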
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
The issue here is seeing duplicates when trying to find the matching pairs to create the edges. I have a solution that may be close, but it creates duplicate edges (one in each direction):
Note that this will not work in Gremlify, as it uses the mergeE() step that was introduced in 3.7.x. I tested this on Neptune, though, and it works fine. It takes a bit of Gremlin "hackery" to create the map that you pass into the mergeE() step at the end.
Basically, this does a cartesian join of all vertices with label "E" to all other vertices of label "E" and then filters on pairs that have different IDs but the same property of "bName" or "cName". At that point, you end up with a list of maps of paired vertices. That then needs to be converted into the map format supported by mergeE(), which all of the merge() steps accomplish.
11 replies
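A hedged sketch of the pairing pattern described above (the 'linked' edge label and the id-ordering trick are illustrative assumptions, not the original solution):

```groovy
// Pair up 'E' vertices that share a 'bName' value, keeping each pair once.
// Ordering by id drops the reverse (b,a) pair, which is what otherwise
// produces the duplicate edge in each direction.
g.V().hasLabel('E').as('a').
  V().hasLabel('E').as('b').
  where('a', lt('b')).by(id).         // keep only (a,b) with id(a) < id(b)
  where('a', eq('b')).by('bName').    // same 'bName' on both vertices
  select('a', 'b')

// mergeE() then needs each pair reshaped into a map of this form:
// [(T.label): 'linked', (Direction.from): <id of a>, (Direction.to): <id of b>]
```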
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
Nvm... I think I see it now. You're trying to connect vertices based on common properties. Does direction matter?
11 replies
Apache TinkerPop
•Created by Coldfire on 12/10/2024 in #questions
Parameterized edges creation in existing graph
Are you trying to create a fully connected graph from all vertices with a label of E?
11 replies
Apache TinkerPop
•Created by Alex on 11/28/2024 in #questions
How to Work with Transactions with Gremlin Python
> I received this suggestion to use transactions to try to get more performance than using query strings, that's why I'm trying to implement it and check the difference in performance.

Unsure where this suggestion is coming from. What sort of performance gain are you looking for?
If you're using Gremlin Server, what backing store are you using? TinkerGraph? If so, ensure you're using TinkerTransactionGraph.
There's more on how to use TinkerTransactionGraph for unit testing of transactions here: https://aws.amazon.com/blogs/database/unit-testing-apache-tinkerpop-transactions-from-tinkergraph-to-amazon-neptune/
10 replies
Apache TinkerPop
•Created by Wolfgang Fahl on 10/16/2024 in #questions
pymogwai
Yes, referring to porting the Java implementation of TinkerGraph to other runtimes. Not totally sure of the issues involved in doing this; more likely an issue of prioritization. But having TinkerGraph native in a runtime would open the door to a few things that you can only do with TinkerGraph in Java. For example, the use of subgraph() in Gremlin-Java returns a TinkerGraph object that you can then issue queries against. It's a common pattern when you want to return a subgraph locally (as a cache) and run queries against the locally cached subgraph. Today, if you use subgraph() in a non-Java client, you get back different representations of the subgraph via the different serializers, usually in GraphSON or some form of map/JSON.
6 replies
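For reference, the Java-only pattern mentioned above looks roughly like this in the Gremlin console (the vertex id and 'knows' edge label are assumptions for the sketch):

```groovy
// Extract a 2-hop neighborhood as an in-memory TinkerGraph...
localGraph = g.V(1).
  repeat(bothE('knows').subgraph('sg').otherV()).times(2).
  cap('sg').next()

// ...then run further queries against the local, cached copy:
local = traversal().withEmbedded(localGraph)
local.V().count()
```

In gremlin-python, the same cap('sg') instead comes back as a serialized representation rather than a queryable graph.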
Apache TinkerPop
•Created by Wolfgang Fahl on 10/16/2024 in #questions
pymogwai
I would agree that this is interesting. There have been a number of situations where having TinkerGraph available in other runtimes would be useful. I'm curious why you chose to implement something different versus looking to add TinkerGraph to gremlinpython.
6 replies
Apache TinkerPop
•Created by Alex on 10/10/2024 in #questions
Neptune Cluster Balancing Configuration
Yes, so that is using websockets (although, if you're using Neptune, the connection string should start with wss, as Neptune is SSL/TLS only).
5 replies
Apache TinkerPop
•Created by Alex on 10/10/2024 in #questions
Neptune Cluster Balancing Configuration
What are you using to send queries to Neptune? Are you using the gremlin-python client and connecting via websockets? If so, each websocket connection is going to act like a "sticky session". It will connect to the same instance for the life of the connection.
The reader endpoint is a DNS endpoint that is configured to resolve to a different read replica approximately every 5 seconds. So depending on when you establish your websocket connections or if you're just sending http requests, those could all go to the same instance if sent in quick succession.
Customers have solved this in a number of ways. Some will create load balancers in front of Neptune read replicas that can more precisely "load balance" requests across the instances.
We also created a version of the Gremlin Java client that establishes connection pools across multiple reader instances: https://github.com/aws/neptune-gremlin-client
Doing this in Python with the Gremlin Python client is not as straightforward as with the Java client. The Java client has the concept of a "cluster", whereas the Python client does not. So you may need to build a routing mechanism that creates connections to each reader instance directly (each instance has its own instance endpoint) and iterates across those connections in round-robin fashion if you want even distribution.
Totally understand this is a pain and we've been discussing ways to address this.
5 replies
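A minimal sketch of that round-robin routing, assuming you already know your instance endpoints (the endpoint strings below are hypothetical; in practice each entry would back a gremlin-python DriverRemoteConnection):

```python
from itertools import cycle

# Hypothetical per-instance reader endpoints: each Neptune instance exposes
# its own endpoint alongside the cluster and reader endpoints.
READER_ENDPOINTS = [
    "wss://reader-1.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "wss://reader-2.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
    "wss://reader-3.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin",
]

_rotation = cycle(READER_ENDPOINTS)

def next_endpoint():
    """Round-robin across reader instances. Since each websocket connection
    is sticky to one instance, spreading *connections* across endpoints is
    what actually balances the load."""
    return next(_rotation)
```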
Apache TinkerPop
•Created by Max on 9/28/2024 in #questions
Best practices for local development with Neptune.
Many of the differences there reflect the opinions of the person who wrote that blog post. When I create a Gremlin Server docker container to emulate Neptune (as closely as possible), I typically just use something like the following Dockerfile, with comments explaining the changes:
11 replies