triggan
Apache TinkerPop
Created by Andys1814 on 8/26/2024 in #questions
Very slow regex query (AWS Neptune)
To Dave's point... there are some data modeling strategies that you can use to eliminate the need for OpenSearch and use exact-match references in your cluster: 1) If case is an issue, create an all-lowercase or all-uppercase version of the property value and use that to match. 2) If you're searching for a specific term (and you know you might be looking for that term again in the future), just do a one-time search-and-add for that term. Properties in Neptune have a default cardinality of set, so each property can contain multiple values. A good example: suppose I want to look up all movies in a database containing "Star Wars: Episode" in the title.
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).values('title')
This query takes ~5.5s to run across all movies in my dataset. I can add "Star Wars: Episode" as an additional title property value to each of those via:
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).property('title','Star Wars: Episode')
Then I can just use a query like:
g.V().hasLabel('movie').has('title','Star Wars: Episode').values('title')
which now takes only 1.3ms to find the results. You could also store the exact-match term as a separate property value; I'm just storing it back into title for simplicity's sake here. This pattern is only beneficial when you know you're going to need to find things multiple times. For ad hoc searches, using OpenSearch is really the only solution, as Neptune currently has no means to build an internal full-text search index.
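As a minimal gremlin-python sketch of strategy 1 (the titleLower property name and the movie title are just illustrations, assuming an existing traversal source g):

# store an all-lowercase copy of the title at write time
g.addV('movie').property('title', 'Star Wars: Episode IV') \
 .property('titleLower', 'star wars: episode iv').iterate()

# exact match against the lowercased copy -- no regex scan needed
g.V().hasLabel('movie').has('titleLower', 'star wars: episode iv').values('title').toList()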
6 replies
Apache TinkerPop
Created by Balan on 8/23/2024 in #questions
How can we extract values only
What client are you using? Are you using one of the Gremlin Language Variants (i.e. gremlin-python, Gremlin-Javascript, etc.)? If so, which one? Or are you using the AWS SDK and the NeptuneData execute_gremlin_query API? Each of these has its own method to specify the serialization format. You can choose a serialization format for GraphSON that does not return data types: https://tinkerpop.apache.org/docs/current/reference/#_graphson, as @Kennh mentioned. For example, with the gremlin-python client you would use something like:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver import serializer

g = traversal().with_remote(DriverRemoteConnection('ws://localhost:8182/gremlin', 'g',
        message_serializer=serializer.GraphSONUntypedMessageSerializerV1()))
With the NeptuneData API call in boto3:
response = client.execute_gremlin_query(
    gremlinQuery='g.V().valueMap()',
    serializer='GraphSONUntypedMessageSerializerV1'
)
5 replies
Apache TinkerPop
Created by Vitor Martins on 8/8/2024 in #questions
Optimizing connection between Python API (FastAPI) and Neptune
Sounds good. We're more than happy to help. We can keep the thread open. Also happy to jump on a call and talk things through, as needed.
11 replies
Apache TinkerPop
Created by Vitor Martins on 8/8/2024 in #questions
Optimizing connection between Python API (FastAPI) and Neptune
3. Websockets are great if you have a workload that is constantly sending requests or if you're taking advantage of streaming data back to a client. However, they do come with the overhead of needing to maintain the connections and handle reconnects when they die. These connections are generally long-lived in Neptune, except when they sit idle (>20-25 minutes) or when you're using IAM Authentication (we terminate any connection older than 10 days, idle or not). If your workload isn't taking advantage of websockets, you may want to consider moving to basic HTTP requests instead. Neptune now has a NeptuneData API as part of the boto3 SDK that you can use to send requests to a Neptune endpoint as HTTP requests without the need for a live websocket connection (see the sketch after point 4): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/neptunedata.html

4. Count queries in Neptune are full scans, so any time you need to do a groupCount().by(label) for all vertices, it is going to scan the entire database. If you just want a count of nodes, or of nodes with a certain property, there's also the Summary API that you can use for this: https://docs.aws.amazon.com/neptune/latest/userguide/neptune-graph-summary.html
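Both of those might look like this from Python. A minimal sketch: the endpoint URL is a placeholder, and get_propertygraph_summary is my reading of the NeptuneData docs linked above.

import boto3

# NeptuneData sends each request over HTTPS -- no websocket to keep alive
client = boto3.client('neptunedata', endpoint_url='https://your-neptune-endpoint:8182')

# point 3: a Gremlin query as a one-shot HTTP request
response = client.execute_gremlin_query(gremlinQuery='g.V().limit(1).valueMap()')

# point 4: node/edge counts from the Summary API instead of a full-scan count query
summary = client.get_propertygraph_summary(mode='basic')
print(summary['payload']['graphSummary'])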
11 replies
Apache TinkerPop
Created by Vitor Martins on 8/8/2024 in #questions
Optimizing connection between Python API (FastAPI) and Neptune
There's a lot to unpack here....

1. We state in our docs that t4g.medium instances are really not great for production workloads. We support them for initial development so users can keep costs down, but the amount of resources available, and the fact that they are burstable instances, really constrains their usability. Once you've used up your CPU credits, you're going to get throttled.

2. Neptune's concurrency model is based on instance size and the number of vCPUs per instance. For each vCPU, there are two query execution threads. A t4g.medium or an r6g.large instance has 2 vCPUs, so that instance can only be computing 4 concurrent requests at a time. If you need more concurrency, you should look to scale to a larger instance with more vCPUs. If your workload varies over time, you may want to investigate Neptune Serverless, which can automatically scale vertically to meet the needs of the application. There's a good presentation from last year's re:Invent that discusses when Serverless works best and when not to use it: https://youtu.be/xAdWa0Ahiok?si=OeSe-_L3ErcYH-XU Similarly, the connection pool size would likely mirror the number of available execution threads on the instance(s). You can send more requests to a Neptune instance than it can currently process, but those additional requests will be queued (up to ~8,000 requests can end up in the request queue, which you can monitor via the MainRequestQueuePendingRequests CloudWatch metric).
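On point 2, a minimal gremlin-python sketch (the endpoint is a placeholder) of sizing the client connection pool to the instance's execution threads:

from gremlin_python.driver.client import Client

# r6g.large: 2 vCPUs x 2 query threads each = 4 concurrent queries,
# so a pool of 4 keeps requests from queuing client-side
client = Client('wss://your-neptune-endpoint:8182/gremlin', 'g', pool_size=4)
results = client.submit("g.V().limit(1).valueMap()").all().result()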
11 replies
Apache TinkerPop
Created by ManabuBeach on 8/6/2024 in #questions
Breadth-First Traversal Fact Check
Neptune uses BFS as the default traversal strategy. You can change how a repeat() is executed via the repeatMode query hint, as noted here: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-query-hints-repeatMode.html
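A short gremlin-python sketch of that hint, assuming an existing traversal source g (the start vertex id and depth are placeholders):

from gremlin_python.process.graph_traversal import __

# switch this repeat() from Neptune's default BFS to DFS via the query hint
names = (g.withSideEffect('Neptune#repeatMode', 'DFS')
          .V('v1')
          .repeat(__.out()).times(3)
          .values('name').toList())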
4 replies
Apache TinkerPop
Created by spc16670 on 6/12/2024 in #questions
Dynamically calculated values from last vertex to select outgoing edges for further traversal
Curious what you're using as min/max NCU settings when using Serverless.
13 replies
Apache TinkerPop
Created by spc16670 on 6/12/2024 in #questions
Dynamically calculated values from last vertex to select outgoing edges for further traversal
Neptune Database was designed as more of an OLTP datastore that does well with workloads that require high concurrency and more "graph local" queries. Neptune can run more OLAP-ish queries, but it may require setting longer timeouts or breaking the query into smaller chunks and submitting those as concurrent queries. Neptune Analytics was designed more for OLAP needs, specifically running common, built-in graph algorithms against an entire graph. So there is a Connected Components algo in Neptune Analytics that can be used for the purpose that you mention. You can create a Neptune Analytics graph from source data in S3 or by importing directly from a Neptune Database cluster. Data loads in Neptune Analytics are also much faster than in Neptune Database, as the data is stored and indexed differently for executing graph algorithms. The one caveat with Neptune Analytics (NA) is that it currently only supports openCypher. We are actively working on Gremlin support for NA.
13 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
g.addV('domain').property('language', ['English', 'Hindi']).property('name', 'aim')
32 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
Perhaps this is some oddity with JanusGraph, then, as TinkerGraph in the Gremlin Console does just what you would expect:
gremlin> res = g.V(0L).properties('language').value().next()
==>English
==>Hindi
gremlin> res.getClass()
==>class java.util.ArrayList
gremlin> res = g.V(0L).properties('name').value().next()
==>aim
gremlin> res.getClass()
==>class java.lang.String
gremlin>
32 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
Or as a "stringified" list?
32 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
I don't quite understand what your desired outcome is here. Do you want the result returned as a list or as a string?
32 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
n would be a number. As in, n=3 to return 3 of the next results in a list.
32 replies
Apache TinkerPop
Created by Aiman on 6/19/2024 in #questions
Expecting java.lang.ArrayList/java.lang.List instead of java.lang.String. Where am I going wrong?
Might be next() causing the issue there... what do you get if you use toList() instead? Or next(n), which returns n results in a list.
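In gremlin-python terms, a quick sketch of the difference (assuming a traversal source g):

first = g.V().values('name').next()       # a single value
some = g.V().values('name').next(5)       # the next 5 results, as a list
all_names = g.V().values('name').toList() # every result, as a list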
32 replies
Apache TinkerPop
Created by spc16670 on 6/12/2024 in #questions
Dynamically calculated values from last vertex to select outgoing edges for further traversal
intersect() is a new step that was added in TinkerPop 3.7.x, and Neptune engine version 1.3.2.0 supports the 3.7.x steps. In the meantime, you may also be able to leverage the within() predicate, such as has('propKey',within(<some_list>))
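A small gremlin-python sketch of the within() approach (the property key and values here are placeholders, assuming a traversal source g):

from gremlin_python.process.traversal import P

# exact membership test against a supplied list of values
matches = g.V().has('propKey', P.within('valueA', 'valueB', 'valueC')).toList()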
13 replies
Apache TinkerPop
Created by spc16670 on 6/12/2024 in #questions
Dynamically calculated values from last vertex to select outgoing edges for further traversal
I haven't had time to circle back on this today. You may want to investigate patterns that use sack() or aggregate() within a repeat() to solve this: https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#sackintro In @Kelvin Lawrence's examples, he is summing a series of numbers using a withSack(0), but you can also gather a list of items using withSack([]). You may be able to leverage that along with the newer intersect() step to find when types from previous nodes are seen on a node in the path. If I have time tomorrow, I'll try to circle back on this, but hopefully the details above give you (or someone else here) something to go on.
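A hedged gremlin-python sketch of the list-gathering half of that idea (the start vertex, 'type' property, and depth are all placeholders, and I haven't tested this against Neptune):

from gremlin_python.process.traversal import Operator
from gremlin_python.process.graph_traversal import __

# each traverser starts with an empty list and folds the 'type' values of
# every vertex it visits into its sack as it walks out()
seen_types = (g.withSack([])
               .V('v1')
               .repeat(__.sack(Operator.addAll).by(__.values('type').fold())
                        .out())
               .times(3)
               .sack().toList())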
13 replies
Apache TinkerPop
Created by KP on 6/10/2024 in #questions
Window functions in gremlin
Interesting, as I really don't think of window functions that much in the graph world. It's typically more related to tabular or time-series data. Can you explain a bit more about what you're attempting to do with the data in a graph as opposed to it being in a relational database?
There are iterative aggregation patterns in graph. In Gremlin, there is the sack() step that allows you to gather up a set of values as a traversal is executed and then act on that "sack" of values at the end of the query. https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#sackintro
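As a quick gremlin-python illustration of that sack() pattern, mirroring the air-routes data used in the linked guide (the 'airport' label and 'dist' edge property come from that dataset; g is an existing traversal source):

from gremlin_python.process.traversal import Operator
from gremlin_python.process.graph_traversal import __

# every traverser carries a running total, updated from each edge's 'dist'
totals = (g.withSack(0)
           .V().hasLabel('airport')
           .repeat(__.outE().sack(Operator.sum_).by('dist').inV())
           .times(2)
           .sack().toList())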
3 replies
Apache TinkerPop
Created by RuS2m on 6/1/2024 in #questions
Analyzing samples of Gremlin Queries in Neptune Notebook
If you're not familiar with SPOG and what that equates to in Neptune, here's the document that explains it: https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
9 replies
Apache TinkerPop
Created by RuS2m on 6/1/2024 in #questions
Analyzing samples of Gremlin Queries in Neptune Notebook
I think this is going to depend on how granular you want to get. If the intent is to see which labeled vertices or edges are accessed, then just looking at a query in the audit log would be sufficient. But if your intent is to see every atomic component that is accessed in the database as part of query execution, that could be expensive. It is possible, though. You could run every query through the Neptune Gremlin Profiler (https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html) with profile.indexOps set to True, and you'll get a section at the bottom of the profile output listing every index operation that occurs. These will equate to some permutation of the S-P-O-G patterns that are used in the three built-in indexes (or a fourth index, if enabled).
With the list of indexed lookup patterns, you could possibly maintain an external counter (maybe a sorted set in Redis/Valkey) with a key of the S-P-O-G combination and the value being the number of times it was accessed. Just be aware that obtaining a Neptune Gremlin profile output requires that you run the query again. So you may not be able to use this to capture writes (without rewriting the data), and it will incur additional database resources to re-run all of the read queries.
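A hedged boto3 sketch of pulling that profile output (execute_gremlin_profile_query and its indexOps flag are my reading of the NeptuneData docs; the endpoint is a placeholder):

import boto3

client = boto3.client('neptunedata', endpoint_url='https://your-neptune-endpoint:8182')

# profile the query with index operations included; the per-index counts
# (the S-P-O-G permutations mentioned above) appear at the bottom of the output
profile = client.execute_gremlin_profile_query(
    gremlinQuery="g.V().hasLabel('movie').count()",
    indexOps=True)
print(profile['output'].read().decode('utf-8'))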
9 replies
Apache TinkerPop
Created by M. alhaddad on 5/8/2024 in #questions
Using dedup with Neptune
I guess what I'm getting at is that I don't know of a way to make dedup() any more performant in that sort of query with Neptune's current implementation.
As far as pagination goes, have you tried using Neptune's Query Results Cache instead of making multiple range() calls? That would significantly decrease latency for subsequent calls as you paginate across the results: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html
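A hedged gremlin-python sketch of that approach (the hint name comes from the linked docs; the label and page size are placeholders, assuming a traversal source g):

# the first page executes the query and caches the full ordered result;
# subsequent pages with different range_() bounds read from the cache
page = (g.with_('Neptune#enableResultCache', True)
         .V().hasLabel('movie').values('title')
         .order()
         .range_(0, 100)    # range_() is gremlin-python's name for range()
         .toList())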
11 replies