Dynamically calculated values from last vertex to select outgoing edges for further traversal
Hi - I asked this question on SO (https://stackoverflow.com/questions/78611365/neptune-graph-traversal-that-uses-dynamically-calculated-values-from-last-vertex) but there is no answer yet. I am really interested in whether this is a scenario Gremlin can handle. I am not very well versed in TinkerPop, so I am not sure whether I am just trying to build a query that is complex, or asking for a case that is simply not supported in TinkerPop. I work for a company that is seriously considering onboarding onto a Gremlin-supported database, but we are not sure whether it will really support all our data access patterns. Any help with that SO question would be greatly appreciated.
I haven't had time to circle back on this today. You may want to investigate patterns that use sack() or aggregate() within a repeat() to solve this: https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html#sackintro
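As a rough, untested sketch of the shape of that pattern (using a hypothetical 'weight' property on the edges, in the style of the examples in that chapter):

g.withSack(0).                         // each traverser starts with its own sack of 0
  V('A').
  repeat(outE().
         sack(sum).by('weight').       // add this edge's 'weight' to the sack
         inV()).
    times(3).
  sack()                               // emit each traverser's running total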
In @Kelvin Lawrence 's examples, he is summing a series of numbers using withSack(0), but you can also gather a list of items using withSack([]). You may be able to leverage that along with the newer intersect() step to find when types from previous nodes are seen on a node in the path.
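Completely untested, but the shape I have in mind is something like the following (it assumes a multi-valued 'types' property on each vertex, and that intersect() will resolve a traversal argument such as sack() - worth verifying both):

g.withSack([]).
  V('A').
  sack(assign).by(values('types').fold()).     // seed the sack with the start vertex's types
  repeat(out().
         // keep only neighbours whose types overlap this traverser's sack
         filter(values('types').fold().
                intersect(sack()).
                unfold()).
         // replace the sack with the current vertex's types for the next hop
         sack(assign).by(values('types').fold())).
    until(outE().count().is(0)).
  path()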
If I have time tomorrow, I'll try to circle back on this, but hopefully the details above give you (or someone else here) something to go on.

Thank you - I would really appreciate help with this. I have found a good Gremlin tutorial online (for DataStax), and Kelvin's book is very much on my reading list, but my company is looking for quick answers at this stage, and I can only see myself diving deep into Gremlin once a decision has been made to adopt Neptune for production use. Also, whatever query I put together will probably be less performant than one devised by someone with Gremlin experience - so even if I arrive at a query that gives the expected results, I would still seek consultation on whether it traverses the graph in the most performant way.
Hi @triggan - thanks to your suggestions I think I have something that is working:
I just have to figure out how to supply the initial value and add some tests for edge cases.
I have come across several problems now: 1) it does not look like Neptune has support for intersect(); 2) values("types").fold() folds the types for all nodes (on the frontier - is that how we say it?), and I want to evaluate the types per step locally - do I need to factor local() in somehow?
I think intersect() is not supported because I am not using the latest version of Neptune - I will try with a newer version.
intersect() is a new step that was added in 3.7.x. Neptune engine version 1.3.2.0 supports those steps.
In the meantime, you may also be able to leverage the within() predicate, such as has('propKey', within(<some_list>)).
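For example (hypothetical property and values), keeping only neighbours whose 'types' contain one of a fixed list:

g.V('A').
  out().
  has('types', within('x', 'y', 'z'))   // passes if any 'types' value is in the list

The limitation is that within() needs its list up front as a parameter, whereas intersect() can compare against values computed during the traversal.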
I have parked work on this, as we are seeing many timeouts with other queries we run... We are running Serverless, and with the timeout set to 30 minutes the queries still time out. The repeat() loop seems to dive 6 levels deep and then it just hangs... Our requirement is to discover all nodes a particular node can be connected to, and Neptune does not seem to be fit for the task... Do we have to look at some Spark-based solutions?
There are only 660k nodes and 1.03 million edges.
Neptune Database was designed as more of an OLTP datastore that does well with workloads requiring high concurrency and more "graph local" queries. Neptune can run more OLAP-ish queries, but that may require setting longer timeouts or breaking the query into smaller chunks and submitting those as concurrent queries.
Neptune Analytics was designed more for OLAP needs, specifically running common, built-in graph algorithms against an entire graph. So there is a Connected Components algorithm in Neptune Analytics that can be used for the purpose that you mention. You can create a Neptune Analytics graph from source data in S3 or by importing directly from a Neptune Database cluster. Data loads in Neptune Analytics are also much faster than in Neptune Database, as the data is stored and indexed differently for executing graph algorithms. The one caveat with Neptune Analytics (NA) is that it currently only supports openCypher. We are actively working on Gremlin support for NA.
Curious what you're using as min/max NCU settings when using Serverless.
These are my NCU settings:
I could reduce the frontier if I could implement the logic from the SO question - but I am not sure sack() does the job here. I need the intersection results kept per traverser when one repeat() ends and a new repeat() starts, so that the next one continues from the vertex traversals that have the intersected values from the last repeat() associated with them. I think sack() is merging all these calculated intersections into one array and passing the whole bag to the next iteration.