jessea
Apache TinkerPop
Created by jessea on 3/21/2023 in #questions
Extracting the ProjectStep of a GraphTraversal instance during unit testing
Tl;dr
Given an instance of GraphTraversal<?, Map<String, Object>>, is it possible to extract the ProjectStep<?> that produces the Map<String, Object> inferred by the type system, i.e. the map that would be returned once a terminal step is applied? We only want to access the .projectKeys() of the ProjectStep, so we don't need to actually execute the traversal. It can be assumed that we are always dealing with an instance of GraphTraversal<?, Map<String, Object>>, and we do not have access to the actual graph in this environment.

Context
We have a system written in Java that defines Gremlin traversals to be run against a TinkerPop graph (Neptune, but that is not relevant for this question). These queries should be written so that they return Map<String, Object> (i.e. using project()) with a specific set of keys which are defined alongside the query. We aren't yet using a custom DSL, and building queries programmatically to ensure the project step is present causes other problems.

Why are we doing this?
We'd like to give immediate (build-time) feedback to developers when the query they've written is missing an important key, which would otherwise take a deployment and some waiting time to discover. This key must be present, as the query will later be executed by an automated system which will try to extract a value from the Map using that same key.

What have we tried so far?
We've actually managed to do this using the mock example here to capture project steps, and recursively capture any steps which might include a ProjectStep: https://github.com/apache/tinkerpop/blob/0c382bb7ec345f2758bee207d62d66f95c475a78/gremlin-core/src/test/java/org/apache/tinkerpop/gremlin/process/traversal/dsl/graph/GraphTraversalSourceTest.java#L107-L153

We've accomplished this by matching against the invoked method of the mocked GraphTraversal to detect either calls to the project method directly, or calls to TraversalParent steps which may contain child steps (like local), and from there recursively checking all steps in search of a project. At that point, we can gather the projectKeys of the ProjectStep. Happy to share some snippets of this if you're curious.

However, our implementation identifies any usage of project anywhere in the query, not necessarily the final one whose result will be returned by the server once a terminating step is applied. It's technically 'good enough', as it will always catch cases where the required key is never mentioned in the traversal, but it's possible that the query uses multiple projects and the required key is misplaced, so it won't be part of the final map returned when the query is executed.

Thoughts
In theory, I'm thinking this must be possible, since as soon as you use project in a traversal, Java is smart enough to understand that you are now working with a GraphTraversal<?, Map<String, Object>>. This might sometimes happen inside a TraversalParent like local(), but the type system can still infer that it will receive a Map when terminated. Is there a method (or collection of methods) which would let us grab the 'last effective step' which returns that Map? The solution can be hacky; this is test-case code, so there's room for some jank here 🙂 Thanks, folks!
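To make the 'last effective step' idea concrete, here is a minimal, library-free sketch of the lookup: start at the end step, and if it is a parent step with a single child traversal (like local), descend into that child's end step. ModelStep and EndStepSketch are hypothetical stand-ins invented for this illustration, not TinkerPop classes. In TinkerPop terms the analogous calls would presumably be traversal.asAdmin().getEndStep() and TraversalParent.getLocalChildren(), but treat that mapping as an assumption to verify against your version. The sketch also ignores trailing steps that keep the Map type (filters, ordering) and parents with several children.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Library-free stand-in for a traversal step. NOT a TinkerPop class.
class ModelStep {
    final String name;
    final List<String> projectKeys;        // keys, only meaningful for a project step
    final List<List<ModelStep>> children;  // child traversals of a parent step

    private ModelStep(String name, List<String> projectKeys, List<List<ModelStep>> children) {
        this.name = name;
        this.projectKeys = projectKeys;
        this.children = children;
    }

    static ModelStep plain(String name) {
        return new ModelStep(name, Collections.<String>emptyList(),
                Collections.<List<ModelStep>>emptyList());
    }

    static ModelStep project(String... keys) {
        return new ModelStep("project", Arrays.asList(keys),
                Collections.<List<ModelStep>>emptyList());
    }

    static ModelStep parent(String name, List<ModelStep> child) {
        return new ModelStep(name, Collections.<String>emptyList(),
                Collections.singletonList(child));
    }

    boolean isProject() {
        return "project".equals(name);
    }
}

class EndStepSketch {
    // Returns the keys of the project step that produces the final result,
    // or empty if the traversal does not end in one.
    static Optional<List<String>> finalProjectKeys(List<ModelStep> traversal) {
        if (traversal.isEmpty()) {
            return Optional.empty();
        }
        ModelStep end = traversal.get(traversal.size() - 1);
        if (end.isProject()) {
            return Optional.of(end.projectKeys);
        }
        // A single-child parent (e.g. local) emits whatever its child ends with.
        if (end.children.size() == 1) {
            return finalProjectKeys(end.children.get(0));
        }
        return Optional.empty();
    }
}
```

With the steps V, project("misplaced"), local(out, project("id", "name")), finalProjectKeys returns the keys id and name and ignores the earlier project, which is the distinction our current mock-based check cannot make.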
16 replies
Apache TinkerPop
Created by jessea on 1/26/2023 in #questions
Limiting .path() results to a number of valid starting vertices
Hey folks, for context, we're using AWS Neptune, and Neptune Notebook for visualisation. We would like to visualise neighbourhoods of data that meet the following criteria:
1. We would like an exact number of neighbourhoods, e.g. 5 distinct neighbourhoods
2. We only want to consider neighbourhoods that have a particular node in their tree

Example: Let's say the 'root' of a neighbourhood is any Foo vertex, and we want to graph the neighbourhoods which include a Bar vertex in their tree (through some explicit traversal). The important point here is that a neighbourhood starting from Foo might not contain a Bar vertex in its tree, in which case we want to skip it and find another. We know that this query will graph all valid neighbourhoods:
g.V().hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
But let's say we want exactly 5 valid neighbourhoods instead. If we write g.V().hasLabel("Foo").limit(5)..., we are not guaranteed that all 5 Foo vertices will actually lead to Bar: sometimes the traversal never makes it to a Bar from one of the randomly chosen Foo starting vertices, and we are left with fewer than 5 neighbourhoods. Placing the limit at the end, e.g. ....out("c").hasLabel("Bar").limit(5), limits the paths returned rather than the starting vertices. The way we've made this work is to perform a seemingly redundant filter at the beginning to validate that we're starting from a Foo that definitely leads to Bar, but there must be a simpler way of expressing it:
g.V().where(hasLabel("Foo")
.out("a")
.in("b")
.out("c").hasLabel("Bar"))
.limit(5)
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
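To spell out why the placement of the limit matters, here is a small plain-Java model (not Gremlin) of the three options: limiting the start vertices first, limiting the paths at the end, and filtering the start vertices before limiting them. The graph contents, path strings, and the NeighbourhoodLimitSketch class are all made up for illustration. Each Foo is mapped directly to the paths that reach a Bar from it.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class NeighbourhoodLimitSketch {
    // Hypothetical data: each Foo start vertex and the paths from it that end in a Bar.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("foo1", Collections.<String>emptyList());
        g.put("foo2", Arrays.asList("foo2>x>y>bar1", "foo2>x>y>bar2", "foo2>x>z>bar1"));
        g.put("foo3", Collections.<String>emptyList());
        g.put("foo4", Arrays.asList("foo4>x>y>bar3"));
        return g;
    }

    // Like hasLabel("Foo").limit(n)...: some chosen starts may never reach a Bar.
    static List<String> limitStartsFirst(Map<String, List<String>> g, int n) {
        return g.entrySet().stream()
                .limit(n)
                .flatMap(e -> e.getValue().stream())
                .collect(Collectors.toList());
    }

    // Like ...hasLabel("Bar").limit(n): caps the number of paths, not of start vertices.
    static List<String> limitPathsLast(Map<String, List<String>> g, int n) {
        return g.values().stream()
                .flatMap(List::stream)
                .limit(n)
                .collect(Collectors.toList());
    }

    // Like where(<reaches a Bar>).limit(n)...: keep only valid starts, then cap them.
    static List<String> filterThenLimit(Map<String, List<String>> g, int n) {
        return g.entrySet().stream()
                .filter(e -> !e.getValue().isEmpty())
                .limit(n)
                .flatMap(e -> e.getValue().stream())
                .collect(Collectors.toList());
    }

    // Number of distinct start vertices (neighbourhoods) present in a list of paths.
    static long neighbourhoods(List<String> paths) {
        return paths.stream()
                .map(p -> p.substring(0, p.indexOf('>')))
                .distinct()
                .count();
    }
}
```

With a limit of 2 on this sample data, only filterThenLimit yields two neighbourhoods (foo2 and foo4) with all of their paths; the other two variants both end up with foo2 alone.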
We have experimented with grouping by the starting vertex and limiting this, but it seems this does not execute lazily and first collects all neighbourhoods before grouping and limiting:
g.V().hasLabel("Foo").as("start")
.out("a")
.in("b")
.out("c").hasLabel("Bar")
.path()
.group().by(select("start").values("id"))
.select(values).unfold()
.limit(5)
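The laziness concern can be modelled in plain Java streams (again, not Gremlin): a grouping step has to consume every incoming element before it can emit its map, so a limit placed after it cannot stop the work early, whereas a filter followed by a limit stops as soon as enough valid starts are found. The data and the LazyLimitSketch class are hypothetical, and the counter only tracks how many start vertices get looked at.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

class LazyLimitSketch {
    // Hypothetical start vertices; the flag says whether a Bar is reachable from them.
    static Map<String, Boolean> starts() {
        Map<String, Boolean> m = new LinkedHashMap<>();
        m.put("foo1", false);
        m.put("foo2", true);
        m.put("foo3", false);
        m.put("foo4", true);
        return m;
    }

    // Group first, limit afterwards: every start vertex is visited before the limit applies.
    static int visitedWhenGroupingFirst(int n) {
        AtomicInteger visited = new AtomicInteger();
        Map<String, List<String>> grouped = starts().entrySet().stream()
                .peek(e -> visited.incrementAndGet())
                .filter(e -> e.getValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.groupingBy(k -> k, LinkedHashMap::new, Collectors.toList()));
        // The limit only trims the finished map; the work above is already done.
        grouped.keySet().stream().limit(n).collect(Collectors.toList());
        return visited.get();
    }

    // Filter first, then limit: the stream stops pulling starts once n valid ones are found.
    static int visitedWhenFilteringFirst(int n) {
        AtomicInteger visited = new AtomicInteger();
        starts().entrySet().stream()
                .peek(e -> visited.incrementAndGet())
                .filter(e -> e.getValue())
                .limit(n)
                .collect(Collectors.toList());
        return visited.get();
    }
}
```

Asking for a single neighbourhood visits all four starts in the grouping variant but only foo1 and foo2 in the filtering variant, which matches what we observe with group() followed by limit().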
I think this might be a trivial question, but I can try to provide some sample data to work with if needed. Thanks all!
34 replies