SideEffect a variable, Use it later after BarrierStep?
I seek a query that builds a list and then needs to both sum the list's mapped values and divide the resulting sum by the count of the original list.
This would be the
mean()
step - if the mapped list was still a Gremlin traversal object that offered mean()
.
However, the mapped list is, by that time, a Groovy list and mean()
is no longer available.
A groovy list can be averaged by reducing it to it sum and then dividing by the count of the list - but to get the count of the list, a separate reference to the list is needed or the count has to have been cached before.
Therefore, is there a way to begin a query, sideEffect a value into a Groovy variable, complete the query, pass through a barrier step, and then divide the saved count?
Solution:Jump to solution
It can be done with all Gremlin steps in 3.7.1 where you have date functions available. Assuming:
```groovy
g.addV().as('a').
addE('link').from('a').to('a').property('createdDate', '2023-01-01T00:00:00Z').
addE('link').from('a').to('a').property('createdDate', '2023-01-01T00:30:00Z')....
SlideShare
Extending Gremlin with Foundational Steps
Extending Gremlin with Foundational Steps - Download as a PDF or view online for free
16 Replies
If there is a way to have Gremlin traverse a collection and to operate on adjacent pairs (to get the delta between creations) without having to fall into Groovy, then I could stay in Gremlin and use
mean()
at the end of that process, directly. I think I could "slide" a sideEffect "cursor" as the traversal gets collected to perform the map()
in-line - but it wasn't as pretty as Groovy collate.Solution
It can be done with all Gremlin steps in 3.7.1 where you have date functions available. Assuming:
You can use the "partition" pattern i came up with for a talk I have a long time ago about DSLs. It effectively does the same as a Groovy
collate()
. Combining that with the neat new date functions I think this gives you the ability to do the mean()
you want at the end:
Here's a link to the slides for the talk called "Extending Gremlin with Foundational Steps" in case you're interested: https://www.slideshare.net/StephenMallette/extending-gremlin-with-foundational-stepsSlideShare
Extending Gremlin with Foundational Steps
Extending Gremlin with Foundational Steps - Download as a PDF or view online for free
Whether or not it works, I love the opportunity to study the pattern. To try to understand fold.emit.until(__.not(unfold()).repeat(skip(local,2)
Also, I have to work with older Tinkerpop. The date methods would be nice.
nice to see you appreciate some of the finer points of Gremlin :gremlin: 🙂
I do have this sense that "local" is like some magical token that can be dropped into more formal settings to solve challenges that would otherwise take substantially more method calls. I might have chosen "doIntendedVsDocumented" as the string for that keyword. I have to go study your query to really grok it.
well,
local
does spare you an unfold()
at minimum which is great when you want to preserve a list structure. my example doesn't preserve it but hopefully you get the idea:
also, we've overloaded the word "local" a bit more than we should have perhaps. don't confuse
local()
step with Scope.local
. it's worth noting that local()
step has some interesting properties:
https://stephen.genoprime.com/snippet/2020/04/25/snippet-10.html
and
https://stephen.genoprime.com/snippet/2023/04/13/snippet-15.htmlstephen mallette
Use of local()
One of the more misunderstood, misused or simply unknown parts of the Gremlin language is local()-step. Its purpose is to execute a child traversal on a single element within the stream. In the following example limit(2) is applied to the stream in a global fashion and thus only two results are returned:
stephen mallette
Use of local() Revisited
In recent weeks there have been many questions about the local()-step both in Discord and in TinkerPop’s JIRA. I’d written on this topic once before in Use of local() but it perhaps begs a further revisit. The example that arose from Discord helped inspire this post as it was sufficiently primitive to hopefully clarify usage.
Ok, it's a functional beauty.
// fetch the sequence of date values and sort them chronologically from earliest to most recent
g.E().values('createdDate').asDate().order().by( Order.asc )
// wrap the sequence as a single collection
.fold().
// generate a sequence of shortening collections, each collection will be two elements shorter than the prior list
emit(). // each iteration, yield the contextual collection
until(__.not(unfold())). // just keep going until we run out into the empty set
repeat(skip(local,2)). // each iteration, make sure to chop off the head and the penultimate head
// remove the last collection that is the empty set
filter(unfold()).
// now, for each collection in the sequence of collections
// fetch the head and penultimate head, memoize the first as dts and step forward one
limit(local,2).as('dts').limit(local,1).
// given the current date entry, diff it from the prior one excluding the one after it
dateDiff(select('dts').skip(local,1).unfold()).
// we now have a sequence of deltas, compute their mean
mean()
I'm intrigued. When this plays out in real time, do we actually see it resolve down to finding a set of dates and then picking from that one set, a pair at a time, to contribute to an accumulating sum? Or, no, we would see it making numerous copies of large (but shrinking) collections over and over only to shed most of the list to obtain the leading pair?
I think this better fits the second scenario you describe. You make one large list after the first
fold()
and then you have the semantics of a do/while with the repeat()
where that big list is being winnowed down 2 items at a time, so that when you do limit(local, 2)
you are only grabbing the leading pair. not spectacularly efficient, but it was nice for my demonstration in my talk and for cases where you don't have a massive list to partition it could work reasonably well. As I was trying to answer your question I tried to see if It could be done without the initial fold()
because I think I did that purposefully for the context of my talk all those years ago. In your case, you don't really need it. I spent some more time thinking on it this morning and can't really come up with a way. I think that in Gremlin collate()
(or as i called it in my talk partition()
step), I feel like we would probably call it take()
that would let you do:
It feels like limit()
but that's a filter step so when you limit(2)
it just chops out the rest of the stream. take()
would let you process the whole stream in chunks. @Kelvin Lawrence do you have any tricky ways to do a "take()" as shown above beyond my earlier pattern that I described with fold()...repeat()
?
i suppose a Scope.local
version of take()
would probably exist too, though that sounds weird. partition()
or collate()
seems to make more sense in a List
context. i guess you could "partition" or "collate" a stream too. eh...i dunno: https://issues.apache.org/jira/browse/TINKERPOP-3057Stephen, I think this is not right. And it is my fault for how I proposed the setup. I want to compute the difference in time between each date in the set from its neighbor - and not compute the differences between pairs that form a set of pairs. For example, with A..B..C..D, I want 3 differences, not 2. That is, B-A, C-B, and D-C. Which means, btw, I was pretty dim. The average difference in a long partially ordered sequence like that is just (max - min)/num.
Here's that query:
I'm sure that the projection and the reduction could be more Gremlin-idiomatic - but it is hard to diagnose why my "by" clauses are not tolerated by the engine. Getting the right class of object to those is not always intuitive. I can see that I resort to "aw heck, give me the collection as a Groovy/Java list and I can revert to that language to do what I want".
Related, yes, Scala has a take(n) concept.
BTW, does or could Tinkerpop map() come to accept by() clauses such that we could feed multiple arguments into the map lambda?
For pseudo-example:
something().map{ x,y,z -> x+y+z }.by( method-to-get-x ).by( method-to-get-y ).by( method-to-get-z )
The idea is very similar to how "math()" is implemented. The difference is that "math" is opinionated about the transformation. It is going to take parameterized inputs and try to construct a valid mathematical expression and then compute the result. Map-By-By would allow the users to construct their own parameterized transformations. Without the anonymous applied lamba trick that I use (see above where I subtract and then divide given an incoming member instance).it's not quite the same thing. the number of arguments to the closure need to be understood at a compile time. not at run time.
besides i think that would violate the intention of
map()
which is a 1:1 transformationmy intent is still 1-to-1 (many to 1 would be a reduce operation). In the example query above, it only yields 1 result - like a reduction - because the stream feeding into that map has been constructed to yield only 1 member.
A 1:1 map with a multiple-argument closure is very useful - and is often easier to analyze than is a single-argument closure that happens to include external state via variable reference where the variables are conveniently available in the closure's context.
The number of arguments would be explicit - just as they are for "math". The author must provide a by clause for each parameter - or it must be very intuitive how Gremlin would dynamically resolve "implicit" parameters from arbitrary queries. (Scala loves these nasty implicits.)
maybe i'm not following what you're saying exactly, but at least in Java, the
map()
step signature for a closure is map​(Function<Traverser<E>,​E2> function)
. To do two arguments you would also define some sort of BiFunction
signature. For three arguments a TriFunction
and so on. I don't see how n number of by()
clauses as you describe can work with this model. map()
is meant to apply a map function to the current Traverser
object, so I'm not even sure what the second, third, etc argument would be.
fwiw, TinkerPop doesn't really promote the use of closures/lambdas. They don't work for every programming language or for every graph database. They also bring performance and security problems for certain cases that relate to Gremlin Server. Given those drawbacks, I don't think we'd be looking to expand their capabilities all that much.I agree with you on how it might look for Java. Because Gremlin's math step is very similar to what I intend, I ask how multi-variate mathematical steps with by clauses are done in Gremlin. Why is it not apparent how convenient it would be to perform an arbitrary transformation to each element of a sequence where the arguments of the applied function include not only the current element but one or more additional arguments which are related to the current element?
I appreciate the Beauty of a language but I am not adverse to be practical and to use language features when they exist. Gremlin supports Groovy lambdas and there are times when they get the job done. Sometimes, getting an answer at all is better than not being able to get an answer because it wasn't optimized for performance. I would argue against removing lambda support from Gremlin, simply because performance stinks at times.
the secondary arguments would all be the result of traversals which begin at the traversal arriving at the map-by-by step, where each secondary traversal is determined by the corresponding by clause. One could, of course, provide constant() or cap() values.
I ask how multi-variate mathematical steps with by clauses are done in Gremlinno
by()
operations in Gremlin produce a syntax where there the number of actual arguments to the closure change. math()
is the same as map()
- it is effectively taking the current Traverser<E>
and returning E2
just like map()
. The MathStep.java
implementation even extends MapStep.java
to implement what it does. The by()
steps merely apply transformations to the Traverser
argument to produce different "variables". These variables for math()
are then evaluated through a math expression library to produce the output so you can dynamically determine the number of arguments at run time. I don't see how it is possible to do that for map()
in the same way because it requires a compile time closure definition for an argument.
i suppose you could add some builtin versions like map(BiFunction)
, map(TriFunction)
, etc. to some top number of arguments, but that would limit the number of by()
you could have and isn't the same as what you've been talking about with math()
from my understanding. Personally, I don't favor those overloads. If the developer has chosen to use a closure and map()
because Gremlin couldn't do quite what they wanted, then they have already chosen a non-Gremlin way to handle a task. With the entire JVM available to them for processing their data in that closure, I don't feel like they need any extra help from Gremlin at that point to use that closure.
I would argue against removing lambda support from Gremlini wouldn't be in favor of removing them completely, just from languages that don't support them natively, like Python, Go, Javascript and C#. I'd also think it smart to remove them as a default for remote scripting in Gremlin Server.
This is precisely what I understand:
The MathStep.java implementation even extends MapStep.java to implement what it does. The by() steps merely apply transformations to the Traverser argument to produce different "variables".
What I ask is, how about making that generally available to MapStep itself? Let the author of the query decide whether or not that is business useful.
It is true that the unpacking of the Traversal, the "it", can be done instead the lambda/closure - but sometimes it is more convenient to use the Gremlin language and method to do that, immediately before entering the Groovy/Java context.