Are the developers of TinkerPop interested in the Performance difference on the equivalent queries?
Hi all! I was recently working on generating equivalent Gremlin queries to test TinkerGraph and then found that some of the equivalent queries had large runtime differences.
For example,
The two queries to select nodes with the red color are equivalent; however, their running times can vary significantly. For instance, if the number of red nodes is much smaller in the first query compared to the second one, the first query may produce a large number of intermediate results while the second query avoids this.
I noticed that TinkerGraph optimizes the traversal strategy at match time, which suggests that the queries in this simple example might be optimized. However, during my actual testing, I encountered some triggering test cases. Although I believe this behavior may be expected in Gremlin, I would like to reach out to the developers to inquire if they are interested in optimizing these queries further. In that case, I can share the test cases I found (which may not be as straightforward as this example, unfortunately :(, but I will try my best to reduce and simplify them ).
26 Replies
According to my knowledge, the majority of researchers are using the Cypher2Gremlin translator to test TinkerGraph. However, I find it challenging to effectively test the unique features and capabilities of Gremlin using this approach. Recently, I have been working on developing Gremlin generators and testing algorithms. I plan to share such performance discrepancies and logic bugs discovered by these algorithms, after my further reduction and simplification.
I don't think there is anything unexpected happening here. You can see how the traversal is optimized by adding a
.explain()
step at the end of your traversals. When I took a look, it appeared the same optimizations were being applied to both traversals.
I believe the only factor in your performance differences is that the second query is able to shrink the search space much earlier and faster if there are very few red vertices. If you instead run both of those queries on a graph with few blue vertices but many red ones, I suspect the first query would be the much more performant one.
I don't believe this sort of difference can be optimized in the general case as identifying the optimal version of this traversal requires some existing awareness of which filter will reduce the reduce the search space quicker but perhaps some of the other developers here may have other ideas.@colegreer Many thanks for your explanation, and that makes sense! I agree with you that there may not be some general ways to optimize the difference.
Another question about MATCH: I noticed that some queries of MATCH will return the exception about "unsolvable pattern" for some graphs, and return the result for others graphs. Is this behavior expected? 🥰
For this problem, just a brainstorm: How about generating some equivalent queries and using some algorithm that can be executed rapidly (such as DP on the graph) to estimate the time cost of each query, and then select the optimal query to real execute? It looks like a topic for researchers in Database (SIGMOD/VLDB), but I am just a researcher in SE for testing bugs in GDBMS 😀
Are you looking for ways of rewriting queries to make them much faster in a general use case?
Gremlin does let you replace step(s)s in queries with other step(s) on the fly. An example of where you might do something like this. Say you have the following:
g.V().has("property1", value1).out().out().out().has("property2", value2).path()
You could dynamically consider converting this into:
g.V().has("property1", value1).out().out()
and
g.V().has("property1",value2).in().in()
with a set intersection of the results and path reconstruction.
But the problem here is it's not that simple. If you assume that there is not many matches for value2
and you have an index on value2
this may massively reduce your search area. However the opposite could be the case and you could blow your search area up to be way bigger.
So to consider the cost of a query, you need to consider things like the branch factor for your vertices, the cardinality of properties, which properties are indexed, the cost for different operations, etc.
anyway I am interested in performance increases by rewriting queries and have investigated using branch factors, cardinality, and other thing to rework traversals, but a general or integrated solution to this problem would be coolYeah I agree with Lyndon on this one. Would be super cool to have a general way to do these sorts of optimizations but it requires a lot of awareness of what steps will reduce search space the quickest. I'm sure some heuristics could be developed which are reasonably effective at predicting this but it's not a trivial problem by any means. I do like the idea of generating a ton of equivalent queries and profiling them to try and find relationships between different graph structures and which queries are most performant on them. Feels like a good ML problem to me.
Yeah, I agree with both of you and maybe it relates to some problems in Graph learning 🥰 BTW, how about the other question? Maybe these behaviors are unexpected because I think whether a pattern is solvable in Gremlin does not depend on the graph.
PS: I think it's cool to be able to communicate with you here instead of posting on Jira, because there are so many interesting things in Gremlin to learn in depth. Thank you all for providing this discord platform!🥰
when you write that it fails for some graphs but not others could you clarify what you mean there? like, do you mean some graph implementations but not others? like, it works on Neptune but not on JanusGraph? or did you mean like different graph data?
The different graph data, it fails in the database I create and ok to execute in modern graph.
maybe i still don't fully understand, but i think that's expected. i think you would need to provide a Gremlin Console example with some sample data to demonstrate the exact problem you're having
Thanks! I will provide some examples after the reduction ASAP. I mean that the same query fails with an "unsolvable match" exception in certain graph data, while it is ok to execute in the TinkerGraphModern.
Consider the following Gremlin query in an empty graph, it is OK to execute: However, it will lead to an "unsolvable pattern" exception in the graph Mordern
in that case, yes,
match()
will fail at runtime if the pattern doesn't exist in the data structures that it evaluates because the match algorithm is actually evaluating traverser paths during execution to find matches. for an empty graph match()
never even executes at all because there are no traversers to start the execution.Thanks for your help, but I have a question why in the second case instead of returning an empty result it returns such an exception? It seems a bit confusing to the user because the pattern itself is not unsolvable, but there is no result in the database that matches the condition.
Gremlin has never been consistent when it comes to returning an empty result versus throwing exceptions. Since 3.5.2/3.6.0 we've made a lot of improvements in that area to get better consistency, with the preference to an empty result rather than exceptions. There is still more work to do in this area, but newer versions throw far less runtime xceptions than it's predecessors. One big example of how this has changed is in the upgrade documentation here: https://tinkerpop.apache.org/docs/current/upgrade/#_consistent_by_behavior - Ultimately,
match()
step is not used terribly often in my experience, so perhaps that's why we haven't gotten around to making it work more consistently. I think match()
needs an overhaul in syntax and usage and i think there is opportunity to create an even better declarative approach for Gremlin and open up its usage for integration with other declarative graph query languages. maybe we'll see that for 3.8.0.Thank you very much for your thorough explanation! I agree with you that many statements with "match" can be better replaced by other statements in Gremlin, and I use these test cases only because they are part of my testing algorithm.
Regarding the example of returning an exception, I'd like to gather your perspective on whether it should be considered a bug or an area for potential improvement in the future. While I truly enjoy our discussion on Discord, I had to post some questions on Jira before for the purpose of paper publication (paper should quote these Jira links). This will enable us to open-source our testing tools at the earliest opportunity. We kindly request your permission to post this example on Jira as a point for further improvement, which would be highly appreciated 🥰 .
we have not classified "exception vs filter" issues as "bugs" since the behavior isn't something explicitly broken. it's more of an area for improvement in my view at least. if you wanted to create a JIRA to prefer filtering rather than exception for
match()
that would be fine.Thanks again! I agree with you that it does not lead to incorrect results or crashes. In the research, they are also interested in these unexpected exception issues, and I believe that filtering the exception in such cases will improve the behaviors of the downstream applications (e.g., receiving an exception may make their code harder to handle). Thus, I will create a JIRA for filtering the unexpected exception 🥰 .
PS: it looks like there are no well-defined rules for solvable patterns in Gremlin, looking forward to your new approach in 3.8.0
could you say a bit more about what you mean by "well-defined rules for solvable patterns in Gremlin"? what do you mean by that?
Sure. I've tried going through the documentation and I don't seem to see a definition at the query level of whether a pattern can be solved, i.e., whether the query can be solved in Gremlin without actually executing it.
well, for your specific case, i think you could tell that pattern isn't going to be solved without executing it:
you would just check the labels and see that "C" and "D" are invalid.
match()
requires that you have a single start label, in your case you went with "A". you now have "C" and "D" also trying to provide that a label to that same incoming vertex. as a result it ends as unsolvable. as i write this now, i'm not so sure this is a situation where we want to have it just filters to an empty result. it's simply, not valid so if we just filtered the result you'd have no idea why. sorry, i didn't pay as close attention to the query itself as I should have when i answered earlier.Thanks for your help! So is it right to say that when we write a match, every sub-traversal has to start from an existing label, except for the starting node?
If so, I think in the previous case, we miss an exception in the empty graph, since the pattern is really unsolvable in Gremlin.
yes, i think that's the only one that can take on the value of the incoming traverser without it being predefined. regarding the failing traversal, i don't know what your aim was but you could get it to return results with:
For empty graphs you simply don't get an exception because
V()
returns no vertices and so match()
never executes to throw the exception. I think the validation could probably move to a TraversalStrategy
so that the exception fires on compilation rather than execution and I perhaps see some comments in the code that allude to changing things for it to work that way (but it was never done).This code you provided is OK to execute! I agree with you that moving the validation to compilation will make Gremlin more consistent and help the downstream applications. Could I post a Jira for that improvement of missing exceptions rather than filtering?
that's fine. if the exception could be provided earlier, prior to execution, i think that would be best. i don't know the
match()
code well so there may yet be things that preclude that from happening so i'm not sure if it will be easy to fix or not. i also don't know if any core contributors will make that a big priority to fix. match()
is not used terribly often from what i can personally tell and so given all the other things folks could choose to work on, i'm not sure it's at the top of the list. as i said earlier, i hope the priority discussion for match()
will be more about how to rework it more completely. that said, we'd be happy to try to help provide guidance on how to fix it though if you or the folks you are working with it would like to contribute a pull request. thanks for all this discussion by the way - it was interesting.Thanks! I found some complex cases just now, where such an unsolvable pattern will lead to a non-zero result, and this result is not equal its equivalent query. I will also post the test cases on Jira.
I am not sure whether result is correct since the test cases are complex, but it looks like that missing such exception will lead to a worse result than we thought before.
you and your team members might be ready yet, but perhaps the work your doing for your paper would be an interesting topic for TinkerPop's twitch stream. We do a series called "TinkerPop Wide" which showcases the work from the different parts of the TinkerPop/graph community. I'd be happy to talk with you all about doing one if you'd like to tell the community about it. you can see some examples here: https://www.youtube.com/playlist?list=PLGDjrkYcxDce8wLhdGKh2iGa_mkbK_GFV please let me know if/when you're interested 🙂
Thank you very much for your invitation and for your help in investigating these bugs 🥰 , we will improve and open-source our work ASAP. Once this is done, we will contact you to share our technology on your twitch stream and discuss with you how to make it a better testing tool for developers, similar to what SQLancer did for relational databases.
issue and the complex cases on Jira https://issues.apache.org/jira/projects/TINKERPOP/issues/TINKERPOP-2961