Splitting a query with range()
I have a Gremlin Query that starts simple (one Label), and then branches out to many different paths to collect unrelated informations (aka, I need to follow those paths). I'm considering using range() to break down that query into smaller chunks of, say 1k rows' and avoid processing the whole set of Labels into one. Of course, I'll have to run the query several times, but I expect each run to be faster, better fit in memory. May be I'll escape some fast degradation by keeping the load small enough.
Does that sound like a good idea?
I'm usually concerned such partitioning means that the common part of the query (before the range()) is executed several times, and that limits the speed potential. In the current case, it is merely a hasLabel() + some property collection.
g.V().hasLabel('xxx'). sideEffect().sideEffect().map()
vs
g.V().hasLabel('xxx'). range(x, x + 1000).sideEffect().sideEffect().map()
Solution:Jump to solution
I have often used range() steps to break up queries, it can be a useful technique but does come with several caveats.
The most important piece is that this will only work if your database guarantees that the common part of your query will always produce results in a consistent order. The default implementation of
range(x, x + 1000)
will first iterate and discard the first x
results, then pass the next 1000. If the result ordering changes on each execution, then you will essentially be taking a random sample of 1000 results each time, instead of progressively going batch by batch.
You already mentioned the performance concerns with the common part of the query being executed each time, due to the way this is implemented, this performance penalty is proportional to x
(minimal penalty when x is small as almost no results are skipped, larger penalty with large x as many results need to be processed and skipped). Results will depend greatly on your DB and your data but in general, if the left-hand side of the query is fast and efficient in your DB, and the right-hand side is slow and complex, then this technique works quite well....4 Replies
i'm sorry but i don't remember - what graph database do you use?
Solution
I have often used range() steps to break up queries, it can be a useful technique but does come with several caveats.
The most important piece is that this will only work if your database guarantees that the common part of your query will always produce results in a consistent order. The default implementation of
range(x, x + 1000)
will first iterate and discard the first x
results, then pass the next 1000. If the result ordering changes on each execution, then you will essentially be taking a random sample of 1000 results each time, instead of progressively going batch by batch.
You already mentioned the performance concerns with the common part of the query being executed each time, due to the way this is implemented, this performance penalty is proportional to x
(minimal penalty when x is small as almost no results are skipped, larger penalty with large x as many results need to be processed and skipped). Results will depend greatly on your DB and your data but in general, if the left-hand side of the query is fast and efficient in your DB, and the right-hand side is slow and complex, then this technique works quite well.
I've mostly used such queries in the form of g.V().range(x, x+1000).foo()... and the results have generally been acceptable for my purposes. Assuming that your database is able to efficiently lookup vertices by label and that there are no ordering concerns there, your proposed solution seems reasonable in my opinion. Other alternatives may be to filter based on vertex id's (depends largely on what type/structure your graph uses for id's), or adding some sort of metadata to your graph to help with partitioning.Tinkergraph and tinkergraph with Neo4j plugin.
First, thanks so much for this detailled answer. This is appreciated.
I have spotted those issues, except the first one with order (it hasn't bitten me yet, but it is obviously waiting).
The current implementation with range(x, x + 1000) works well. It keeps the query limited in execution time. I can use configure the size of the batch (here 1000), until it starts using too much resources for too long.
I still need to test it with more constrained systems: I expected that 101000 queries would be faster than 110000 if there is a bottleneck somewhere. So far, one query is faster anyway, and the performances don't degrade fast enough (that is good).
In my opinion using range() like this is the easiest and most flexible method of splitting queries. As long as you are aware of the considerations mentioned above I think it's a good way to go.
TinkerGraph works well with this as it produces results in a guaranteed order (g.V() in TinkerGraph will always return vertices in the order they were added to the graph). I believe the same is true for Neo4j and the Neo4j plugin although my experience there is limited and I have not seen it documented anywhere or properly tested. If you need a robust guarantee of ordering with Neo4j that likely warrants further investigation.