Using dedup with Neptune
I remember once coming across an AWS Neptune optimization guide, though I can't remember where it is now.
It mentions that the .dedup() step is not optimized in Neptune, which hurts performance.
However, I have a scenario where I need deduplication and pagination at the same time.
So the only options I can think of are .dedup() followed by .range(),
or
.groupCount(), then selecting the keys, then range().
But I am not sure whether grouping always maintains the order.
What could be done?
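To make the ordering concern concrete, here is a minimal plain-Python model of why deduplication has to happen before the range when paginating over unique results. It is not Gremlin and does not talk to Neptune; the `dedup` and `page` helpers and the sample `sites` list are hypothetical stand-ins for the dedup() and range() steps.

```python
from itertools import islice

def dedup(items):
    """Yield each item once, in first-seen order (a model of dedup())."""
    seen = set()
    for item in items:
        if item not in seen:
            seen.add(item)
            yield item

def page(items, low, high):
    """Return the items in positions [low, high), like range(low, high)."""
    return list(islice(items, low, high))

# Hypothetical stream of results that contains duplicates.
sites = ["s1", "s2", "s1", "s3", "s2", "s4"]

# Deduplicate first, then paginate: the page holds 3 unique items.
dedup_then_range = page(dedup(sites), 0, 3)

# Paginate first, then deduplicate: the page shrinks to 2 items.
range_then_dedup = list(dedup(page(sites, 0, 3)))

print(dedup_then_range)  # ['s1', 's2', 's3']
print(range_then_dedup)  # ['s1', 's2']
```

In this model, only the dedup-then-range order gives pages of a predictable size, which is why the range has to be applied to the already unique results.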
I'd have to see the entire query to understand what you're attempting to do. Often times, when trying to rewrite queries, we take portions of the query that cannot be fully optimized and see if there are ways to push these portions of the query to the very end (allowing the majority of the query to be computed using native Neptune operators).
You can issue the query through the Neptune Gremlin Explain API, and the explain plan will state which portions are being optimized and which are not. One thing to remember is that nested non-optimized steps cause their parent steps to be non-optimized as well. So if the explain output states that a portion of the query is not being optimized, it may be because a dedup() or a fold() is nested within that portion of the query.
https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-explain.html
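As a rough sketch of what calling the explain endpoint could look like from Python, the snippet below only builds the HTTP request and does not send it. The endpoint name, the default port 8182, and the example query are placeholders (the actual query from this thread is not shown), and it assumes a cluster without IAM authentication, which would otherwise require signed requests.

```python
import json
from urllib.request import Request

def build_explain_request(endpoint, query, port=8182):
    """Build (but do not send) a POST request for Neptune's Gremlin explain endpoint."""
    url = f"https://{endpoint}:{port}/gremlin/explain"
    # The explain endpoint takes the Gremlin query as a string in a JSON body.
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})

# Placeholder endpoint and a hypothetical query shaped like the one discussed here.
req = build_explain_request(
    "your-neptune-endpoint",
    "g.V().in('HasSite').dedup().range(0, 10)",
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` from inside the cluster's VPC should return the explain plan as text, which you can then scan for the steps that were not converted to native Neptune operators.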
At in_("HasSite") I will have duplicate output, as intended by the design. However, in the final output I want the paginated part to be unaffected by duplicates,
so I aim to apply range() to the unique results.
Performance would drop too much if I dedup() before range(), because dedup() is not optimized.
I'm seeing dedup() getting pushed down. It's range() and everything after it that isn't, but that likely wouldn't have much of an effect on performance.
Instead of fold().next(), could you not use toList()?
What benefits does toList() bring? I've been using fold() and next() for three years, which may be why I reach for them. But are there benefits, or is it just the same?
that means i can do ?
So that won't affect performance much, even if in_("HasSite") returns thousands of results?
toList() serializes the result on the client side into a list in whatever runtime you're using. It's effectively doing the same thing, but you're removing the need for the engine to build the list and then return the entire list.
I guess what I'm getting at is that I don't know of a way to make dedup() any more performant in that sort of query with Neptune's current implementation.
As far as pagination goes, have you tried using Neptune's Query Results Cache instead of making multiple range() calls? That would significantly decrease latency for subsequent calls as you paginate across the results: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html
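For illustration, the paginated queries with the results cache enabled might be built like this as query strings for the HTTP endpoint. This is a sketch under assumptions: the helper name `paged_cached_query` and the base traversal are hypothetical, the `Neptune#enableResultCache` hint is taken from the linked documentation, and the cache has to be available and enabled on the instance for the hint to have any effect.

```python
def paged_cached_query(base, page, page_size):
    """Build a Gremlin query string for one page, with the results-cache hint set."""
    low = page * page_size
    high = low + page_size
    # Same base traversal on every call; only the range() bounds change.
    return (
        "g.with('Neptune#enableResultCache', true)."
        f"{base}.range({low}, {high})"
    )

# Hypothetical base traversal shaped like the one discussed in this thread.
base = "V().in('HasSite').dedup()"

first_page = paged_cached_query(base, 0, 10)
second_page = paged_cached_query(base, 1, 10)
print(first_page)
print(second_page)
```

As I understand the linked documentation, once the full result of the traversal without range() has been cached, later queries that differ only in their trailing range() can be served from that cached result, so the dedup() cost would be paid once rather than on every page.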
Thanks, I will check it out.