Using dedup with Neptune

I remember once coming across an AWS Neptune optimization guide, though I can't find it now. It mentioned that the .dedup() step is not optimized in Neptune, which hurts performance. However, I have a scenario where I need deduplication and pagination at the same time. The only approaches I can think of are .dedup() followed by .range(), or .groupCount(), then selecting the keys, then range(). But I am not sure whether grouping always preserves order. What can be done?
triggan (9mo ago)
I'd have to see the entire query to understand what you're attempting to do. Often, when rewriting queries, we take the portions that cannot be fully optimized and look for ways to push them to the very end, so that the majority of the query is computed using native Neptune operators. You can issue the query through the Neptune Gremlin Explain API, and the explain plan will state which portions are being optimized and which are not. One thing to remember is that nested non-optimized steps cause their parent steps to be non-optimized as well. So if the explain output says that a portion of the query is not being optimized, it may be because a dedup() or a fold() is nested within that portion. https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-explain.html
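For anyone who wants to try the explain call from Python, here is a minimal sketch of building the HTTP request. It assumes the endpoint path /gremlin/explain and the default port 8182 described on the linked docs page; the function name and the host in the usage note are placeholders of mine, and IAM (SigV4) signing is not handled.

```python
import json
from urllib.request import Request


def build_explain_request(endpoint: str, gremlin: str, port: int = 8182) -> Request:
    """Build (but do not send) a POST request for Neptune's Gremlin explain endpoint.

    `endpoint` is a placeholder for your cluster host name. The query is sent
    as a Gremlin string in a JSON body under the "gremlin" key.
    """
    url = f"https://{endpoint}:{port}/gremlin/explain"
    body = json.dumps({"gremlin": gremlin}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to urllib.request.urlopen(...) from a machine that can reach the cluster, for example build_explain_request("your-neptune-endpoint", "g.V().hasLabel('Word').limit(1)"). The response is the plain-text plan, with a "not converted into Neptune steps" line for anything that is not optimized.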
M. alhaddad (OP, 9mo ago)
comp_ids = (
    g.V()
    .hasLabel("Word")
    .has("term", "Neptune#fts {}*".format(term))
    .in_("Has")
    .in_("HasSite")
    .range_(offset, offset + limit)
    .values("company_id")
    .fold()
    .next()
)
At in_("HasSite") I get duplicate results, which is intended by the design. For the final output, however, I want pagination to be unaffected by duplicates, so I need to apply range() to the unique results. I'm worried that performance would drop too much if I call dedup() before range(), because dedup() is not optimized.
triggan (9mo ago)
I'm seeing dedup() getting pushed down. It's range() and everything after it that isn't, but that likely wouldn't have much of an effect on performance:
Optimized Traversal
===================
Neptune steps:
[
NeptuneGraphQueryStep(Vertex) {
JoinGroupNode {
PatternNode[VL(?1, <~label>, ?2=<Word>, <~>) . project ?1 .], {estimatedCardinality=0}
PatternNode[VP(?1, <term>, "someterm", <~>) . project ask .], {estimatedCardinality=0}
PatternNode[EL(?3, ?5=<Has>, ?1, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=0}
PatternNode[EL(?7, ?9=<HasSite>, ?3, ?10) . project ?3,?7 . IsEdgeIdFilter(?10) .], {estimatedCardinality=0}
}, finishers=[dedup(?7)], {path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep, Vertex(?7):VertexStep], maxVarId=11}
},
NeptuneTraverserConverterStep
]
+ not converted into Neptune steps: RangeGlobalStep(2,5),NeptunePropertiesStep([company_id],value),
Neptune steps:
[
NeptuneMemoryTrackerStep
]
+ not converted into Neptune steps: FoldStep,

WARNING: >> [RangeGlobalStep(2,5), FoldStep] << (or one of the children for each step) is not supported natively yet
Instead of fold().next(), could you not use toList()?
M. alhaddad (OP, 9mo ago)
What benefits does toList() bring? I've been using fold() and next() for three years, which is probably why I use them here. Are there benefits, or is it just the same? And does that mean I can do the following?
.in_("HasSite")
.dedup()
.range_(offset, offset + limit)
.values("company_id")
So that won't affect performance much, even if in_("HasSite") returns thousands of results?
triggan (9mo ago)
toList() builds the list on the client side, in whatever runtime you're using. It's effectively doing the same thing, but you remove the need for the engine to build the list and then return the entire list.
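Put together, the query from this thread with dedup() before range_() and toList() in place of fold().next() might look like the sketch below. It assumes `g` is a gremlin_python traversal source that is already connected to the cluster; the function name is only for illustration.

```python
from typing import Any, List


def fetch_company_ids(g: Any, term: str, offset: int, limit: int) -> List[Any]:
    """Return one page of company_id values taken from de-duplicated vertices.

    `g` is assumed to be an existing gremlin_python GraphTraversalSource.
    dedup() runs before range_() so the page boundaries apply to unique
    vertices, and toList() lets the client collect the results instead of
    asking the engine to fold them into a single list.
    """
    return (
        g.V()
        .hasLabel("Word")
        .has("term", "Neptune#fts {}*".format(term))
        .in_("Has")
        .in_("HasSite")
        .dedup()
        .range_(offset, offset + limit)
        .values("company_id")
        .toList()
    )
```

Running the same traversal through explain is a cheap way to confirm that dedup() still shows up as a finisher inside the NeptuneGraphQueryStep for your data.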
Solution
triggan (9mo ago)
I guess what I'm getting at is that I don't know of a way to make dedup() any more performant in that sort of query with Neptune's current implementation.
As far as pagination goes, have you tried using Neptune's Query Results Cache instead of making multiple range() calls? That would significantly decrease latency for subsequent calls as you paginate across the results: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html
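As a rough sketch of what that could look like, the helper below builds a Gremlin query string that carries the Neptune#enableResultCache hint from the linked docs page. Several things here are assumptions of mine: the function name, sending the query as a string rather than bytecode, and moving values() ahead of range() so that range() is the final step, which mirrors the pagination example in the docs. The results cache also has to be enabled on the instance, which this code does not do.

```python
def build_cached_page_query(term: str, offset: int, limit: int) -> str:
    """Build a Gremlin string for one page of results with result caching on.

    The only thing that changes between pages is the trailing range(), so,
    as I read the docs, later pages can be served from the cached result of
    the same query without the range.
    """
    # Escape backslashes and single quotes so the term stays inside the literal.
    safe_term = term.replace("\\", "\\\\").replace("'", "\\'")
    start = int(offset)
    end = start + int(limit)
    return (
        "g.with('Neptune#enableResultCache', true)"
        ".V().hasLabel('Word')"
        f".has('term', 'Neptune#fts {safe_term}*')"
        ".in('Has').in('HasSite')"
        ".dedup()"
        ".values('company_id')"
        f".range({start}, {end})"
    )
```

For example, build_cached_page_query("acme", 0, 10) and build_cached_page_query("acme", 10, 10) differ only in the final range(), so it is worth checking the timings on your own cluster to see whether the second call is actually answered from the cache.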
M. alhaddad (OP, 9mo ago)
Thanks, I will check it out.