Direct and indirect Edges

This may also be an "it depends" question, but here I have labels A, B and C where entity C is created by entity B on behalf of entity A. At the creation time of C I know the instance of both B and A The cardinality will be roughly say ten Bs for each A (though the Bs can be used by many As and there are potentially millions of Cs, which belong to only one B Sometimes I want to start with all Cs for A and sometimes with all Cs created by a particular B on behalf of A. So my choice seems to be only have edges: A->B->C
Or, to also to have: A->C There will be storage for A->C of course, but given the low cardinality of B, is A->C overkill and I can just find all C by iterating B for A without too much cost.
7 Replies
spmallette
spmallette•2y ago
I think if you have just 10 A->B type edges there shouldn't be much need to have a shortcut edge from A->C. If there are tons of Cs in that equation then you could end up in a situation where you then slow the A->B traversals which would have been really fast without that. i think you mentioned elsewhere that you were using @neptune - i'm not sure if there is additional advice folks might offer specific to that graph database, but let's see if anyone else drops in some more comments.
kelvinl2816
kelvinl2816•2y ago
The nice thing about graph data modeling is that it's pretty easy (and typically recommended) to iterate on the data model as needed to improve query performance. A key question will be how much filtering of the edges will you need to do? Will it be by label or by label and/or one or more properties? At face value starting with A -> B -> C seems resonable but until you start to write queries you may not know for sure what optimizations make sense. The key thing to worry about though is not so much the number of edges, crossing those is fast in Neptune, it's the amount of edge filtering that needs perhaps some more discussion.
Jim Idle
Jim IdleOP•2y ago
OK - that's good reasoning. As I said in another post, the answer is likely in experimentation, which I will do. I was just wondering if there were any rule-of-thumb things to consider, and it seems that there are. Just wondering why millions of A->C edges would slow A->B. Or you mean if I don't explicitly name/label the outgoing edges from A? Being in Taiwan makes back and forth follow ups last a day 😉
spmallette
spmallette•2y ago
I guess my statement depends on the graph to some degree. Generally though, I just meant that with A->B and A->C, A would now have supernode status whereas with just A->B there would be a simple 10 or so edges to reason about when traversing.
ManabuBeach
ManabuBeach•2y ago
Just want to confirm, given: a. g.V("09dc9684-fdb6-4729-887f-d653cd35a121").inE("soft-removed") vs. b. g.V("09dc9684-fdb6-4729-887f-d653cd35a121").inE("primeUser").has("linkKind", "soft-removed") It is a LOT faster with a, right, I hope!
spmallette
spmallette•2y ago
speaking purely from query execution time (not serialization) it depends on the graph. not all will optimize that. TinkerGraph would be fairly identical as all filtering is in-memory. JanusGraph would be faster, but only if you defined an index for "linkKind".
ManabuBeach
ManabuBeach•2y ago
Kind of diverging from the main topic here, but as the indexing topic is brought up, I would assume that in Neptune the performance will get better as certain properties are used often, if I understand the auto index builder (which I think is very nice.)

Did you find this page helpful?