Direct and indirect Edges
This may also be an "it depends" question, but here I have labels
Or, to also to have:
A
, B
and C
where entity C
is created by entity B
on behalf of entity A
.
At the creation time of C
I know the instance of both B
and A
The cardinality will be roughly say ten B
s for each A
(though the B
s can be used by many A
s and there are potentially millions of C
s, which belong to only one B
Sometimes I want to start with all C
s for A
and sometimes with all C
s created by a particular B
on behalf of A
.
So my choice seems to be only have edges:
A
->B
->C
Or, to also to have:
A
->C
There will be storage for A
->C
of course, but given the low cardinality of B
, is A
->C
overkill and I can just find all C
by iterating B
for A
without too much cost.7 Replies
I think if you have just 10 A->B type edges there shouldn't be much need to have a shortcut edge from A->C. If there are tons of Cs in that equation then you could end up in a situation where you then slow the A->B traversals which would have been really fast without that. i think you mentioned elsewhere that you were using @neptune - i'm not sure if there is additional advice folks might offer specific to that graph database, but let's see if anyone else drops in some more comments.
The nice thing about graph data modeling is that it's pretty easy (and typically recommended) to iterate on the data model as needed to improve query performance. A key question will be how much filtering of the edges will you need to do? Will it be by label or by label and/or one or more properties? At face value starting with A -> B -> C seems resonable but until you start to write queries you may not know for sure what optimizations make sense. The key thing to worry about though is not so much the number of edges, crossing those is fast in Neptune, it's the amount of edge filtering that needs perhaps some more discussion.
OK - that's good reasoning. As I said in another post, the answer is likely in experimentation, which I will do. I was just wondering if there were any rule-of-thumb things to consider, and it seems that there are.
Just wondering why millions of A->C edges would slow A->B. Or you mean if I don't explicitly name/label the outgoing edges from A?
Being in Taiwan makes back and forth follow ups last a day 😉
I guess my statement depends on the graph to some degree. Generally though, I just meant that with A->B and A->C, A would now have supernode status whereas with just A->B there would be a simple 10 or so edges to reason about when traversing.
Just want to confirm, given:
a. g.V("09dc9684-fdb6-4729-887f-d653cd35a121").inE("soft-removed")
vs.
b. g.V("09dc9684-fdb6-4729-887f-d653cd35a121").inE("primeUser").has("linkKind", "soft-removed")
It is a LOT faster with a
, right, I hope!speaking purely from query execution time (not serialization) it depends on the graph. not all will optimize that. TinkerGraph would be fairly identical as all filtering is in-memory. JanusGraph would be faster, but only if you defined an index for "linkKind".
Kind of diverging from the main topic here, but as the
indexing
topic is brought up, I would assume that in Neptune
the performance will get better as certain properties are used often, if I understand the auto index builder (which I think is very nice.)