Isolated vertices vs connected vertices with no join benefit

Is there any downside to storing an isolated vertex with references to other nodes? Creating relationships makes the query more complicated than it needs to be, but storing references to other vertices seems like an anti-pattern/smell. The relationship between nodes is defined as follows:
g.addV('batch').property(id,'batch-a').as('a').
addV('batch').property(id,'batch-b')as('b').
addV('reusable').property(id, 123).as('r1').
addV('reusable').property(id, 123).as('r2').
addE('link').from('a').to('r').
addE('link').from('b').to('r2').iterate()
g.addV('connected').property('property", 'abc').as('coonected-v).addE('relates_to_batch').from('connected-a).to('batch-a').addE('related_to_reusable').from('connected-a').to('r1).iterate()
Querying this looks like:
g.V('batch-a').project('batchId', 'reusableId', 'connected').by(T.id).by(__.in('relates_to_batch').out("relates_to_reusable").id()).by(__.in("relates_to_batch").elementMap().fold()).toList()
With an isolated vertex:
g.addV('isoalted').property('property", 'abc').property('batchId', 'batch-a').property('reusableId', 'r2).iterate()
Querying in this way produces the same result:
g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
Please let me know if this doesn't make sense
spmallette (2y ago)
had trouble getting the sample data scripts to run in Gremlin Console. I'm further worried that the sample data isn't really representative of your data structure with all the different edge and vertex labels that are kinda mixed together. could we clean it up a bit before I dig into this, especially since others might be trying to follow along to learn? after fixing syntax errors and step label references i have this:
g = TinkerGraph.open().traversal()
g.addV('batch').property(id,'batch-a').as('a').
addV('batch').property(id,'batch-b').as('b').
addV('reusable').property(id, 123).as('r1').
addV('reusable').property(id, 321).as('r2').
addV('connected').property('property', 'abc').as('connected-a').
addE('relates_to_batch').from('connected-a').to('a').
addE('related_to_reusable').from('connected-a').to('r1').
addE('link').from('a').to('r1').
addE('link').from('b').to('r2').iterate()
is that the structure? if so, is it reasonable for us to have all these diverse vertex/edge labels in there? can they be more simplified? like, is there really a "link" or should those be "relates_to_reusable"? and could you define "connected"? is that just meant to be one of the vertices that is part of a "batch" that isn't one that you would reuse among batches? also, are you using a graph that allows id assignment...if not, maybe we'd do better to use a property key instead of the T.id for vertex identifiers
Legendary (OP, 2y ago)
Yeah sorry! I didn’t actually try out the scripts before posting but was just trying to demonstrate the use case! The data structure is accurate and I think this is probably part of the problem; the “relates_to” connections are so loosely related, which is kind of why I want to remove the edges entirely. The link between “batch” and “reusable” is concrete and has to stay. The importance behind the “connected” vertex is the content. For example, I might have another “connected” vertex that relates to a “batch-c” but still to “r2”, and its “content” property would be entirely different. I need a way to get that “content” for the combination of “batch” and “reusable”. Yes, whatever needs to change is still possible to change. E.g. if property keys make more sense, it’s doable. If reassigning ids makes more sense, that’s also doable 👍🏼
spmallette (2y ago)
ok, i think i see it now. so in full:
gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('batch').property(id,'batch-a').as('a').
......1> addV('batch').property(id,'batch-b').as('b').
......2> addV('reusable').property(id, 123).as('r1').
......3> addV('reusable').property(id, 321).as('r2').
......4> addV('connected').property('property', 'abc').as('connected-a').
......5> addE('relates_to_batch').from('connected-a').to('a').
......6> addE('related_to_reusable').from('connected-a').to('r1').
......7> addE('link').from('a').to('r1').
......8> addE('link').from('b').to('r2').iterate()
gremlin> g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
==>[id:6,label:isolated,reusableId:123,property:abc,batchId:batch-a]
gremlin> g.V('batch-a').
......1> project('batchId', 'reusableId', 'connected').
......2> by(T.id).
......3> by(__.in('relates_to_batch').out("related_to_reusable").id()).
......4> by(__.in("relates_to_batch").elementMap().fold())
==>[batchId:batch-a,reusableId:123,connected:[[id:0,label:connected,property:abc]]]
gremlin> g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
==>[id:6,label:isolated,reusableId:123,property:abc,batchId:batch-a]
hmm - "reusableId" isn't in the one with project(). ok - another little mistype. updated.
Legendary (OP, 2y ago)
Yep looks good!
spmallette (2y ago)
needed another edit for the "reusableId" to match up. in your real model, could a batch have multiple reusables?
Legendary (OP, 2y ago)
Yes. On average there are 5 reusables per batch.
spmallette (2y ago)
in this example the reusable vertices are all directly connected to batch - is there often more hierarchy between them - like, more than one step away?
Legendary (OP, 2y ago)
No, it’s a direct connection between batch and reusable. There are additional edges coming off reusable, but I’m not sure how relevant that is.
spmallette (2y ago)
could you point out where this approach "created weird and complex queries"? our example so far feels fairly graphy to me, but after working with our friend Gremlin for over a decade i have a different definition of "weird and complex" 👽
Legendary (OP, 2y ago)
hahaha I think what seemed "weird" to me:
g.V('batch-a').
  project('batchId', 'reusableId', 'connected').
  by(T.id).
  by(__.in('relates_to_batch').out("related_to_reusable").id()).
  by(__.in("relates_to_batch").elementMap().fold())
Having to traverse in and out to get the data I was after. I was wondering about the implications of just storing that data as property values. Especially since the links are so loosely related, it seems like there could be a benefit to removing the edges entirely. We've discussed a lot here, and a lot of it has probably been lost in translation haha. Also something like:
g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
Is a lot more straightforward than the above, imo. I'm also not sure about the performance implications of each of these, and I'm interested to know if the latter is considered an anti-pattern or code smell.
Solution
spmallette (2y ago)
We're basically talking about denormalization here. Denormalization is as common for graphs as it is for relational data structures, and it comes with the same drawbacks. Is this a case for denormalization? Based on this simple example, I'd say "no" because you really just have a single hop or so to collect what you need and you're done. But I also don't know any other statistics about your data structure and other expected query patterns, so it's hard to say with certainty that you shouldn't denormalize. That said, if denormalization is the answer, do you denormalize to a wholly disconnected vertex? I still don't think I'd recommend that based on what I know. Your most painful traversal is a two hop of __.in('relates_to_batch').out("related_to_reusable") to get an id or perhaps multiple ids. Consider adding a property of "reusableIds" to "batch-a" and storing a List of the ids there (or use multi-properties, https://tinkerpop.apache.org/docs/current/reference/#vertex-properties). That seems like the most natural model to me since "isolated" is really just a "batch" with properties containing various things it's connected to. Seems better to me to not introduce an "isolated" concept for that and to denormalize to a thing that is actually part of your graph and connected.
g.V().hasLabel('batch').has('batchId', 'batch-a').elementMap().toList()
I'm still hesitant to say denormalize at all though. I suppose you can do it for ease of querying, but it's more often done for performance reasons. I also could be missing more context with this advice, but I think I'll stick with what I've posted here as an answer.
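As a rough illustration of the "reusableIds" suggestion above, here is a minimal Gremlin Console sketch using multi-properties on a TinkerGraph. The "batchId" and "reusableIds" property keys and the sample values are assumptions made for illustration; whether list cardinality is available depends on the graph provider in use.

```groovy
// Gremlin Console sketch on TinkerGraph; keys and values are illustrative assumptions
g = TinkerGraph.open().traversal()

// store each reusable id as one value of a list-cardinality multi-property
g.addV('batch').property('batchId', 'batch-a').
  property(list, 'reusableIds', 123).
  property(list, 'reusableIds', 321).iterate()

// valueMap() returns all values of a multi-property;
// elementMap() assumes single cardinality and would surface only one value
g.V().has('batch', 'batchId', 'batch-a').valueMap('batchId', 'reusableIds')
```

Note that the lookup uses valueMap() rather than elementMap() because of the multi-property; with a plain single-valued property either step would do.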
Legendary (OP, 2y ago)
The issue with adding reusableIds to batch-a is that we are only really concerned with the ids that are connected via both relates_to_batch and related_to_reusable through that connected vertex. For example, we might have a connected vertex that relates_to_batch via batch-a but is related_to_reusable via r2 -- these ids can't be contained within the same batch, since they're specific to the combination of batch and reusable. So I'm not too sure how it would work by adding the list to batch-a. I hope this clarifies the questions you had about denormalization. Not sure if you wanna discuss this further, but happy to move on for now 🙂
spmallette (2y ago)
without following that example too closely (and maybe i have to in order to properly answer you), i guess my question is what's the difference between you doing:
g.addV('isolated').property('property', 'abc').
property('batchId', 'batch-a').property('reusableId', '123')
and
g.addV('batch').property('property', 'abc').
property('batchId', 'batch-a').property('reusableId', '123').
addE().... // connected to all the other graph structure
Do you not get the same ease of querying using the second approach? I just don't see what "isolated" is doing for you that is special. it looks like a way to quickly get information about a "batch" but in some separate structure.
Legendary (OP, 2y ago)
Those examples are practically the same. I'm also not tied to either approach, I think I'm just trying to understand what is the most "correct" way of doing things. I guess this kind of goes back to my original question about storing vertex references in another vertex? I think the term batch here makes this a bit confusing cos we are already using batch for batch-a. Is this meant to symbolize the same object? I think it's supposed to be connected if we wanted to tie this closer to the example. Curious to know what the edge would connect to? Is it the batch that contains batch-a? Sorry if we are going in circles!
spmallette (2y ago)
yeah, i think we have a few circles. it's ok though... what confused me was your saying that "I'm not too sure how it would work by adding the list to batch-a", so my reply was to clarify that if you understand how to do the "isolated" way that you are proposing, then i'm just saying put it on your "batch", which is connected to everything else, instead of using an isolated vertex. this is just a mechanism for denormalization and i'd say an "isolated" vertex isn't a pattern to follow in this case. denormalize within your graph structure to preserve the main part of why you chose a graph in the first place - for its connections
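One way to read "denormalize within your graph structure" is sketched below, under the assumption that the copied ids are placed on the "connected" vertex (the one carrying the content) while its edges stay in place. The "content", "batchId" and "reusableId" property keys are illustrative and not part of the original model.

```groovy
// Gremlin Console sketch on TinkerGraph; property keys are illustrative assumptions
g = TinkerGraph.open().traversal()

g.addV('batch').property(id, 'batch-a').as('a').
  addV('reusable').property(id, 123).as('r1').
  addV('connected').property('content', 'abc').
    property('batchId', 'batch-a').      // denormalized copy of the batch id
    property('reusableId', 123).as('c'). // denormalized copy of the reusable id
  addE('relates_to_batch').from('c').to('a').
  addE('related_to_reusable').from('c').to('r1').
  addE('link').from('a').to('r1').iterate()

// flat lookup, same shape as the "isolated" query
g.V().has('connected', 'batchId', 'batch-a').elementMap()

// the vertex is still part of the graph, so traversing onward remains possible
g.V().has('connected', 'batchId', 'batch-a').out('related_to_reusable').elementMap()
```

The usual denormalization drawback applies: the copied ids have to be kept in sync with the edges whenever either changes.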
Legendary (OP, 2y ago)
Right okay, I think I get you now. I think it's probably important to note that there is information in that external vertex that is useful to the use case other than the batchId and reusableId -- content was just one example, so I think it makes sense for it to be its own vertex. Based on what you're saying, I think it probably makes sense just to stick with the relates_to_batch and relates_to_reusable example we spoke about earlier.
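For the edge-based model, fetching the content for one specific batch/reusable combination could look roughly like the sketch below. The "content" key and the second "connected" vertex are assumptions added for illustration, and the edge labels follow the corrected sample script from earlier in the thread.

```groovy
// Gremlin Console sketch on TinkerGraph; 'content' values and vertex c2 are illustrative
g = TinkerGraph.open().traversal()

g.addV('batch').property(id, 'batch-a').as('a').
  addV('reusable').property(id, 123).as('r1').
  addV('reusable').property(id, 321).as('r2').
  addV('connected').property('content', 'abc').as('c1').
  addV('connected').property('content', 'xyz').as('c2').
  addE('relates_to_batch').from('c1').to('a').
  addE('related_to_reusable').from('c1').to('r1').
  addE('relates_to_batch').from('c2').to('a').
  addE('related_to_reusable').from('c2').to('r2').iterate()

// start at the batch, keep only the connected vertices that also point at reusable 123
g.V('batch-a').in('relates_to_batch').
  where(out('related_to_reusable').hasId(123)).
  values('content')
```

The where() step filters on the second edge without leaving the "connected" vertex, so the result stays scoped to the pair rather than to the batch as a whole.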
spmallette (2y ago)
ok, well feel free to ask more questions if you get stuck or to just drop a note in #open-sharing as you continue with your work to let us know how you're doing. i'll probably mark this question as answered with my point about denormalization. that's probably the most general answer here for folks to learn from. thanks for the conversation :gremlin_smile: (as i use that as excuse to try out my new emoji) 🙂
Legendary (OP, 2y ago)
Thanks so much! I really appreciate your time and patience 🙂