Apache TinkerPop•3y ago

Isolated vertices vs connected vertices with no join benefit

Is there any downside to storing an isolated vertex with references to other nodes? Creating relationships makes the query more complicated than it needs to be, but storing references to other vertices seems like an anti-pattern/smell. The relationship between nodes is defined as follows:

g.addV('batch').property(id,'batch-a').as('a').
  addV('batch').property(id,'batch-b')as('b').
  addV('reusable').property(id, 123).as('r1').
  addV('reusable').property(id, 123).as('r2').
  addE('link').from('a').to('r').
  addE('link').from('b').to('r2').iterate()

g.addV('connected').property('property", 'abc').as('coonected-v).addE('relates_to_batch').from('connected-a).to('batch-a').addE('related_to_reusable').from('connected-a').to('r1).iterate()

Querying this looks like:

g.V('batch-a').project('batchId', 'reusableId', 'connected').by(T.id).by(__.in('relates_to_batch').out("relates_to_reusable").id()).by(__.in("relates_to_batch").elementMap().fold()).toList()

With an isolated vertex:

g.addV('isoalted').property('property", 'abc').property('batchId', 'batch-a').property('reusableId', 'r2).iterate()

Querying in this way produces the same result:

g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()

Please let me know if this doesn't make sense

Solution:

We're basically talking about denormalization here. Denormalization is as common for graphs as it is for relational data structures, and it comes with the same drawbacks. Is this a case for denormalization? Based on this simple example, I'd say "no" because you really just have a single hop or so to collect what you need and you're done. But I also don't know any other statistics about your data structure or other expected query patterns, so it's hard to say with certainty that you shouldn't denormalize. That said, if denormalization is the answer, do you denormalize to a wholly disconnected vertex? I still don't think I'd recommend that based on what I know. Your most painful traversal is a two-hop of __.in('relates_to_batch').out("related_to_reusable") to get an id or perhaps multiple ids. Consider adding a property of "reusableIds" to "batch-a" and storing a List of the ids there (or use multi-properties, https://tinkerpop.apache.org/docs/current/reference/#vertex-properties). That seems like the most natural model to me since "isolated" is really just a "batch" with properties containing various things it's connected to. It seems better to me not to introduce an "isolated" concept for that, and to denormalize to a thing that is actually part of your graph and connected. ...
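A minimal sketch of the "reusableIds" suggestion above, assuming a TinkerGraph in Gremlin Console (where `id` and `list` are available as static imports). The second `link` edge from batch-a and the copied id values are illustrative assumptions, not part of the original sample data, and not every graph provider supports list cardinality or user-assigned ids.

```groovy
// Assumption: in-memory TinkerGraph, which allows user-assigned ids
g = TinkerGraph.open().traversal()
g.addV('batch').property(id, 'batch-a').as('a').
  addV('reusable').property(id, 123).as('r1').
  addV('reusable').property(id, 321).as('r2').
  addE('link').from('a').to('r1').
  addE('link').from('a').to('r2').iterate()

// Denormalize: copy the ids of the linked reusables onto the batch vertex
// as a list-cardinality (multi-valued) property; the edges stay in place
g.V('batch-a').
  property(list, 'reusableIds', 123).
  property(list, 'reusableIds', 321).iterate()

// Read the ids back without traversing any edges
g.V('batch-a').values('reusableIds').fold()
```

The batch vertex stays connected to the rest of the graph, so the copied ids are only a read shortcut and have to be kept in sync with the edges whenever those change.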

18 Replies

spmallette•3y ago

had trouble getting the sample data scripts to run in Gremlin Console. I'm further worried that the sample data isn't really representative of your data structure with all the different edge and vertex labels that are kinda mixed together. could we clean it up a bit before I dig into this, especially since others might be trying to follow along to learn? after fixing syntax errors and step label references i have this:

g = TinkerGraph.open().traversal()
g.addV('batch').property(id,'batch-a').as('a').
  addV('batch').property(id,'batch-b').as('b').
  addV('reusable').property(id, 123).as('r1').
  addV('reusable').property(id, 321).as('r2').
  addV('connected').property('property', 'abc').as('connected-a').
  addE('relates_to_batch').from('connected-a').to('a').
  addE('related_to_reusable').from('connected-a').to('r1').
  addE('link').from('a').to('r1').
  addE('link').from('b').to('r2').iterate()

is that the structure? if so, is it reasonable for us to have all these diverse vertex/edge labels in there? can they be more simplified? like, is there really a "link" or should those be "relates_to_reusable"? and could you define "connected"? is that just meant to be one of the vertices that is part of a "batch" that isn't one that you would reuse among batches? also, are you using a graph that allows id assignment...if not, maybe we'd do better to use a property key instead of the T.id for vertex identifiers
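For graphs that do not allow id assignment, the property-key alternative mentioned above might look like the following sketch. The key names `batchId` and `reusableId` are borrowed from the question, and using them as identifiers on the batch and reusable vertices is an assumption made for illustration.

```groovy
// Assumption: identifiers live in ordinary properties instead of T.id
g = TinkerGraph.open().traversal()
g.addV('batch').property('batchId', 'batch-a').as('a').
  addV('reusable').property('reusableId', 123).as('r1').
  addE('link').from('a').to('r1').iterate()

// Look the batch up by label and property key, then hop to its reusables
g.V().has('batch', 'batchId', 'batch-a').out('link').values('reusableId')
```

On a real provider a lookup like this would normally rely on an index over the identifier property; how that is configured depends on the graph system.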

LegendaryOP•3y ago

Yeah sorry! I didn’t actually try out the scripts before posting but was just trying to demonstrate the use case! The data structure is accurate and I think this is probably part of the problem; the “relates_to” connections are so loosely related, which is kind of why I want to remove the edges entirely. The link between “batch” and “reusable” is concrete and has to stay. The importance behind the “connected” vertex is the content. For example, I might have another “connected” vertex that relates to a “batch-c” but still “r2”, but the “content” property would be entirely different. I need a way to get that “content” for the combination of “batch” and “reusable”. Yes, whatever needs to change is still possible to change. E.g. if property keys make more sense, it’s doable. If reassigning ids makes more sense, that’s also doable 👍🏼

spmallette•3y ago

ok, i think i see it now. so in full:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('batch').property(id,'batch-a').as('a').
......1>   addV('batch').property(id,'batch-b').as('b').
......2>   addV('reusable').property(id, 123).as('r1').
......3>   addV('reusable').property(id, 321).as('r2').
......4>   addV('connected').property('property', 'abc').as('connected-a').
......5>   addE('relates_to_batch').from('connected-a').to('a').
......6>   addE('related_to_reusable').from('connected-a').to('r1').
......7>   addE('link').from('a').to('r1').
......8>   addE('link').from('b').to('r2').iterate()
gremlin> g.addV('isolated').property('property', 'abc').property('batchId', 'batch-a').property('reusableId', 123).iterate()
gremlin> g.V('batch-a').
......1>   project('batchId', 'reusableId', 'connected').
......2>     by(T.id).
......3>     by(__.in('relates_to_batch').out("related_to_reusable").id()).
......4>     by(__.in("relates_to_batch").elementMap().fold())
==>[batchId:batch-a,reusableId:123,connected:[[id:0,label:connected,property:abc]]]
gremlin> g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()
==>[id:6,label:isolated,reusableId:123,property:abc,batchId:batch-a]

hmm - "reusableId" isn't in the one with project(). ok - another little mistype. updated.

LegendaryOP•3y ago

Yep looks good!

spmallette•3y ago

needed another edit for the "reusableId" to match up. in your real model, could a batch have multiple reusables?

LegendaryOP•3y ago

Yes. On average there are 5 reusables per batch.

spmallette•3y ago

in this example the reusable vertices are all directly connected to batch - is there often more hierarchy between them - like, more than one step away?

LegendaryOP•3y ago

No it’s a direct connection between batch and reusable. There are additional edges coming off reusable but I’m not sure how relevant that is

spmallette•3y ago

could you point out where this approach "created weird and complex queries"? our example so far feels fairly graphy to me, but after working with our friend Gremlin for over a decade i have a different definition of "weird and complex" 👽

LegendaryOP•3y ago

hahaha I think what seemed "weird" to me:

g.V('batch-a').
  project('batchId', 'reusableId', 'connected').
    by(T.id).
    by(__.in('relates_to_batch').out("related_to_reusable").id()).
    by(__.in("relates_to_batch").elementMap().fold())

Having to traverse in and out to get the data I was after. I was wondering about the implications of just storing that data as property values. Especially since the links are so loosely related, it seems like there could be a benefit to removing the edges entirely. We've discussed a lot here, and a lot of it has probably been lost in translation haha. Also, something like:

g.V().hasLabel('isolated').has('batchId', 'batch-a').elementMap().toList()

is a lot more straightforward than the above, imo. I'm also not sure about the performance implications of each of these. I'm also interested to know if the latter is considered an anti-pattern or code smell.

Solution

spmallette•3y ago

g.V().hasLabel('batch').has('batchId', 'batch-a').elementMap().toList()

I'm still hesitant to say denormalize at all though. I suppose you can do it for ease of querying, but it's more often done for performance reasons. I also could be missing more context with this advice, but I think I'll stick with what I've posted here as an answer.

LegendaryOP•3y ago

The issue with adding the reusableIds to batch-a is that we are only really concerned with the ids that are connected via both relates_to_batch and related_to_reusable through that connected vertex. For example, we might have a connected vertex that relates_to_batch via batch-a but is related_to_reusable via r2. These ids can't be contained within the same batch, since they're specific to the combination of batch and reusable, so I'm not too sure how it would work by adding the list to batch-a. I hope this clarifies the questions you had about denormalization. Not sure if you wanna discuss this further, but happy to move on for now 🙂

spmallette•3y ago

without following that example too closely (and maybe i have to in order to properly answer you), i guess my question is what's the difference between you doing:

g.addV('isolated').property('property', 'abc').
  property('batchId', 'batch-a').property('reusableId', '123')

and

g.addV('batch').property('property', 'abc').
  property('batchId', 'batch-a').property('reusableId', '123').
  addE().... // connected to all the other graph structure

Do you not get the same ease of querying using the second approach? I just don't see what "isolated" is doing for you that is special. it looks like a way to quickly get information about a "batch" but in some separate structure.

LegendaryOP•3y ago

Those examples are practically the same. I'm also not tied to either approach, I think I'm just trying to understand what is the most "correct" way of doing things. I guess this kind of goes back to my original question about storing vertex references in another vertex? I think the term batch here makes this a bit confusing cos we are already using batch for batch-a. Is this meant to symbolize the same object? I think it's supposed to be connected if we wanted to tie this closer to the example. Curious to know what the edge would connect to? Is it the batch that contains batch-a? Sorry if we are going in circles!

spmallette•3y ago

yeah, i think we have a few circles. it's ok though... what confused me was your saying that "I'm not too sure how it would work by adding the list to batch-a", so my reply was to clarify that if you understand how to do the "isolated" way that you are proposing, then i'm just saying put it on your "batch", which is connected to everything else, instead of using an isolated vertex. this is just a mechanism for denormalization and i'd say an "isolated" vertex isn't a pattern to follow in this case. denormalize within your graph structure to preserve the main part of why you chose a graph in the first place - for its connections

LegendaryOP•3y ago

Right, okay, I think I get you now. I think it's probably important to note that there is information in that external vertex that is useful to the use case other than the batchId and reusableId (content was just one example), so I think it makes sense for it to be its own vertex. Based on what you're saying, I think it probably makes sense just to stick with the relates_to_batch and related_to_reusable example we spoke about earlier.
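If the edges stay, the content for one specific batch/reusable combination can be read with a filter on the second hop. This is only a sketch on a TinkerGraph in Gremlin Console: the second "connected" vertex, the `content` property key, and its values are hypothetical, added to show two connected vertices sharing a batch but pointing at different reusables.

```groovy
// Assumption: two "connected" vertices on the same batch, different reusables
g = TinkerGraph.open().traversal()
g.addV('batch').property(id, 'batch-a').as('a').
  addV('reusable').property(id, 123).as('r1').
  addV('reusable').property(id, 321).as('r2').
  addV('connected').property('content', 'abc').as('c1').
  addV('connected').property('content', 'xyz').as('c2').
  addE('relates_to_batch').from('c1').to('a').
  addE('related_to_reusable').from('c1').to('r1').
  addE('relates_to_batch').from('c2').to('a').
  addE('related_to_reusable').from('c2').to('r2').iterate()

// Content for the (batch-a, 123) pair: start at the batch, step to its
// connected vertices, keep only those that also point at reusable 123
g.V('batch-a').in('relates_to_batch').
  where(__.out('related_to_reusable').hasId(123)).
  values('content')
```

The `where()` step filters without moving the traverser, so the result is still the "connected" vertex, and any of its other properties can be read in the same traversal.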

spmallette•3y ago

ok, well feel free to ask more questions if you get stuck or to just drop a note in #open-sharing as you continue with your work to let us know how you're doing. i'll probably mark this question as answered with my point about denormalization. that's probably the most general answer here for folks to learn from. thanks for the conversation :gremlin_smile: (as i use that as excuse to try out my new emoji) 🙂

LegendaryOP•3y ago

Thanks so much! I really appreciate your time and patience 🙂
