Isolated vertices vs connected vertices with no join benefit
Is there any downside to storing an isolated vertex with references to other nodes? Creating relationships makes the query more complicated than it needs to be, but storing references to other vertices seems like an anti-pattern/smell.
The relationship between nodes is defined as follows:
Querying this looks like:
With an isolated vertex:
Querying in this way produces the same result:
Please let me know if this doesn't make sense
had trouble getting the sample data scripts to run in Gremlin Console. I'm further worried that the sample data isn't really representative of your data structure with all the different edge and vertex labels that are kinda mixed together. could we clean it up a bit before I dig into this, especially since others might be trying to follow along to learn?
after fixing syntax errors and step label references i have this:
is that the structure? if so, is it reasonable for us to have all these diverse vertex/edge labels in there? can they be more simplified? like, is there really a "link" or should those be "relates_to_reusable"? and could you define "connected"? is that just meant to be one of the vertices that is part of a "batch" that isn't one that you would reuse among batches?
also, are you using a graph that allows id assignment...if not, maybe we'd do better to use a property key instead of the T.id for vertex identifiers
Yeah sorry! I didn’t actually try out the scripts before posting but was just trying to demonstrate the use case! The data structure is accurate and I think this is probably part of the problem; the “relates_to” connections are so loosely related, which is kind of why I want to remove the edges entirely. The link between “batch” and “reusable” is concrete and has to stay. The importance behind the “connected” vertex is the content. For example, I might have another “connected” vertex that relates to “batch-c” but still “r2”, but the “content” property would be entirely different. I need a way to get that “content” for the combination of “batch” and “reusable”. Yes, whatever needs to change is still possible to change. E.g. if property keys make more sense, it’s doable. If reassigning ids makes more sense that’s also doable 👍🏼
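To make the "content for the combination of batch and reusable" requirement concrete, here is a minimal sketch for Gremlin Console. It is only an illustration under assumptions: it uses TinkerGraph, keeps identifiers in a "name" property rather than T.id, and the content strings and the contentFor helper are made up. The edge labels follow the ones used in this thread.

```groovy
// Paste into Gremlin Console. TinkerGraph is an assumption here.
graph = TinkerGraph.open()
g = graph.traversal()

// Hypothetical sample data: two batches share reusable "r2",
// and each (batch, reusable) pair has its own "connected" vertex.
g.addV('batch').property('name', 'batch-a').as('a').
  addV('batch').property('name', 'batch-c').as('c').
  addV('reusable').property('name', 'r2').as('r2').
  addV('connected').property('content', 'content for batch-a + r2').as('x').
  addV('connected').property('content', 'content for batch-c + r2').as('y').
  addE('relates_to_batch').from('x').to('a').
  addE('related_to_reusable').from('x').to('r2').
  addE('relates_to_batch').from('y').to('c').
  addE('related_to_reusable').from('y').to('r2').
  iterate()

// Hypothetical helper: content for one specific (batch, reusable) pair.
// Start at the batch, step in to its connected vertices, and keep only
// those that also point at the requested reusable.
contentFor = { batchName, reusableName ->
  g.V().has('batch', 'name', batchName).in('relates_to_batch').
    where(__.out('related_to_reusable').has('name', reusableName)).
    values('content').toList()
}

contentFor('batch-a', 'r2')
contentFor('batch-c', 'r2')
```

The same reusable yields different content depending on which batch the traversal starts from, which is the behaviour described above.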
ok, i think i see it now. so in full:
hmm - "reusableId" isn't in the one with project()
ok - another little mistype. updated
Yep looks good!
needed another edit for the "reusableId" to match up
in your real model, could a batch have multiple reusables?
Yes
On average there are 5 reusable per batch
in this example the reusable vertices are all directly connected to batch - is there often more hierarchy between them - like, more than one step away?
No it’s a direct connection between batch and reusable. There are additional edges coming off reusable but I’m not sure how relevant that is
could you point out where this approach "created weird and complex queries"? our example so far feels fairly graphy to me, but after working with our friend Gremlin for over a decade i have a different definition of "weird and complex" 👽
hahaha
I think what seemed "weird" to me:
Having to traverse in and out to get the data I was after. I was wondering about the implications of just storing that data as property values
Especially since the links are so loosely related, it seems like there could be benefit to removing the edges entirely
We've discussed a lot here, a lot of it has probably been lost in translation haha
Also something like:
Is a lot more straightforward than the above, imo
Also not sure about the performance implications of each of these
I'm also interested to know if the latter is considered an anti-pattern or code smell
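For anyone following along, here is a rough sketch of what the isolated-vertex approach might look like and one way it can go stale. It is not the original poster's script: TinkerGraph, the "isolated" label, the "name" property and the sample values are all assumptions, while batchId, reusableId and content are the property names mentioned in this thread.

```groovy
// Paste into Gremlin Console. TinkerGraph is an assumption here.
graph = TinkerGraph.open()
g = graph.traversal()

// Hypothetical data: a "connected" vertex wired up with edges, plus an
// "isolated" vertex that stores the same references as plain properties.
g.addV('batch').property('name', 'batch-a').as('a').
  addV('reusable').property('name', 'r2').as('r2').
  addV('connected').property('content', 'some content').as('x').
  addE('relates_to_batch').from('x').to('a').
  addE('related_to_reusable').from('x').to('r2').
  addV('isolated').property('batchId', 'batch-a').
    property('reusableId', 'r2').property('content', 'some content').
  iterate()

// Isolated lookup: a property filter with no traversal steps.
g.V().has('isolated', 'batchId', 'batch-a').has('reusableId', 'r2').
  values('content').toList()

// Remove the batch. The graph drops its incident edges with it,
// but it knows nothing about the ids copied onto the isolated vertex.
g.V().has('batch', 'name', 'batch-a').drop().iterate()

// Edge-based query: the batch is gone, so nothing is found.
viaEdges = g.V().has('batch', 'name', 'batch-a').in('relates_to_batch').
  values('content').toList()

// Isolated query: still answers for a batch that no longer exists.
viaProperties = g.V().has('isolated', 'batchId', 'batch-a').
  values('content').toList()
```

The property lookup is shorter, but the references are just strings as far as the graph is concerned, so keeping them in sync becomes the application's job.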
Solution
We're basically talking about denormalization here. denormalization is common for graphs as it is for relational data structures and it comes with the same drawbacks. Is this a case for denormalization? Based on this simple example, I'd say "no" because you really just have a single hop or so to collect what you need and you're done. But, I also don't know any other statistics about your data structure and other expected query patterns so it's hard to say with certainty that you shouldn't denormalize.
That said, if denormalization is the answer, do you denormalize to a wholly disconnected vertex? i still don't think i'd recommend that based on what i know. Your most painful traversal is a two hop of __.in('relates_to_batch').out("related_to_reusable") to get an id or perhaps multiple ids. Consider adding a property of "reusableIds" to "batch-a" and storing a List of the ids there (or use multi-properties, https://tinkerpop.apache.org/docs/current/reference/#vertex-properties).
That seems like the most natural model to me since "isolated" is really just a "batch" with properties containing various things it's connected to. Seems better to me to not introduce an "isolated" concept for that and denormalize to a thing that is actually part of your graph and connected.
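A minimal sketch of that idea for Gremlin Console, assuming TinkerGraph (which supports multi-properties) and made-up sample data with identifiers kept in a "name" property. It collects the reusable ids with the two-hop traversal once, then copies them onto the batch vertex as a list-cardinality "reusableIds" property:

```groovy
// Paste into Gremlin Console. TinkerGraph is an assumption here.
graph = TinkerGraph.open()
g = graph.traversal()

// Hypothetical sample data: batch-a reaches r1 and r2 through
// two "connected" vertices.
g.addV('batch').property('name', 'batch-a').as('a').
  addV('reusable').property('name', 'r1').as('r1').
  addV('reusable').property('name', 'r2').as('r2').
  addV('connected').property('content', 'c1').as('x').
  addV('connected').property('content', 'c2').as('y').
  addE('relates_to_batch').from('x').to('a').
  addE('related_to_reusable').from('x').to('r1').
  addE('relates_to_batch').from('y').to('a').
  addE('related_to_reusable').from('y').to('r2').
  iterate()

// Normalized read: the two-hop traversal from the batch.
ids = g.V().has('batch', 'name', 'batch-a').in('relates_to_batch').
  out('related_to_reusable').values('name').dedup().toList()

// Denormalize: append each id to a multi-property on the batch itself.
// "list" is the VertexProperty.Cardinality constant the console imports.
ids.each { rid ->
  g.V().has('batch', 'name', 'batch-a').
    property(list, 'reusableIds', rid).iterate()
}

// Denormalized read: no hops, and the batch is still part of the graph.
g.V().has('batch', 'name', 'batch-a').values('reusableIds').toList()
```

As with any denormalization, the copied ids have to be updated whenever the underlying edges change, so this trades write-side bookkeeping for a simpler read.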
I'm still hesitant to say denormalize at all though. I suppose you can do it for ease of querying sake, but it's more often done for performance reasons. I also could be missing more context with this advice, but I think i'll stick with what I've posted here as an answer.
The issue with adding reusableIds to batch-a is that we are only really concerned with the ids that are connected by both relates_to_batch and related_to_reusable via that connected vertex. For example, we might have a connected vertex that relates_to_batch via batch-a but might be related_to_reusable via r2. These ids can't be contained within the same batch, since it's specific to the combination of batch and reusable, so I'm not too sure how it would work by adding the list to batch-a. I hope this clarifies the questions you had about denormalization. Not sure if you wanna discuss this further, but happy to move on for now 🙂
without following that example too closely (and maybe i have to in order to properly answer you), i guess my question is what's the difference between you doing:
and
Do you not get the same ease of querying using the second approach? I just don't see what "isolated" is doing for you that is special. it looks like a way to quickly get information about a "batch" but in some separate structure.
Those examples are practically the same. I'm also not tied to either approach, I think I'm just trying to understand what is the most "correct" way of doing things. I guess this kind of goes back to my original question about storing vertex references in another vertex? I think the term batch here makes this a bit confusing cos we are already using batch for batch-a. Is this meant to symbolize the same object? I think it's supposed to be connected if we wanted to tie this closer to the example. Curious to know what the edge would connect to? Is it the batch that contains batch-a?
Sorry if we are going in circles!
yeah, i think we have a few circles. it's ok though...
what confused me was your saying that "I'm not too sure how it would work by adding the list to batch-a" so my reply was to clarify that if you understand how to do the "isolated" way that you are proposing, then i'm just saying put it on your "batch" which is connected to everything else instead of using an isolated vertex. this is just a mechanism for denormalization and i'd say an "isolated" vertex isn't a pattern to follow in this case.
denormalize within your graph structure to preserve the main part of why you chose a graph in the first place - for its connections
Right okay I think I get you now. I think it's probably important to note that there is information in that external vertex that is useful to the use case other than the batchId and reusableId -- content was just one example, so I think it makes sense for it to be its own vertex. Based on what you're saying, I think it probably makes sense just to stick with the relates_to_batch and relates_to_reusable example we spoke about earlier.
ok, well feel free to ask more questions if you get stuck or to just drop a note in #open-sharing as you continue with your work to let us know how you're doing. i'll probably mark this question as answered with my point about denormalization. that's probably the most general answer here for folks to learn from. thanks for the conversation :gremlin_smile:
(as i use that as excuse to try out my new emoji) 🙂
Thanks so much! I really appreciate your time and patience 🙂