Apache TinkerPop•14mo ago

Design decision related to multiple heterogeneous relational graphs

I'm working with over 100k instances of heterogeneous, relational graphs with attributed nodes and edges, each graph having around 5k vertices and 10k edges. Vertices are of 3 types with 10 attributes (7 numerical, 3 string), and edges are of 5 types with 8 attributes (4 numerical, 4 string). Given the complexity and size of the data, running queries such as path traversals, average clustering coefficients, and identifying nodes that take part in triangles across all these instances is a significant challenge.

I've been using a naive gremlin-server setup with an in-memory database to run my queries on one graph instance, but it's becoming clear that this approach isn't sustainable for multi-graph persistence or memory efficiency, as a single graph instance consumes about 1.2 GB of RAM. I'm exploring the possibility of switching to JanusGraph with a Berkeley DB backend to support persistent storage of multiple graphs (based on the feedback I got from the Gremlin users Google group, https://groups.google.com/g/gremlin-users/c/UotOZFVvi3k/m/-hVd2oNNAQAJ).

Given the data structure and requirements, especially the need for efficient loading and querying of individual graph instances in a possibly serializable fashion, do you think JanusGraph with Berkeley DB is a viable solution, or are there alternative approaches I should consider for managing and querying this volume of graph data effectively? I tried to find similar questions; the closest match I found was https://discord.com/channels/838910279550238720/1087383361129037845, but that one asks how to manage multiple graphs in gremlin-server.
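For readers unfamiliar with the metrics mentioned above, here is a minimal sketch of what "nodes in triangles" and "average clustering coefficient" mean for a single graph instance. It is plain Python on a toy edge list, not Gremlin or JanusGraph code, and it assumes the graph is treated as undirected and that vertices with fewer than two neighbours contribute a coefficient of 0. The function names are made up for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops and duplicate edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def triangles_per_vertex(adj):
    """Count, for each vertex, the triangles it participates in."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbours that are also adjacent.
        counts[v] = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return counts


def average_clustering(adj):
    """Mean local clustering coefficient; vertices with degree < 2 count as 0."""
    if not adj:
        return 0.0
    tri = triangles_per_vertex(adj)
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            total += 2.0 * tri[v] / (k * (k - 1))
    return total / len(adj)


# Toy graph: one triangle (a, b, c) plus a pendant vertex d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(edges)
tri = triangles_per_vertex(adj)
in_triangles = sorted(v for v, t in tri.items() if t > 0)
avg_cc = average_clustering(adj)
```

At roughly 5k vertices and 10k edges per graph, a computation like this is cheap for one instance; the hard part described in the question is loading and iterating over 100k instances.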



4 Replies

ColeGreer•14mo ago

Hosting 100k small graph instances isn't a usage pattern I've seen very often. JanusGraph seems like a reasonable choice to me, although I see you've been running into issues with conflicting vertex/edge IDs. I'm unsure whether JanusGraph supports non-globally unique IDs in multi-graph deployments. My understanding is that JanusGraph generally recommends avoiding user-defined IDs whenever possible, in favour of the IDs it generates automatically. Perhaps some @janusgraph folks with more familiarity with configuring multiple graphs can give clearer advice for your setup.

Solution

Bo•14mo ago

No, we actually recommend using user-defined IDs.

Bo•14mo ago

But yeah, I've never seen anyone hosting 100k small graph instances. In theory it should work. I might be wrong, but IIRC different graphs can use the same IDs without issues. In other words, there's no globally unique ID in a multi-tenant JanusGraph setup.
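As a rough illustration of why IDs would not need to be globally unique, here is a sketch that generates one JanusGraph properties file body per graph instance, each pointing at its own Berkeley DB directory. Giving each graph a separate storage directory is an assumption made for this example, not something confirmed in the thread. The helper name `berkeley_properties` and the base path `/data/janusgraph` are hypothetical, and the property keys should be checked against the JanusGraph configuration reference for your version.

```python
from pathlib import PurePosixPath


def berkeley_properties(graph_name, base_dir="/data/janusgraph"):
    """Return properties text for one graph with its own Berkeley DB directory.

    graph_name and base_dir are illustrative; adjust them to your layout.
    """
    storage_dir = PurePosixPath(base_dir) / graph_name
    settings = {
        "gremlin.graph": "org.janusgraph.core.JanusGraphFactory",
        "storage.backend": "berkeleyje",
        # Separate directory per graph, so each graph has its own ID space.
        "storage.directory": str(storage_dir),
        # Assumed to be wanted here: lets the application supply vertex IDs.
        "graph.set-vertex-id": "true",
    }
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


# Two graph instances get independent storage locations.
props_a = berkeley_properties("graph_00001")
props_b = berkeley_properties("graph_00002")
```

With this layout, vertex ID 1 in `graph_00001` and vertex ID 1 in `graph_00002` live in different stores, which matches the point above that different graphs can reuse the same IDs.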

dracule_redroseOP•14mo ago

Thank you so much, guys. As I make progress, I will update this thread in case someone needs it in the future.
