Design decision related to multiple heterogeneous relational graphs
I'm working with over 100k instances of heterogeneous, relational node-and-edge attributed graphs, each graph having around 5k vertices and 10k edges. Vertices are of 3 types with 10 attributes (7 numerical, 3 string), and edges are of 5 types with 8 attributes (4 numerical, 4 string). Considering the complexity and size of the data, running queries like traversal paths, average clustering coefficients, and identifying nodes in clustering triangles across all these instances presents a significant challenge.
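For reference, here is a minimal sketch of what two of those per-graph queries compute: the average clustering coefficient and the set of nodes that sit in at least one triangle. It is plain Python, not Gremlin, and it makes simplifying assumptions that are not from the data described above: the graph is treated as undirected, and vertex/edge types and attributes are ignored. The function names are made up for illustration.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # assumption: self-loops are ignored
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def local_clustering(adj):
    """Return {node: local clustering coefficient} for an undirected simple graph."""
    coeffs = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # fewer than two neighbours: no triangle possible
            continue
        # Each connected pair of neighbours closes one triangle through `node`.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    coeffs = local_clustering(adj)
    return sum(coeffs.values()) / len(coeffs) if coeffs else 0.0


def nodes_in_triangles(adj):
    """Return the set of nodes that belong to at least one triangle."""
    return {
        node
        for node, nbrs in adj.items()
        if any(v in adj[u] for u, v in combinations(nbrs, 2))
    }


if __name__ == "__main__":
    # Toy graph: one triangle (1, 2, 3) plus a pendant vertex 4.
    adj = build_adjacency([(1, 2), (2, 3), (1, 3), (3, 4)])
    print(average_clustering(adj))
    print(sorted(nodes_in_triangles(adj)))
```

At roughly 5k vertices and 10k edges per graph, a computation like this is small per instance; the difficulty described here is more about loading and persisting 100k such instances than about any single query.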
I've been using a naive Gremlin Server setup with an in-memory database to run my queries on one graph instance, but it's becoming clear that this approach isn't sustainable for multi-graph persistence or memory efficiency, as a single graph instance consumes about 1.2 GB of RAM. I'm exploring the possibility of switching to JanusGraph with a Berkeley DB backend to support persistent storage of multiple graphs (based on the feedback I got from the Gremlin Users Google Group, https://groups.google.com/g/gremlin-users/c/UotOZFVvi3k/m/-hVd2oNNAQAJ).
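To put the memory concern in numbers, here is a back-of-the-envelope sketch using the figures above (100k graphs, about 1.2 GB each). The 64 GB RAM budget is a hypothetical value picked only for illustration, and the function names are made up.

```python
GRAPH_COUNT = 100_000      # "over 100k instances", taken as a lower bound
RAM_PER_GRAPH_GB = 1.2     # observed footprint of one in-memory graph


def total_ram_tb(graph_count, ram_per_graph_gb):
    """RAM needed to hold every graph in memory at once, in decimal terabytes."""
    return graph_count * ram_per_graph_gb / 1000


def graphs_that_fit(ram_budget_gb, ram_per_graph_gb):
    """How many graphs fit in a given RAM budget at the same time."""
    return int(ram_budget_gb // ram_per_graph_gb)


if __name__ == "__main__":
    # Roughly 120 TB to keep all graphs resident simultaneously.
    print(round(total_ram_tb(GRAPH_COUNT, RAM_PER_GRAPH_GB), 1), "TB")
    # Hypothetical 64 GB machine: only a few dozen graphs at a time.
    print(graphs_that_fit(64, RAM_PER_GRAPH_GB), "graphs")
```

So keeping everything in memory is out of reach, which is why loading one graph at a time from persistent storage matters.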
Given the data structure and requirements, especially the need for efficient loading and querying of individual graph instances in a possibly serializable fashion, do you think JanusGraph with Berkeley DB is a viable solution, or are there alternative approaches I should consider for managing and querying this volume of graph data effectively?
I tried to find similar questions; the closest one I found was https://discord.com/channels/838910279550238720/1087383361129037845, but it asks how to manage multiple graphs in Gremlin Server.
4 Replies
Hosting 100k small graph instances isn't a usage pattern I've seen a whole lot. JanusGraph seems like a reasonable choice to me, although I see you've been running into issues with conflicting vertex/edge IDs. I'm unsure if JanusGraph supports non-globally unique IDs in multiple graph deployments. My understanding is that JanusGraph generally recommends avoiding user-defined IDs whenever possible, in favour of IDs generated automatically by JanusGraph.
Perhaps some @janusgraph folks with more familiarity with configuring multiple graphs can give some clearer advice for your setup.
Solution
No, we actually recommend using user-defined IDs
But yeah I've never seen anyone hosting 100k small graph instances
In theory it should work, though
I might be wrong but IIRC different graphs could use same IDs without issues. In other words, there's no globally unique ID in a multi-tenant JanusGraph setup.
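One way to picture the setup being discussed is one JanusGraph properties file per graph instance, each pointing at its own Berkeley DB directory, so every graph has separate storage. Below is a rough Python sketch that generates such files. The keys `gremlin.graph`, `storage.backend=berkeleyje` and `storage.directory` are standard JanusGraph settings; the directory layout, the graph names and the helper functions are assumptions made up for this example, and whether 100k separate Berkeley DB environments behave well in practice is untested here.

```python
from pathlib import Path


def janusgraph_properties(graph_name, data_root):
    """Return the text of a properties file for one graph with its own BerkeleyDB directory."""
    storage_dir = Path(data_root) / graph_name  # hypothetical layout: one directory per graph
    lines = [
        "gremlin.graph=org.janusgraph.core.JanusGraphFactory",
        "storage.backend=berkeleyje",
        f"storage.directory={storage_dir.as_posix()}",
    ]
    return "\n".join(lines) + "\n"


def write_configs(graph_names, data_root, conf_dir):
    """Write one <graph_name>.properties file per graph and return the created paths."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in graph_names:
        path = conf_dir / f"{name}.properties"
        path.write_text(janusgraph_properties(name, data_root))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical graph names and directories.
    for p in write_configs(["graph_00001", "graph_00002"], "data/berkeley", "conf/graphs"):
        print(p)
```

A single graph could then be opened on demand from its own properties file (for example via `JanusGraphFactory.open(...)` on the JVM side) and closed again after the queries finish, rather than keeping all instances loaded.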
Thank you so much, guys. As I make progress, I will update this thread in case someone needs it in the future.