Analyzing samples of Gremlin Queries in Neptune Notebook

Hey everyone, I'm working on a project where we give internal customers access to our Neptune graph through Neptune Notebook. There are already quite a few users, and we want to analyze the queries they run to see which parts of our ontology are used more and which are underused. This is not as straightforward as retrieving all labels from the query: our edge labels are not unique, and if people use .in or .out steps without specifying the entity name, it's almost impossible to tell which part of the ontology was visited.

We also want to identify common query patterns to understand what people usually query for and which connections in our ontology are used most frequently. For that we would filter out the parts common to all queries, like g.V(), and look instead at the combinations of multiple steps that were called.

We've figured out how to override the Gremlin magic in Neptune Notebook to add custom logic that handles each query. For my problem I'm considering two approaches:
- Running the Gremlin profiler on each query to get detailed info on the nodes visited, and then applying custom language analysis algorithms.
- Collecting this data and feeding it into an LLM to summarize the queries and answer the questions I'm interested in.

Has anyone here done something similar or have any insights on this? Would love to hear about your experiences or any advice you might have!
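To make the "combinations of multiple steps" idea concrete, here is a rough sketch of how step pairs could be counted from raw query strings. It is regex-based rather than a real Gremlin parser, so it will misread step-like text inside string literals, and the names `extract_steps`, `count_step_pairs`, `BOILERPLATE` and `LABEL_STEPS` are made up for illustration.

```python
import re
from collections import Counter

# Matches ".stepName(" plus an optional first quoted argument.
STEP_RE = re.compile(r"""\.\s*([A-Za-z_]\w*)\s*\(\s*('[^']*'|"[^"]*")?""")

# Steps that appear in nearly every query and carry no signal (assumption).
BOILERPLATE = {"V", "E"}
# Steps whose first argument is a label worth keeping (assumption).
LABEL_STEPS = {"hasLabel", "out", "in", "both", "outE", "inE", "bothE"}


def extract_steps(query):
    """Return the ordered list of steps in a query, minus boilerplate."""
    steps = []
    for name, first_arg in STEP_RE.findall(query):
        if name in BOILERPLATE:
            continue
        if name in LABEL_STEPS and first_arg:
            # Keep the label so out('knows') and out('owns') stay distinct.
            steps.append(f"{name}({first_arg})")
        else:
            # Unlabeled traversal steps (e.g. out()) stay ambiguous.
            steps.append(name)
    return steps


def count_step_pairs(queries):
    """Count adjacent step pairs (bigrams) across many queries."""
    pairs = Counter()
    for query in queries:
        steps = extract_steps(query)
        pairs.update(zip(steps, steps[1:]))
    return pairs


sample = [
    "g.V().hasLabel('person').out('knows').values('name')",
    "g.V().hasLabel('person').out('knows').count()",
]
print(count_step_pairs(sample).most_common(3))
```

Unlabeled steps such as out() still come out as a bare "out", which is exactly the ambiguity described above; the sketch only makes it visible, it does not resolve it.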
5 Replies
Kennh (7mo ago)
I'm not familiar with trying to figure out which portion of the graph is most frequently used based on the Gremlin queries issued against it. There might be some Neptune-specific functionality that can help as well. Does anyone from @neptune have any insights that could help?
ManabuBeach (7mo ago)
Comment: If you capture the audit log, you can definitely capture every query issued. Most of my queries start with either a label or a specific vertex ID, so in my case I will have a good idea of which portion is accessed a lot. I also have an updatedOn property on both V and E in my schema. Queries involving properties are extremely efficient because properties are indexed automatically. Gremlin allows queries by just property key and value, so in my case I could use that to deduce the busiest route in the graph. I am also interested to know if other ways exist.
Solution
triggan (7mo ago)
I think this is going to depend on how granular you want to get. If the intent is to see which labeled vertices or edges are accessed, then just looking at the query in the audit log would be sufficient. But if your intent is to see every atomic component that is accessed in the database as part of query execution, that could be expensive. It is possible, though. You could run every query through the Neptune Gremlin Profiler: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html and set profile.indexOps to True. You'll then get a section at the bottom of the profile output listing every index operation that occurs. These will equate to some permutation of S-P-O-G patterns that are used in the three built-in indexes (or the fourth index, if enabled).
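As a rough illustration of the profile call, the sketch below only builds the HTTP request and does not send it. It assumes the profile endpoint lives at /gremlin/profile on the default port 8182 and accepts a JSON body, as described in the linked page; the hostname and the helper name `build_profile_request` are placeholders, and IAM (SigV4) signing is left out entirely.

```python
import json
import urllib.request


def build_profile_request(endpoint, query, index_ops=True):
    """Build (but do not send) a POST request for the Gremlin profile API."""
    payload = {
        "gremlin": query,
        # Ask for the list of index operations at the end of the report.
        "profile.indexOps": index_ops,
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/gremlin/profile",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical cluster hostname, for illustration only.
req = build_profile_request(
    "my-cluster.example.com",
    "g.V().hasLabel('person').out('knows')",
)
print(req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` from a host that can reach the cluster, with request signing added if IAM authentication is enabled.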
With the list of index lookup patterns, you could maintain an external counter (maybe in a sorted set in Redis/Valkey) with the S-P-O-G combination as the key and the number of times accessed as the value. Just be aware that obtaining a Neptune Gremlin Profile output requires running the query again. So you may not be able to use this to capture writes (without rewriting the data), and re-running all of the read queries will consume additional database resources.
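A minimal sketch of the counter idea, using an in-memory `collections.Counter` as a stand-in for the sorted set so it runs without any server. The class name `AccessCounter` and the pattern strings are hypothetical, and extracting the patterns from the profile output is assumed to have happened already. With Redis or Valkey, the `record` step would roughly correspond to one ZINCRBY per pattern.

```python
from collections import Counter


class AccessCounter:
    """In-memory stand-in for a sorted set keyed by lookup pattern."""

    def __init__(self):
        self._counts = Counter()

    def record(self, patterns):
        # One increment per index operation seen in a profile report.
        self._counts.update(patterns)

    def top(self, n):
        # Most frequently accessed patterns first.
        return self._counts.most_common(n)


counter = AccessCounter()
# Placeholder keys; real keys would come from the indexOps section.
counter.record(["pattern-a", "pattern-b"])
counter.record(["pattern-a"])
print(counter.top(2))
```

Keeping the counter outside Neptune means the analysis itself adds no load to the graph; only the profile re-runs do.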
triggan (7mo ago)
If you're not familiar with SPOG and what that equates to in Neptune, here's the document that explains it: https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
ManabuBeach (7mo ago)
Mr SPOG (subject, predicate, object, graph) comes up a lot. The SPOG page is an important architectural briefing for all developers working with Neptune.