How to create indexes by Label?
In search of performance improvements, the AWS Neptune experts suggested that I create some indexes. For context, I have three database operations behind a single POST endpoint: a query that fetches the existing relationships of a specific ID, a deletion of edges if a record already exists in the database, and a creation/update of vertices and edges. Right now I am trying to attack two problems: improving the performance of the creation, which takes approximately 150ms, and improving the performance of the query, which currently bogs down at anywhere between 1.2 and 17 seconds.
Is it possible to create an index for vertices and edges by specifying them by label, since I have vertices and edges with different labels that have different properties? Does anyone know what this implementation would look like? In my current implementation I do it in a simple way, as follows:
from gremlin_python.driver import client, serializer

client_write = client.Client(neptune_url, "g",
                             message_serializer=serializer.GraphSONMessageSerializer())
queries = [
    "graph.createIndex('journey_id', Vertex.class)",
    "graph.createIndex('person_type', Vertex.class)",
    "graph.createIndex('relationship_type', Edge.class)"
]
for query in queries:
    client_write.submit(query).all().result()
8 Replies
I'm testing JanusGraph and have the same label index issue. If I use hasLabel() it performs a full scan. What I did is add a fake 'label' property on vertices, with a mixed index on it, and then I only filter on this property (it's OK for now with a small graph of 500k nodes).
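A minimal sketch of that workaround with gremlin-python, assuming a hypothetical vertex_label property and that the JanusGraph mixed index on it has already been created through the management API (the endpoint is a placeholder):

# Sketch: duplicate the label into an ordinary, indexable property and filter on it.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')  # assumed endpoint
g = traversal().withRemote(conn)

# Write path: store the label twice, as the real label and as a property.
g.addV('Person').property('vertex_label', 'Person').property('journey_id', 'journey-id-uuid').iterate()

# Read path: has() on the property can use the index, where hasLabel() alone cannot.
people = g.V().has('vertex_label', 'Person').limit(10).elementMap().toList()

conn.close()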
the AWS Neptune experts suggested that I create some indexes
Which experts were these? Neptune doesn't support the creation of indexes beyond the three native indexes that are created by default. There's a fourth, optional index, but it is only needed for very specific use cases: https://docs.aws.amazon.com/neptune/latest/userguide/features-lab-mode.html#features-lab-mode-features-osgp-index
Neptune's indexing structure is explained here: https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html
If you have some more details on the query, I might be able to help determine the best way to (re)write that to take advantage of the current indexes. Or potentially rework your data model to better fit within the existing indexes.
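In the meantime, Neptune's Gremlin explain and profile endpoints can show how a query is being resolved against those native indexes. A minimal sketch, assuming the /gremlin/explain and /gremlin/profile HTTP APIs described in the Neptune docs (endpoint and port are placeholders, and requests would need to be SigV4-signed if IAM auth is enabled):

# Sketch: ask Neptune for the execution plan (explain) and per-step timings (profile).
import requests

NEPTUNE = 'https://your-neptune-endpoint:8182'
query = 'g.V("client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").otherV().path()'

explain = requests.post(f'{NEPTUNE}/gremlin/explain', json={'gremlin': query})
print(explain.text)   # plan showing how each pattern is matched

profile = requests.post(f'{NEPTUNE}/gremlin/profile', json={'gremlin': query})
print(profile.text)   # actual execution with per-step timings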
Which experts were these?
AWS consultants who offer support to the company I work for. They work for Amazon.
I send this query using submit on client.Client to return the saved data:
g.V().has(T.id, "client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").has(T.id, containing("tenant-id-uuid")).otherV().path().unfold().dedup().elementMap().toList()
and I use this one to insert the data:
g.mergeV([(id): 'client-id-uuid']).option(onCreate, [
    (label): "Person",
    'journey_id': "journey-id-uuid",
    'person_type': "F",
    'document': "str(document.model_dump())",
    'included_at': '2024-12-11',
    'updated_at': '2024-12-11'
]).option(onMatch, ['updated_at': '2024-12-11'])
.mergeV([(id): 'client-id-uuid']).option(onCreate, [
    (label): "Person",
    'journey_id': "journey-id-uuid",
    'person_type': "F",
    'document': "str(document.model_dump())",
    'included_at': '2024-12-11',
    'updated_at': '2024-12-11'
]).option(onMatch, ['updated_at': '2024-12-11'])....
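For reference, a rough sketch of the same upsert written with the gremlin-python traversal API instead of a submitted string; it assumes gremlin-python 3.6+ and a Neptune engine version that supports mergeV(), and the endpoint, ids, and property values are placeholders from the thread:

# Sketch: the chained mergeV() upsert expressed with the gremlin-python GLV.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import T, Merge
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().with_remote(conn)

on_create = {
    T.label: 'Person',
    'journey_id': 'journey-id-uuid',
    'person_type': 'F',
    'document': 'document-json-placeholder',  # str(document.model_dump()) in the app code
    'included_at': '2024-12-11',
    'updated_at': '2024-12-11',
}

(g.merge_v({T.id: 'client-id-uuid'})
   .option(Merge.on_create, on_create)
   .option(Merge.on_match, {'updated_at': '2024-12-11'})
   .iterate())

conn.close()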
This might be one of your issues:
Neptune does not have a Full Text Search index. Using any of the text predicates (i.e. containing(), startingWith(), endingWith(), etc.) will require de-dictionarifying the values for each of the solutions up to that portion of the query. If that is a common pattern, I might suggest using a different property so you can do just a has(key, value) filter.
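For example, if the tenant were stored as an ordinary edge property (a hypothetical tenant_id key, not part of the current model), the lookup becomes an exact has() match with no text predicate. A rough gremlin-python sketch:

# Sketch: filter edges by an exact-match property instead of containing() on the edge id.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

result = (g.V('client-id-uuid')
          .bothE('has_profile', 'has_affiliated', 'has_controlling')
          .has('tenant_id', 'tenant-id-uuid')   # exact match; no dictionary scan needed
          .otherV()
          .path()
          .unfold().dedup().elementMap()
          .toList())

conn.close()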
@triggan I found that just by calling g.V().limit(1) with concurrent calls on an r6g.2xlarge machine, the average time is 250ms, which I consider very high. But in the query mentioned, the bottleneck starts at the stage where it calls the last otherV() before path().
g.V().has(T.id, "client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").has(T.id, containing("tenant-id-uuid")).otherV().path().unfold().dedup().elementMap().toList()
Even trying other approaches, such as inV(), and removing dedup(), the processing time is still very high. It takes 4-12s just to return the query. In the environment where I perform the test, I have 370k vertices and 360k edges. If I just make the call like this, without bringing back the rest of the data:
g.V().has(T.id, "client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").has(T.id, containing("tenant-id-uuid")).path().unfold().dedup().elementMap().toList()
it returns in 500ms.
Do you know what other approach I could use to process this last item, otherV(), without bottlenecking?
just by calling g.V().limit(1) with concurrent calls on an r6g.2xlarge machine, the average time is 250ms
How many concurrent calls? An r6g.2xlarge instance has 8 vCPUs (and 16 available query execution threads). If you're issuing more than 16 requests in parallel, any additional concurrent requests will queue (an instance can queue up to 8,000 requests). You can see this with the /gremlin/status API (or the %gremlin_status Jupyter magic), which reports the number of executing queries and the number of "accepted" queries. If you need more concurrency, then you'll need to add more vCPUs (either by scaling up or by scaling out to read replicas).
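A quick way to check that, assuming the Gremlin query status API from the Neptune docs (endpoint and port are placeholders; signed requests would be needed if IAM auth is enabled):

# Sketch: poll Neptune's Gremlin status endpoint to compare accepted vs. running queries.
import requests

status = requests.get('https://your-neptune-endpoint:8182/gremlin/status').json()
print(status.get('acceptedQueryCount'), 'accepted')  # includes queued requests
print(status.get('runningQueryCount'), 'running')    # currently executing on the threads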
But in the query mentioned, the bottleneck starts at the stage where it calls the last otherV() before path().
g.V().has(T.id, "client-id-uuid").bothE("has_profile", "has_affiliated", "has_controlling").has(T.id, containing("tenant-id-uuid")).otherV().path().unfold().dedup().elementMap().toList()
Makes sense, as you're using a text predicate here (containing()). Neptune does not maintain a Full Text Search index, so any use of text predicates such as containing(), startingWith(), endingWith(), etc. will incur some form of range scan and also require dictionary materialization (we lose all of the benefits of data compression here, as each value must be fetched from the dictionary to compare with the predicate value you've provided).
The data modeling looks a bit odd here. I'm not sure I would try to use any component of an edge ID as a filter. At that point, you're sort of attempting to use the edges to model some form of entity, which can be a bit of an anti-pattern. Edges are meant to represent relationships (actions, verbs) in a graph, whereas nodes/vertices are meant to represent entities (nouns, things). If this is a common query pattern, you may want to look at further de-normalizing the data model and creating a labeled node of Tenant. Executing a query of
g.V(<client_id>).repeat(both(<list_of_edge_labels>).simplePath()).times(2).path()
should perform better than what you currently have.
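A rough gremlin-python rendering of that suggestion, keeping the edge labels from the thread; the two-hop depth and the elementMap() projection are assumptions added for illustration:

# Sketch: expand two hops from the client vertex over the known edge labels,
# avoiding any text-predicate filtering on edge ids.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

paths = (g.V('client-id-uuid')
         .repeat(__.both('has_profile', 'has_affiliated', 'has_controlling').simplePath())
         .times(2)
         .path().by(__.elementMap())
         .toList())

conn.close()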