Very slow regex query (AWS Neptune)

We have a query that searches a data set of about ~400,000 vertices, matching properties using a case insensitive TextP.regex() expression. We are observing very bad query performance; even after several other optimizations, it still takes 20-45 seconds, often timing out. Simplified query:
g.V()
.has_label("foo").or_(
__.has("property_1", TextP.regex("(?i)^bar")),
__.has("property_2", TextP.regex("(?i)^bar")),
__.has("property_3", TextP.regex("(?i)^bar")),
)
.order()
.by("date", Order.asc)
.limit(1)
.value_map(True)
g.V()
.has_label("foo").or_(
__.has("property_1", TextP.regex("(?i)^bar")),
__.has("property_2", TextP.regex("(?i)^bar")),
__.has("property_3", TextP.regex("(?i)^bar")),
)
.order()
.by("date", Order.asc)
.limit(1)
.value_map(True)
We are on a db.r6g.xlarge instance, and do NOT observe any meaningful CPU or memory spikes from this query. We have profiled the query and the TextP.regex() portion seems to take 99%+ of the total runtime. We're looking for any information that might help us optimize this query or at least understand the poor performance a little better. Thanks in advance!
Solution:
When Neptune stores data it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which are optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact match lookups so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.
To get a performant query for partial text matches the suggestion is to use the Full Text search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html) , which will integrate with OpenSearch to provide robust full text searching capabilities within a Gremlin query...
Neptune Graph Data Model - Amazon Neptune
Learn about the four positions of a Neptune quad element.
Jump to solution
4 Replies
Solution
Dave
Dave4mo ago
When Neptune stores data it stores it in 3 different indexed formats (https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html#feature-overview-storage-indexing), each of which are optimized for a specific set of common graph patterns. Each of these indexes is optimized for exact match lookups so when running queries that require partial text matches, such as a regex query, all the matching property data needs to be scanned to see if it matches the provided expression.
To get a performant query for partial text matches the suggestion is to use the Full Text search integration (https://docs.aws.amazon.com/neptune/latest/userguide/full-text-search.html) , which will integrate with OpenSearch to provide robust full text searching capabilities within a Gremlin query
Neptune Graph Data Model - Amazon Neptune
Learn about the four positions of a Neptune quad element.
Andys1814
Andys1814OP4mo ago
@Dave Thank you for the information! We have seen that OpenSearch is the preferred strategy, but this would take a while for us to implement. Do you have any shorter term recommendations to improve this lookup before we can do OpenSearch?
Dave
Dave4mo ago
If 99% of the time is being spent on the regex then there is not much you can do from an optimization perspective outside of adding FTS support or redoing the data model to make them exact lookups
triggan
triggan4mo ago
To Dave's point... there are some data modeling strategies that you can use to eliminate the need for OpenSearch and use exact match references in your cluster: 1) If case is an issue, create an all lower or upper-case version of the property value and use that to match. 2) If searching for a specific term (and when you know you might be looking for that term again in the future), just do a one-time search + add for that term. Properties in Neptune have a default cardinality of set, so each property can contain multiple values. A good example of this is if I wanted to do a lookup of all movies in a database that contained "Star Wars: Episode" in the title.
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).values('title')
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).values('title')
This query takes ~5.5s to run across all movies in my dataset. I can add Star Wars: Episode as an additional title property value to each of those via:
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).property('title','Star Wars: Episode')
g.V().hasLabel('movie').has('title',regex('Star Wars: Episode (I|V)')).property('title','Star Wars: Episode')
Then I can just use a query like:
g.V().hasLabel('movie').has('title','Star Wars: Episode').values('title')
g.V().hasLabel('movie').has('title','Star Wars: Episode').values('title')
which now only takes 1.3ms to run to find the results. You could also store the exact match term as a separate property value too. I'm just storing it back into title for simplicity sake here. This pattern is only beneficial when you know you're going to need to find things multiple times. For adhoc searches, using OpenSearch is really the only solution. Neptune has no means, at present, to build an internal full-text-search index.
Want results from more Discord servers?
Add your server