Andys1814
Apache TinkerPop
Created by Andys1814 on 8/26/2024 in #questions
Very slow regex query (AWS Neptune)
We have a query that searches a data set of roughly 400,000 vertices, matching properties with a case-insensitive TextP.regex() expression. Query performance is very poor: even after several other optimizations, it still takes 20-45 seconds and often times out. Simplified query:
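from gremlin_python.process.graph_traversal import __
from gremlin_python.process.traversal import Order, TextP

# g is an existing GraphTraversalSource connected to Neptune
(
    g.V()
    .has_label("foo")
    .or_(
        __.has("property_1", TextP.regex("(?i)^bar")),
        __.has("property_2", TextP.regex("(?i)^bar")),
        __.has("property_3", TextP.regex("(?i)^bar")),
    )
    .order()
    .by("date", Order.asc)
    .limit(1)
    .value_map(True)
)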
We are on a db.r6g.xlarge instance, and do NOT observe any meaningful CPU or memory spikes from this query. We have profiled the query and the TextP.regex() portion seems to take 99%+ of the total runtime. We're looking for any information that might help us optimize this query or at least understand the poor performance a little better. Thanks in advance!
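To spell out what the predicate is meant to match: `(?i)^bar` is an anchored, case-insensitive prefix test. The sketch below shows the same pattern with Python's `re` module on made-up sample values; it only illustrates the intended matching semantics and assumes nothing about how Neptune evaluates TextP.regex() internally. The helper name `matches_prefix` is hypothetical.

```python
import re

# Same pattern as in the query: (?i) makes it case-insensitive,
# ^ anchors the match at the start of the value.
pattern = re.compile(r"(?i)^bar")


def matches_prefix(value: str) -> bool:
    """Return True if the value starts with "bar" in any letter case."""
    return pattern.search(value) is not None


# Made-up sample values, not taken from our data set.
samples = ["bar", "BARcode", "Barista", "foobar", "ba"]
matched = [s for s in samples if matches_prefix(s)]
print(matched)
```

Only values that begin with "bar" in some casing pass; "foobar" does not, because of the `^` anchor.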
6 replies
Apache TinkerPop
Created by Andys1814 on 11/7/2023 in #questions
Sequential IDs in Neptune?
@neptune I'm attempting to implement sequential IDs for the vertices in our AWS Neptune graph. So far, we have added a new property called vertexNumber, which stores the numeric sequential ID for each vertex. Before saving a vertex to the database, I run a simple query to retrieve the current highest vertex number, increment it, and store the new vertexNumber on that vertex. Pseudo-code below.
// Calculate the current highest vertexNumber
id = g.V().hasLabel('my_vertex_label').has("vertexNumber").values("vertexNumber").max()

// Increment result by 1 which will be for the next vertex we save.
vertexNumber = id + 1

// Add the new vertex
g.addV('my_vertex_label').property(Cardinality.single, "vertexNumber", vertexNumber).property(...etc)
My question is: how will Neptune handle this at scale? For context, we have a distributed architecture in which tens or hundreds (in very rare cases, maybe over a thousand) of new vertices can be created per second, so our DB cluster probably sees a lot of concurrent transactions. We are looking for information on how Neptune will handle the initial read query with, for example, 10 or more concurrent transactions. Will all 10+ transactions return the same vertexNumber, or will Neptune be smart enough to isolate these queries? Thanks!
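To make the concern concrete, here is a minimal sketch in plain Python with no Neptune involved. It models the worst case I'm worried about: every client runs the "read max" step before any client has written its new vertex. The function name `allocate_without_isolation` and the sample numbers are hypothetical, and the sketch says nothing about how Neptune actually isolates transactions.

```python
def allocate_without_isolation(existing_numbers, n_clients):
    """Simulate n_clients doing read-max-then-increment with no coordination.

    Every client reads the current maximum before any other client has
    written its new vertex, so they all compute the same "next" number.
    """
    # Step 1: each client independently reads the current highest vertexNumber.
    reads = [max(existing_numbers) for _ in range(n_clients)]
    # Step 2: each client increments its own stale read.
    return [r + 1 for r in reads]


# Hypothetical state: three vertices already numbered 1..3, ten concurrent writers.
assigned = allocate_without_isolation([1, 2, 3], 10)
print(assigned)
print(len(set(assigned)))
```

In this simulation all ten writers end up with the same vertexNumber, which is exactly the duplicate-ID outcome the question is asking whether Neptune prevents.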
16 replies