Optimizing connection between Python API (FastAPI) and Neptune

Hi guys. I've been working with gremlin-python at my company for the past 4 years, using Neptune as the database. We are running a FastAPI server where Neptune has been the main database since the beginning. We've always struggled to get good performance out of the API, but recently it has become a more pressing pain, with endpoints taking more than 10s to respond.

We took some actions trying to improve this performance, such as updating the cluster to the latest engine version, and doing the same for the FastAPI and gremlin-python dependencies. Right now we're running with 3 instances (2 read replicas) of db.t4g.medium. We also tested with a single db.r6g.large, but we didn't see a significant improvement.

In the process of trying to understand what's causing the slowness, we've created a proof-of-concept API; the source code can be found in this repo: https://github.com/aca-so/neptune-poc/. We also created a new connector to Neptune, different from what we use in our main application, because our main application has a keep-alive mechanism to avoid Neptune closing the connections. For this PoC we used a different approach, recycling the connections every 5 minutes, based on the instances available in the cluster.

So the first question is: 1. "What's the best way to handle these connections? We thought of three approaches: keep-alive (we know it doesn't fit well with a connection pool), using a connection until it's closed and then renewing it, or renewing every X minutes. Is there another way? Which one is best?"
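To make the PoC approach concrete, here's a simplified sketch of the recycle-every-X-minutes idea (the endpoint URL is a placeholder, and locking/error handling are omitted; the actual implementation in the repo differs in details):

import time

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

NEPTUNE_URL = "wss://your-cluster-endpoint:8182/gremlin"  # placeholder
RECYCLE_AFTER_SECONDS = 300  # recycle connections every 5 minutes

class RecyclingConnection:
    # Hands out a traversal source, replacing the underlying websocket
    # connection once it's older than the recycle window.
    def __init__(self):
        self._conn = None
        self._created_at = 0.0

    def g(self):
        if self._conn is None or time.monotonic() - self._created_at > RECYCLE_AFTER_SECONDS:
            if self._conn is not None:
                self._conn.close()  # drop the stale connection
            self._conn = DriverRemoteConnection(NEPTUNE_URL, "g")
            self._created_at = time.monotonic()
        return traversal().with_remote(self._conn)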
Vitor Martins · 2mo ago
Moving on, in some tests we saw that the slowness increases when dealing with concurrent requests/queries. So we decided to do some load tests using ab under different scenarios. The application was running in a k8s cluster, with a single pod, in the same region as Neptune. We experimented across three characteristics: synchronicity, connection pool, and concurrent requests.

For synchronicity we tried:
- Synchronous queries: https://github.com/aca-so/neptune-poc/tree/sync-routes
- Asynchronous queries using asyncio.to_thread: https://github.com/aca-so/neptune-poc
- Asynchronous queries using asyncio.wrap_future: https://github.com/aca-so/neptune-poc/tree/async-future

For connection pool we tried:
- Without a connection pool (pool size = 1)
- Pool size of 5
- Pool size of 10

For concurrent requests we ran:
- Without concurrent requests
- 5 concurrent requests
- 10 concurrent requests
- 20 concurrent requests
- 50 concurrent requests
- 100 concurrent requests

My second question is: 2. "What's the best way to implement asynchronous queries?" (A sketch of the two async approaches follows the response sample below.)

The query we were running in the tests is a group count by label:

g.V().groupCount().by(__.label()).next()

Running it locally (connected to Neptune through a VPN) takes about 1030ms, and running it with Gremlin's profile step gives:
{
  "dur": 439596107,
  "metrics": [
    {
      "id": "0.0.0()",
      "name": "NeptuneGraphQueryStep(Vertex)",
      "dur": 439547328,
      "counts": {"traverserCount": 1, "elementCount": 1},
      "annotations": {"percentDur": 99.98890367789357},
      "metrics": []
    },
    {
      "id": "2.0.0()",
      "name": "NeptuneTraverserConverterStep",
      "dur": 48779,
      "counts": {"traverserCount": 1, "elementCount": 1},
      "annotations": {"percentDur": 0.011096322106419382},
      "metrics": []
    }
  ]
}
Here's the shape of the response, which represents a bit of our graph:
{
"A": 4993,
"B": 5570,
"C": 310,
"D": 121,
"E": 83,
"F": 3,
"G": 48,
"H": 9,
"I": 175,
"J": 28,
"K": 3125,
"L": 8,
"M": 16608,
"N": 48070,
"O": 459,
"P": 7667,
"Q": 892,
"R": 4460,
"S": 208,
"T": 27,
"U": 10,
"V": 20,
"W": 1197,
"X": 8167,
"Y": 226,
"Z": 9,
"AA": 145,
"AB": 654,
"AC": 9379,
"AD": 117,
"AE": 11148,
"AF": 40,
"AG": 6314,
"AH": 4,
"AI": 4378,
"AJ": 68,
"AK": 1,
"AL": 112,
"AM": 10,
"AN": 5,
"AO": 3,
"AP": 11443,
"AQ": 25843,
"AR": 198
}
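As mentioned above, here's roughly what the two async variants look like (a simplified sketch: the endpoint is a placeholder, and it assumes Traversal.promise() returns a concurrent.futures.Future, as the default gremlin-python driver does):

import asyncio

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("wss://your-cluster-endpoint:8182/gremlin", "g")  # placeholder
g = traversal().with_remote(conn)

async def group_count_to_thread():
    # Variant 1: run the blocking .next() on a worker thread so the
    # event loop stays free to serve other requests meanwhile.
    return await asyncio.to_thread(g.V().groupCount().by(__.label()).next)

async def group_count_wrap_future():
    # Variant 2: get a concurrent.futures.Future from the driver and
    # bridge it into asyncio without occupying a thread of our own.
    fut = g.V().groupCount().by(__.label()).promise(lambda t: t.next())
    return await asyncio.wrap_future(fut)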
The results we achieved can be checked in this spreadsheet: https://docs.google.com/spreadsheets/d/1lTd0ke4zhhWZ-LkpgkD2-1lqY01G8HbSo9DbJhKfHf4/edit?gid=0#gid=0

A few comments I want to make about these results:
- Using a connection pool larger than 1 sometimes makes queries fail with MemoryLimitExceededException.
- Running with asyncio.wrap_future caused all requests in some tests to fail, with no logs from the application; in those cases the tests were executed again. There were also cases where only a few requests failed, again without logs.

And finally, my third and last question (for now 😅) is: 3. "What can we do to improve our performance when dealing with concurrent requests?"
Solution
triggan · 2mo ago
There's a lot to unpack here....

1. We state in our docs that t4g.medium instances are really not great for production workloads. We support them for initial development so users can keep costs down, but the amount of resources available, and the fact that they are burstable instances, really constrains their usability. Once you've used up your CPU credits, you're going to get throttled.

2. Neptune's concurrency model is based on instance size and the number of vCPUs per instance. For each vCPU there are two query execution threads. A t4g.medium or an r6g.large instance has 2 vCPUs, which means the instance can only be computing 4 concurrent requests at a time. If you need more concurrency, you should look to scale to a larger instance with more vCPUs. If your workload varies over time, you may want to investigate Neptune Serverless, which can automatically scale vertically to meet the needs of the application. There's a good presentation from last year's re:Invent that discusses when Serverless works best and when not to use it: https://youtu.be/xAdWa0Ahiok?si=OeSe-_L3ErcYH-XU

Similar to this, the connection pool size should likely mirror the number of available execution threads on the instance(s). You can send more requests to a Neptune instance than it can currently process, but the additional requests will be queued (up to ~8,000 requests can end up in the request queue, which you can monitor via the MainRequestQueuePendingRequests CloudWatch metric).
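To make that concrete, one way to respect the execution-thread budget from the application side is to cap in-flight queries with a semaphore sized to vCPUs x 2. This is just a sketch; the limit of 4 is an assumption based on the 2-vCPU instances you tested:

import asyncio

# 2 vCPUs -> 4 query execution threads, so allow at most 4 queries
# in flight per instance; excess requests wait in the application
# instead of piling up in Neptune's server-side request queue.
EXECUTION_THREADS = 4
query_slots = asyncio.Semaphore(EXECUTION_THREADS)

async def run_query(make_query):
    # make_query is any coroutine function that executes one query.
    async with query_slots:
        return await make_query()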
triggan · 2mo ago
3. Websockets are great if you have a workload that is constantly sending requests, or if you're taking advantage of streaming data back to a client. However, they come with the overhead of maintaining those connections and handling reconnects when they die. These connections are generally long-lived in Neptune, except when they sit idle (>20-25 minutes) or when you're using IAM authentication (we terminate any connection older than 10 days, idle or not). If your workload isn't taking advantage of websockets, you may want to consider moving to basic HTTP requests instead. Neptune now has a NeptuneData API as part of the boto3 SDK that you can use to send requests to a Neptune endpoint as HTTP requests without needing a live websocket connection: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/neptunedata.html
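For example, a Gremlin query over HTTP via the neptunedata client looks like this (a minimal sketch; the endpoint URL is a placeholder):

import boto3

# The neptunedata client sends each query as a plain HTTPS request;
# there's no long-lived websocket to keep alive or recycle.
client = boto3.client(
    "neptunedata",
    endpoint_url="https://your-cluster-endpoint:8182",  # placeholder
    region_name="us-east-1",
)

response = client.execute_gremlin_query(
    gremlinQuery="g.V().groupCount().by(label)"
)
print(response["result"])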

 4. Count queries in Neptune are full scans. So any time you need to do a groupCount().by(label) for all vertices, this is going to scan the entire database. If you just want a count of nodes or nodes with a certain property, there’s also the Summary API that you can use for this: https://docs.aws.amazon.com/neptune/latest/userguide/neptune-graph-summary.html
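Using the same boto3 client, a summary call would look roughly like this (a sketch; it assumes the graph summary/statistics feature is enabled on your cluster):

import boto3

client = boto3.client(
    "neptunedata",
    endpoint_url="https://your-cluster-endpoint:8182",  # placeholder
)

# mode="detailed" also includes per-label and per-property breakdowns.
summary = client.get_propertygraph_summary(mode="detailed")
graph = summary["payload"]["graphSummary"]
print(graph["numNodes"], graph["numEdges"])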
Vitor Martins · 2mo ago
Hey @triggan, thanks a lot for your response; that's a lot for us to study, and I'm really happy to dive deeper into it! 😄

We'd seen that t3 and t4g are not recommended for production, which is why we tried an r6g.large at some point, but it will be good to run these tests on that instance class to compare against the t4g, and against Serverless as well. We'll do that and come back with the results.

Good to know about the connection pool size; it's good to have a solid basis for it. For now I think websockets fit our purpose better, but it's also good to know about the HTTP option, and we may try it at some point.

The count query was just an example that we consider representative of an average query for us: not that simple, but not too complex either. The idea was to experiment quickly, without having to rewrite one of our workload queries in another application that doesn't have our DSL. It may be something we can explore further in future tests.

Again, there's a lot for us to study and try. Thanks for your help, and I'd like to keep this thread open so we can go further with this discussion.
triggan · 2mo ago
Sounds good. We're more than happy to help. We can keep the thread open. Also happy to jump on a call and talk things through, as needed.