Testing new Neptune version
Hey all,
We're currently on Neptune 1.1.0.0. To prepare for the next Neptune version, 1.2.0.2, we built a fairly large test that covers most of what we do with Neptune.
Our developer environment doesn't have Neptune, so we ran the extensive testing there against a local TinkerPop server, version 3.5.4, which according to the release notes is the latest version used by Neptune.
After a month of testing and making sure everything worked as intended, we went ahead and upgraded our Staging environment from 1.1.0.0 to 1.2.0.2. However, we quickly started seeing problems and had to roll back ASAP so our developers weren't affected.
Is testing against a local TinkerPop server not good enough to gain confidence? What can we do to get a good understanding of the differences between Neptune and a TinkerPop server? Should we set up a Neptune instance in the developer environment for the devs to test against?
Please advise.
20 Replies
After investigating for a bit, I've found that there is an issue with unions. So far I've only found a single query affected by it:
-- FULL QUERY, TWO UNIONS --
-- SAME QUERY SPLIT TO TWO, ONE UNION LEFT --
-- QUERY SPLIT TO THREE, NO UNIONS --
This is the same query, but instead of running it as a single query, I ran the parts separately. Same logic, but without unions.
Only the last one (no unions) worked. The others look like they work (no errors), but for some reason they have no effect on the DB.
To be clear: the query is 4-5 years old by now and is part of a main flow. We know it worked for a long time prior to the upgrade.
Update: we upgraded to 1.2.1.0, since the patch notes list two UnionStep bug fixes. However, the issue persists in this version.
Happy to look into this. Are you able to share the Neptune query profiles for the two non-working ones? Just to clarify, you are seeing no errors?
Also, are those the exact queries? I ask as this is not valid Gremlin grammar:
The dot before the `has` specifically.
Thank you for your response. Let me answer one by one:
"Are you able to share the Neptune query profiles for the two non working ones."
Not sure what a query profile is. Can you elaborate?
"Just to clarify, you are seeing no errors?"
Correct. The response returns successfully, both from code and from a CLI client we set up that sends commands to the server. It is as if it ran an empty response.
"Also, are those the exact queries? I ask as this is not valid Gremlin grammar:"
Oops, editing mistake. The query is valid; it's the copy-paste here that had the improper values. It did scream at me for wrong syntax a few times and I fixed it. Oh, my notepad is removing those `__` because of markdown. I'll fix the query, just a few minutes.
Fixed.
Neptune has a `/profile` API endpoint that will run the query and return the query plan along with a lot of runtime statistics. The profiles really help diagnose query issues. You can get a profile either by sending the query as a text string to the `cluster:8182/gremlin/profile` endpoint or, easier if you have it set up, from a notebook using `%%gremlin profile`.
Gremlin profile API in Neptune - Amazon Neptune
The Gremlin profile feature in Neptune runs a specified traversal, collects metrics about the run, and produces a profile report.
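As a rough sketch of the first option, the request can be built with Python's standard library. This assumes a non-IAM cluster; `my-neptune-endpoint` and the sample traversal are placeholders, `build_profile_request` is a hypothetical helper, and the JSON body with a `gremlin` key follows the format shown in the Neptune profile documentation. With IAM enabled the request would also need SigV4 signing, which is not shown here.

```python
import json
import urllib.request


def build_profile_request(endpoint: str, query: str, port: int = 8182) -> urllib.request.Request:
    """Build a POST request for Neptune's Gremlin profile endpoint."""
    url = f"https://{endpoint}:{port}/gremlin/profile"
    # The traversal is sent as a plain text string under the "gremlin" key.
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder endpoint and traversal; substitute your own.
    req = build_profile_request("my-neptune-endpoint", "g.V().limit(1)")
    print(req.full_url)
    # To actually send it (requires network access to the cluster):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

The response is the profile report as text, so it can be saved to a file and attached when it is too long to paste.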
Thank you. This is probably not simple to do with my permissions, so I will have to talk to DevOps and ask them to do this. As it is the end of the week for me (in Israel we work Sunday to Thursday), it will have to wait until Sunday next week. I will share it as soon as I can.
In the meantime, can you please check if there are known issues with UnionStep in the latest version? I tried looking for a bug tracker or anything else that could hint at known issues.
As a side note: we have a lot of `union` calls and they all work fine. The only big difference I can see is that those unions operate on vertices and work, while this one is purely edges. Maybe it has something to do with it?
I will see what I can find out.
Most appreciated. I will supply the profile on Sunday.
It probably also makes sense to open a support case for this if you are able to do so. I tried to reproduce what you are seeing on 1.2.1.0 using a small test graph I built and the query seemed to work. If you do open a case please DM me the case number.
Will do. Thank you for doing this test.
In case it helps, this is the simple test setup I used. If by chance you can see a way to tweak it that shows the potential bug that would be great.
Looks good. I wonder if you need any non-change edges as well. We have a lot of edges that are also affected. They have a similar structure to change, with tenantId and session properties.
I tried with and without a `change` edge and still got good results.
Hey, some update: I've been struggling for a while to use the profiler. For some reason I'm always getting a failure message related to the profile serializer. I've tried multiple serializers, all of which give me the same error:
(To redact it, I changed my endpoint to `my-neptune-endpoint`.)
Additionally, I've opened a ticket with AWS, which includes most of the information shown here. DM'd you the ticket ID.
I will try again tomorrow; I might have made a silly mistake somewhere.
If you have a notebook configured, you can just use `%%gremlin profile` in the cell header. If IAM is enabled you will also need to set that using `%%graph_notebook_config`.
Regarding `awscurl`, this worked for me on a non-IAM cluster. Note the use of `-H`.
Man, I've been trying for like 30 minutes to get it to work, and realized that Python is splitting the `accept: application/json` header on the `:` followed by a space (I didn't have a space).
Seems like it works now though, thanks!
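To illustrate the header problem described above, here is a minimal sketch. It assumes the client splits each `-H` value on a colon followed by a space, as described; `parse_header` is a hypothetical helper written for illustration, not the tool's actual code.

```python
from typing import Tuple


def parse_header(raw: str) -> Tuple[str, str]:
    """Split a 'name: value' header string on the first ': ' separator."""
    name, sep, value = raw.partition(": ")
    if not sep:
        # No colon-plus-space found, so there is nothing to use as the value.
        raise ValueError(f"header {raw!r} has no ': ' separator")
    return name, value


if __name__ == "__main__":
    print(parse_header("accept: application/json"))
    try:
        parse_header("accept:application/json")
    except ValueError as err:
        print(err)
```

Under that assumption, the header without a space never yields a name and value pair, which would explain a failure that looks unrelated to the query itself.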
Hey, now that profile works for me, I managed to reproduce the issue. What I did was set the state of all `change` edges that should be affected to `new`.
The full query then ran successfully, but when I checked, the `change` edges were still `new` rather than `published`.
Here is the profile. I had to send it like this as it's too big for Discord posts.
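The before-and-after check described above can be sketched as follows. Everything named here is an assumption for illustration: `submit` stands for whatever function sends a Gremlin string to the server and is assumed to return an integer for count traversals, and the count traversal guesses at the schema (a `change` edge label with a `state` property) from the description in this thread. The real update query is passed in unchanged.

```python
from typing import Any, Callable

# Hypothetical traversal; label and property names are guesses from the thread.
COUNT_BY_STATE = "g.E().hasLabel('change').has('state', '{state}').count()"


def check_update_took_effect(
    submit: Callable[[str], Any],
    update_query: str,
    old_state: str = "new",
) -> bool:
    """Run update_query and report whether any edges left old_state."""
    count_query = COUNT_BY_STATE.format(state=old_state)
    before = submit(count_query)
    submit(update_query)  # may return without error even when nothing changed
    after = submit(count_query)
    # A silent no-op shows up as an unchanged count, not as an exception.
    return before > 0 and after < before
```

Running the same check against both the local TinkerPop server and the Neptune cluster would flag queries that succeed without changing anything on only one of them.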
Were you by chance able to reproduce what you are seeing with my sample graph (or a modified form of it)? Just looking for any kind of simple reproducer that will help figure this out.
Haven't tried yet. I've been talking with AWS reps all day (via the ticket and a Chime meeting). I might try later today.