Apache TinkerPop•3y ago

Testing new Neptune version

Hey all. We're currently on Neptune 1.1.0.0. To prepare for the next Neptune version, 1.2.0.2, we built a pretty big test that covers most of the flows that use Neptune. Our developer environment doesn't have Neptune, so we did the extensive testing there against a local TinkerPop server, version 3.5.4, which according to the notes is the latest version used by Neptune. After a month of testing and making sure everything worked as intended, we went ahead and upgraded our Staging environment from 1.1.0.0 to 1.2.0.2. However, we quickly started seeing issues and had to roll back ASAP to avoid problems for our developers. Is testing against a local TinkerPop server not good enough to gain confidence? What can we do to get a good understanding of the differences between Neptune and a TinkerPop server? Should we set up a Neptune server in the developer environment for the devs to test against? Please advise.

20 Replies

ShushOP•3y ago

After investigating for a bit, I've found that there is an issue with unions. So far I've only found a single query affected by this.

-- FULL QUERY, TWO UNIONS --

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').union(__.has('session', 'd4c3ced6-0883-55dd-890f-cdb13060df47').union(__.has('toSession').property('from', 1681977930197).coalesce(__.has('toSession', 8640000000000000).property('to', 8640000000000000).property('toSession', 1681977930197), __.property('to', 1681977930197)), __.hasLabel('change').property('state', 'published')), __.has('invalidateInSession', 'd4c3ced6-0883-55dd-890f-cdb13060df47').property('to', 1681977930197))

-- SAME QUERY SPLIT TO TWO, ONE UNION LEFT --

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').has('session', 'd4c3ced6-0883-55dd-890f-cdb13060df47').union(__.has('toSession').property('from', 1681977930197).coalesce(__.has('toSession', 8640000000000000).property('to', 8640000000000000).property('toSession', 1681977930197), __.property('to', 1681977930197)), __.hasLabel('change').property('state', 'published'))

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').has('invalidateInSession', 'd4c3ced6-0883-55dd-890f-cdb13060df47').property('to', 1681977930197)

-- QUERY SPLIT TO THREE, NO UNIONS --

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').has('session', 'd4c3ced6-0883-55dd-890f-cdb13060df47').has('toSession').property('from', 1681977930197).coalesce(__.has('toSession', 8640000000000000).property('to', 8640000000000000).property('toSession', 1681977930197), __.property('to', 1681977930197))

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').has('session', 'd4c3ced6-0883-55dd-890f-cdb13060df47').hasLabel('change').property('state', 'published')

g.E().has('tenantId', 'aeab812e-a8f1-447a-980f-2e2dc3e86eea').has('invalidateInSession', 'd4c3ced6-0883-55dd-890f-cdb13060df47').property('to', 1681977930197)

These are all the same query; the only difference is that instead of running it as a single query, I ran the parts separately, with fewer unions each time. Only the last variant (no unions) worked. The others look like they work (no errors), but for some reason they have no effect on the DB. To be clear: the query is 4-5 years old by now and is part of a main flow, so we know it worked for a long time prior to the upgrade.

Update: we upgraded to 1.2.1.0, since the patch notes list 2 UnionStep bug fixes. However, the issue persists in this version.

kelvinl2816•3y ago

Happy to look into this. Are you able to share the Neptune query profiles for the two non-working ones? Just to clarify, you are seeing no errors? Also, are those the exact queries? I ask as this is not valid Gremlin grammar:

union(.has('toSession')

Specifically, the dot before the has.

ShushOP•3y ago

Thank you for your response. Let me answer one by one:

Are you able to share the Neptune query profiles for the two non-working ones?

Not sure what a query profile is - can you elaborate?

Just to clarify, you are seeing no errors?

Correct. The response returns successfully, both from code and from a CLI client we set up that sends commands to the server. It is as if the query returned an empty response.

Also, are those the exact queries? I ask as this is not valid Gremlin grammar:

Oops, editing mistake. The query is valid; it's the copy-paste here that had the improper values. It did scream at me about wrong syntax a few times and I fixed it. Oh, my notepad is removing those __ because of markdown. I'll fix the query, just a few minutes. Fixed.

kelvinl2816•3y ago

Neptune has a /profile API endpoint that runs the query and returns the query plan along with a lot of runtime statistics. The profiles really help diagnose query issues. You can get a profile either by sending the query as a text string to the cluster:8182/gremlin/profile endpoint or, easier if you have one set up, from a notebook using %%gremlin profile.

kelvinl2816•3y ago

For reference https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-profile-api.html

Gremlin profile API in Neptune - Amazon Neptune

The Gremlin profile feature in Neptune runs a specified traversal, collects metrics about the run, and produces a profile report.

ShushOP•3y ago

Thank you. This is probably not simple to do with my permissions, so I will have to talk to DevOps and ask them to do it. As it is the end of the week for me (in Israel we work Sunday to Thursday), it will have to wait until Sunday. I will share it as soon as I can. In the meantime, can you please check whether there are known issues with UnionStep in the latest version? I tried finding a bug tracker or anything else that lists known issues, to help identify this one. As a side note: we have a lot of union calls and they all work fine. The only big difference I can see is that those unions are done on vertices and they work, while this one is purely on edges. Maybe it has something to do with it?

kelvinl2816•3y ago

I will see what I can find out.

ShushOP•3y ago

Most appreciated. I will supply the profile on Sunday.

kelvinl2816•3y ago

It probably also makes sense to open a support case for this if you are able to do so. I tried to reproduce what you are seeing on 1.2.1.0 using a small test graph I built and the query seemed to work. If you do open a case please DM me the case number.

ShushOP•3y ago

Will do. Thank you for doing this test.

kelvinl2816•3y ago

In case it helps, this is the simple test setup I used. If by chance you can see a way to tweak it so that it shows the potential bug, that would be great.

g.addV('test').as('v1').property(id,'v1').
  addV('test').as('v2').property(id,'v2').
  addE('change').from('v1').to('v2').
    property(id,'e1').
    property('tenantId','aeab812e-a8f1-447a-980f-2e2dc3e86eea').
    property('session','d4c3ced6-0883-55dd-890f-cdb13060df47').
    property('toSession', true)
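
For comparing results while tweaking this setup, here is a rough plain-Python model of what the full two-union query posted earlier in the thread would be expected to do to this one edge. It is only a sketch, not TinkerPop or Neptune: the dict-based edge representation and the function name apply_full_query are made up for illustration, and it assumes the union branches take effect in the order they are written.

```python
MAX_TS = 8640000000000000
TS = 1681977930197
TENANT = "aeab812e-a8f1-447a-980f-2e2dc3e86eea"
SESSION = "d4c3ced6-0883-55dd-890f-cdb13060df47"


def apply_full_query(edges, tenant=TENANT, session=SESSION, ts=TS):
    """Mutate edge dicts in place, mimicking the intent of the two-union query."""
    for edge in edges:
        props = edge["props"]
        # g.E().has('tenantId', tenant)
        if props.get("tenantId") != tenant:
            continue
        # Outer union, first branch: has('session', session).union(...)
        if props.get("session") == session:
            # Inner branch 1: has('toSession').property('from', ts).coalesce(...)
            if "toSession" in props:
                props["from"] = ts
                if props["toSession"] == MAX_TS:
                    # First coalesce option: close an open-ended edge
                    props["to"] = MAX_TS
                    props["toSession"] = ts
                else:
                    # Fallback coalesce option
                    props["to"] = ts
            # Inner branch 2: hasLabel('change').property('state', 'published')
            if edge["label"] == "change":
                props["state"] = "published"
        # Outer union, second branch: has('invalidateInSession', session)
        if props.get("invalidateInSession") == session:
            props["to"] = ts
    return edges


# The single edge created by the setup above
edge = {
    "label": "change",
    "props": {"tenantId": TENANT, "session": SESSION, "toSession": True},
}
apply_full_query([edge])
print(edge["props"]["from"], edge["props"]["to"], edge["props"]["state"])
# prints: 1681977930197 1681977930197 published
```

If the real graph ends up in a different state than this model after the query runs without errors, that difference is the thing to narrow down into a reproducer.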

ShushOP•3y ago

Looks good. I wonder if you need a non-change edge as well. We have a lot of edges that are also affected. They have a similar structure to change, with tenantId and session properties.

kelvinl2816•3y ago

I tried with and without a change edge and still got good results.

ShushOP•2y ago

Hey, some update: I've been struggling for a while to use the profiler. For some reason I'm always getting a failure message related to the profile serializer. I've tried multiple serializers, all of which give me the same error:

bash-5.0# awscurl -X POST my-neptune-endpoint:8182/gremlin/profile -d '{"gremlin":"g.V().count()", "profile.serializer":"application/json"}' --region eu-west-1 --service neptune-db
{"requestId":"03e08a2a-f5e3-4881-91a4-6de26432a3a5","code":"UnsupportedOperationException","detailedMessage":"no serializer for requested Accept header: application/xml"}
Traceback (most recent call last):
  File "/usr/bin/awscurl", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.8/site-packages/awscurl/awscurl.py", line 521, in main
    inner_main(sys.argv[1:])
  File "/usr/lib/python3.8/site-packages/awscurl/awscurl.py", line 515, in inner_main
    response.raise_for_status()
  File "/usr/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: my-neptune-endpoint:8182/gremlin/profile

(To redact it, I changed my endpoint to my-neptune-endpoint.) Additionally, I've opened a ticket with AWS, which includes most of the information shown here. DM'd you the ticket ID. I will try again tomorrow; I might have been making a silly mistake somewhere.

kelvinl2816•2y ago

If you have a notebook configured, you can just use %%gremlin profile in the cell header. If IAM is enabled, you will also need to set that using %%graph_notebook_config. Regarding awscurl, this worked for me on a non-IAM cluster; note the use of -H:

awscurl -XPOST -H "accept: application/json" https://<my-endpoint>.us-east-1.neptune.amazonaws.com:8182/gremlin/profile \
    -d '{"gremlin":"g.V().count()"}'
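
If awscurl is not at hand, roughly the same request can be built with Python's standard library. This is a minimal sketch that assumes a cluster without IAM auth (nothing here signs the request). build_profile_request is a hypothetical helper name, and the explicit Content-Type header is an assumption rather than something confirmed to be required.

```python
import json
import urllib.request


def build_profile_request(endpoint, query, port=8182):
    """Build (but do not send) a POST to the Gremlin profile endpoint."""
    url = f"https://{endpoint}:{port}/gremlin/profile"
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Ask for JSON explicitly, like -H "accept: application/json"
            "Accept": "application/json",
            # Assumption: declare the body as JSON
            "Content-Type": "application/json",
        },
    )


req = build_profile_request("my-neptune-endpoint", "g.V().count()")
print(req.get_method(), req.full_url)
# prints: POST https://my-neptune-endpoint:8182/gremlin/profile

# To actually send it (needs network access to the cluster):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```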

ShushOP•2y ago

Man, I've been trying for like 30 minutes to get it to work, and realized that Python splits the accept: application/json header on the colon followed by a space (I didn't have a space). Seems like it works now though, thanks!

Hey, now that profile works for me, I managed to reproduce the issue. I set the state of all change edges that should be affected to new, then ran the full query, which completed successfully. When I checked afterwards, the change edges were still new rather than published.
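
The check described above can be scripted so that a silent no-op gets caught automatically. A minimal sketch, assuming a hypothetical submit callable that sends a Gremlin string to the server, raises on errors, and returns the result (a single number for count() queries). The helper names are made up for illustration and no particular driver is implied.

```python
def count_new_changes_query(tenant_id, session_id):
    """Gremlin text that counts 'change' edges still in state 'new'."""
    return (
        f"g.E().has('tenantId', '{tenant_id}')"
        f".has('session', '{session_id}')"
        ".hasLabel('change').has('state', 'new').count()"
    )


def is_silent_noop(submit, mutation_query, check_query):
    """Return True if the mutation ran without error but changed nothing."""
    before = submit(check_query)
    submit(mutation_query)  # assumed to raise if the server reports an error
    after = submit(check_query)
    return before > 0 and after == before


# Stand-in for a real client: pretends 3 edges are 'new' and ignores updates.
def stub_submit(query):
    return 3 if query.endswith(".count()") else []


check = count_new_changes_query("tenant-1", "session-1")
print(is_silent_noop(stub_submit, "g.E().property('state', 'published')", check))
# prints: True
```

Running the same check against both the local TinkerPop server and Neptune would show whether the two backends disagree on the same data.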

ShushOP•2y ago

gremlin_profile.txt

ShushOP•2y ago

Here is the profile, had to send it like this as it's too big for Discord posts.

kelvinl2816•2y ago

Were you by chance able to reproduce what you are seeing with my sample graph (or a modified form of it)? Just looking for any kind of simple reproducer that will help figure this out.

ShushOP•2y ago

Haven't tried yet. I've been talking with AWS reps all day (via the ticket and a Chime meeting). I might try later today.
