How to Work with Transactions with Gremlin Python
I'm trying to implement transactions, but I'm running into two scenarios. If I start a transaction and call iterate() on every add_v, it saves to my Gremlin Server before the commit. If I take out the .iterate() and run commit(), it doesn't save to the Gremlin Server at all. What am I doing wrong?
Hi, for remote Gremlin queries (transactional or not), a terminal step has to be added in order for the traversal to be submitted to the server. So yes, iterate() has to be included on each line; otherwise the driver only constructs the queries without sending them to the server. Another terminal step you can use is next(). For more information and a list of all terminal steps, see the docs here.

Thanks for the response @Yang Xia. Sorry for my lack of knowledge, but what is the purpose of opening a transaction and committing it if iterate() already sends everything to the server? Also, .rollback() doesn't work if I try to use it after iterate() on Gremlin Server. I received a suggestion to use transactions to try to get better performance than using a query string; that's why I'm trying to implement them and compare the difference in performance.
Have to say I'm not the expert on transactions, but all transactional sessions are handled on the server side, not the client side, so all traversals still have to be submitted.

To discard your changes, are you calling tx.rollback() before or after tx.commit()? The definition of rollback is that the server will roll back all the changes since the last commit. So what happens is: when you iterate a traversal, it is sent to the server and processed. If you decide to keep the changes, you call tx.commit(); if you want to discard them, you call tx.rollback(), which discards everything up to the last tx.commit() call.
Also to confirm, I see NEPTUNE in your debug output. Are you using a local Gremlin Server or a Neptune database?

"I received a suggestion to use transactions to try to get better performance than using a query string; that's why I'm trying to implement them and compare the difference in performance."

Unsure where this is coming from. What sort of performance gain are you looking for?
If you're using Gremlin Server, what backing store are you using? TinkerGraph? If so, ensure you're using TinkerTransactionGraph.
There's more on how to use TinkerTransactionGraph for unit testing of transactions here: https://aws.amazon.com/blogs/database/unit-testing-apache-tinkerpop-transactions-from-tinkergraph-to-amazon-neptune/
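For reference, switching a Gremlin Server's TinkerGraph to the transactional variant comes down to the `gremlin.graph` setting in the graph's properties file (the file path below is illustrative; TinkerTransactionGraph requires TinkerPop 3.7+):

```properties
# conf/tinkertransactiongraph.properties (illustrative path)
gremlin.graph=org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerTransactionGraph
gremlin.tinkergraph.vertexIdManager=LONG
```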
I tried to call tx.rollback() after I did the mergeV inserts just to check if it works. It didn't roll back the inserted data. But I discarded the option of using transactions because the performance was similar to using the chained mergeV() command.
I'm spending ~150 ms to insert some vertices and edges, so I was looking for a better way to do that than using the submit method from client.Client with chained mergeV and mergeE (two requests, to avoid conflicts between edges). But I didn't see any difference using transactions. Those tests were made on AWS Neptune, with an API written in Python using the FastAPI framework. Testing performance locally is difficult because the response time is really different from the server.
If you're looking to optimize for write throughput on Neptune, you want to consider the following:
- For each write request, batch 100-200 "objects" into a single query. An "object" is any combination of a vertex, edge, or subsequent vertex/edge properties (a vertex with 4 properties == 5 "objects").
- Use parallel write requests. In Python, consider using multiprocessing to create separate processes. They can share a connection pool to Neptune if you so choose. The number of parallel processes should equal the number of query execution threads available on your Neptune writer instance (which is 2x the number of vCPUs on whatever instance size you're using).
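Those two guidelines can be sketched roughly as follows. The chunking helper is plain Python; `submit_batch()` assumes a gremlin_python `client.Client` pointed at a hypothetical Neptune endpoint, so it is only defined here, not executed.

```python
from itertools import islice
from multiprocessing import Pool

def chunked(items, size=150):
    """Split items into batches of at most `size` (aim for 100-200 objects/query)."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def submit_batch(names):
    # Import inside the function so the sketch is importable without a server.
    from gremlin_python.driver import client
    c = client.Client("wss://your-neptune:8182/gremlin", "g")  # hypothetical endpoint
    # One request per batch: chained mergeV() upserts, as discussed above.
    query = "g" + "".join(f".mergeV([(T.id):'{n}'])" for n in names) + ".iterate()"
    try:
        c.submit(query).all().result()
    finally:
        c.close()

def parallel_load(names, processes=4):
    # processes should match 2x the vCPUs of the Neptune writer instance.
    with Pool(processes) as pool:
        pool.map(submit_batch, list(chunked(names)))
```

Writing vertices and edges in separate passes (as you're already doing) keeps the parallel batches from conflicting on edge endpoints.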
If you follow those guidelines, you should get similar performance to what you would see with Neptune's bulk loader. Note that conditional writes carry overhead: if you're using mergeV(), you're unlikely to see the same write throughput as the bulk loader, since the bulk loader is not doing conditional writes.
Neptune's "top speed" for write throughput is going to be about 120,000 "objects" per second when writing vertices and vertex properties, and about half that when writing edges (due to vertex reference checks when creating an edge). These numbers can only be attained on an x.12xlarge writer instance or larger; smaller instances scale linearly in throughput.
You may see write throughput exceed 120,000 in some cases. There are a number of dependencies that drive that. But that's the safe number to use when estimating load speed/rates.
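As a back-of-envelope sketch of those numbers (the 120k/60k rates and the linear-scaling assumption come from above; the function and its parameters are illustrative):

```python
def estimate_load_seconds(num_vertices, props_per_vertex, num_edges,
                          vcpus=48, baseline_vcpus=48):
    """Rough load-time estimate. Baseline rates assume an x.12xlarge writer
    (~48 vCPUs); smaller instances scale linearly."""
    scale = vcpus / baseline_vcpus
    vertex_objects = num_vertices * (1 + props_per_vertex)  # vertex + its properties
    vertex_secs = vertex_objects / (120_000 * scale)        # ~120k objects/s
    edge_secs = num_edges / (60_000 * scale)                # ~half that for edges
    return vertex_secs + edge_secs

# Example: 1M vertices with 4 properties each (5M objects) plus 2M edges:
# 5_000_000/120_000 + 2_000_000/60_000 = 41.7 + 33.3 ≈ 75 seconds.
```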