MergeV "get or create" performance asymmetry
So I'm working on adding the mergeV step (among others) to the Rust gremlin driver. As part of that I took a pause and did a performance comparison against the "traditional" way of doing it.
The Rust driver is submitting bytecode that's effectively doing the following.
"Traditional/Reference":
^ But given a batch of 10k vertices to write, it'd do this for a chunk of 10 vertices in a single mutation traversal, running 10 connections in parallel to split up the batch until all 10k were written. It's well known that very long traversals don't perform well, and my own trials found that doing this at > 50 vertices in a single traversal would cause timeouts for my use case, so I've generally been doing 10 and calling it good.
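Shape-wise the driver-side batching is something like this (Groovy sketch for illustration only; the actual driver is Rust, and writeChunk is a hypothetical helper that submits one fold()/coalesce() upsert traversal for its chunk):

```groovy
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// ids = the 10k vertex ids to upsert (assumed in scope).
// Split the batch into chunks of 10 and write up to 10 chunks
// concurrently; writeChunk(g, chunk) is a hypothetical helper.
def pool = Executors.newFixedThreadPool(10)
ids.collate(10).each { chunk ->
    pool.submit { writeChunk(g, chunk) }
}
pool.shutdown()
pool.awaitTermination(10, TimeUnit.MINUTES)
```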
But this puts a ceiling on the amount of work a single network call can do (10 vertices' worth), which is why I started trying out mergeV() to stack more info into a single call without making the traversal prohibitively long.
And then the "mergeV()" way:
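Here the pattern is to inject a list of Maps and let the no-arg mergeV() use each incoming Map as its merge criteria (again a sketch with placeholder label/key; my driver builds the equivalent bytecode):

```groovy
// Batched upsert: one network call carries the whole chunk of maps.
// mergeV() with no argument takes the incoming Map traverser as its
// search/create criteria, creating the vertex if no match exists.
g.inject([
      [(T.label): 'device', 'deviceId': id1],
      [(T.label): 'device', 'deviceId': id2]
      // ... up to 200 maps per call
    ]).
  unfold().
  mergeV()
```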
I would run the mergeV() call with chunks of 200 vertices in each call.
Doing my trials (Cassandra & ES running locally via docker compose, with JG also running locally in that same docker compose environment), I was seeing a 2-4x improvement in writes to the graph when the vertices were all novel ids (the Reference & MergeV trials would generate distinct datasets to write for each trial).
But then I figured I should try the "get" side of "get or create", and I was rather surprised that mergeV() seemed to be significantly slower than the "traditional" way of doing it.
"MergeV redo" is the writing the same vertices again from the inital MergeV trial.
The "(All read, dataset swap)" line is running the Reference and MergeV logic again, but with the other's dataset.
I guess technically mergeV() is having to look up 200 vertices per network call whereas Reference's chunk size is only 10, but I figured I'd post the question in case this seemed weird to any JG core devs.
Reran my trial with both set to chunk sizes of 10 over the 10k batch (both still had 10 parallel connections allowed). So this reduced the MergeV chunk (what it injects into the traversal) from 200 down to 10, which I figured would make it more comparable on the lookup side. MergeV got way worse 🤔