May I suggest a new topic-channel for us? Like "really-big-data" or "pagination"?
Related to https://discord.com/channels/838910279550238720/1100527694342520963/1100853192922759244 - having read the recommended links on how to paginate the end of a query, I am wondering how to manage the large traversal sets and large side-effect collections that a query may encounter or construct as the graph is visited, when the paths yield relatively large datasets even after being wisely filtered. For example, what is advised if one really does need to group all the followers of Taylor Swift (i.e. some exemplary uber-set) by first name, and wants to bag that for a later phase of the query - a phase that isn't the final collection to be consumed by some external REST client? Yes, the final collect step can easily be paginated as advised - but what about all that earlier processing?
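To make the shape concrete, here is a minimal gremlinpython sketch of what I mean - the 'follows' edges, 'firstName' property, and endpoint are all hypothetical:

```python
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical endpoint; adjust for your server.
conn = DriverRemoteConnection('ws://localhost:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Group the (already wisely filtered) followers by first name and bag the
# resulting map as a side effect for a later phase of the same query.
by_first_name = (
    g.V().has('person', 'name', 'Taylor Swift')
     .in_('follows')                      # the exemplary uber-set
     .group('byName').by('firstName')     # the big barrier / collecting stage
     # ... later phases of the query would continue here ...
     .cap('byName')
     .next()
)
```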
What should we be thinking when we anticipate 500,000 traversals, or 10x that, heading into a group by - by - bye! barrier / collecting stage?
Other than, "Punt" or "Run away!"?
I am not sure I exactly understand your question. Let me try to rephrase it to see if I understand. You have a query where you want to paginate in the middle, after which you want to continue the traversal? (e.g. find me the first 10 followers of Taylor Swift, ordered by name, and then find me their friends and group by the common friends). You then want to paginate over that middle portion of the query (i.e. find the first 10 followers, then a second call to find followers 10-20, then 20-30, etc.)
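In traversal form, I am picturing something like this sketch (same hypothetical schema and `g` as in your snippet above), where `range_()` does the slicing in the middle of the query:

```python
# First 10 followers ordered by name, then group their friends by how
# often they recur; a second call would use range_(10, 20), and so on.
common_friends = (
    g.V().has('person', 'name', 'Taylor Swift')
     .in_('follows')
     .order().by('name')
     .range_(0, 10)                # the mid-query "page"
     .out('friend')
     .groupCount().by('name')
     .next()
)
```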
Is that understanding correct?
It's more like: I want to offer an analytical response that is going to consist of itemized elements - it's a set. That set may be large. So it needs to be consumed by a client that will ask for slices of the response. Internally, in order to find the patterns and to gather intermediate data, the query is going to encounter internal collections and traversal sets that are going to be larger than can be processed in memory.
Are there ways (Gremlin steps) to guide the query processor in handling these situations, rather than having it simply throw an exception when a time limit or a traversal count limit is exceeded?
I know that there is a lot which can be done to minimize the size of the response items - such as projecting out just the properties that are necessary versus providing the full state of an Edge or Vertex - but when our system comes to have the connection graph our users expect, there are going to be situations where an individual "step" has the responsibility to process a truly large dataset. It is not unrealistic to estimate that the set of downsampled items in the intermediate phases could be 500K to 1M.
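(For illustration, trimming with `project()` looks roughly like this, reusing the hypothetical schema and `g` from my first sketch - only the named fields cross the wire, not full vertex state:)

```python
from gremlin_python.process.traversal import T

# Return only the fields the analysis needs instead of whole vertices.
followers = (
    g.V().has('person', 'name', 'Taylor Swift')
     .in_('follows')
     .project('id', 'firstName')
     .by(T.id)
     .by('firstName')
     .toList()
)
```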
It's not that we "want to paginate" early in the queries; rather, we wonder what tactics are used to solve these kinds of situations. The only "want" is at the very end: client-users are going to want to paginate the output response when it is itself a large set, or when it is a very large single string that would exceed string buffer limits if requested all at once.
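To illustrate the asymmetry, the terminal slice is easy to express - a sketch with hypothetical paging parameters, same `g` and schema as before - but the barrier above it still materializes the whole map on every page request:

```python
from gremlin_python.process.traversal import Column

page, page_size = 0, 100  # hypothetical client-supplied paging parameters

page_rows = (
    g.V().has('person', 'name', 'Taylor Swift')
     .in_('follows')
     .groupCount().by('firstName')    # still computed in full every time
     .unfold()
     .order().by(Column.keys)         # deterministic order across pages
     .range_(page * page_size, (page + 1) * page_size)
     .toList()
)
```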
Solution
There is no feature in Gremlin that will directly handle this for you automatically, but the drivers do let you stream results back instead of collecting them all at once, which can help mitigate transferring large result sets. If you are using Amazon Neptune, it also has a query results cache to assist with paging: https://docs.aws.amazon.com/neptune/latest/userguide/gremlin-results-cache.html#gremlin-results-cache-paginating
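For example, with gremlinpython you can iterate the ResultSet as response chunks arrive, rather than materializing everything with all().result() - a sketch with a hypothetical endpoint and a print() standing in for real per-item handling:

```python
from gremlin_python.driver import client

c = client.Client('ws://localhost:8182/gremlin', 'g')
result_set = c.submit("g.V().hasLabel('person').values('name')")

# Iterating the ResultSet yields results chunk by chunk as the server
# streams them back, instead of holding the entire result in memory.
for chunk in result_set:
    for name in chunk:
        print(name)

c.close()
```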
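The paging pattern from that page looks roughly like this sketch (the `Neptune#enableResultCache` hint comes from those docs; the schema and a `g` traversal source as in the sketches above are assumed) - the first call populates the cache, and subsequent calls with the same query page through it via range_():

```python
# First call: enable the results cache and take the first page.
first_page = (
    g.with_('Neptune#enableResultCache', True)
     .V().hasLabel('person').values('name')
     .range_(0, 100)
     .toList()
)

# Later calls repeat the cached query and take subsequent pages.
second_page = (
    g.with_('Neptune#enableResultCache', True)
     .V().hasLabel('person').values('name')
     .range_(100, 200)
     .toList()
)
```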
That being said, if you have to traverse through 100Ks to millions of edges in a single traversal, that is going to take a significant amount of time and server memory, so I would expect you to run into other issues surrounding that.
Is this system expected to serve mostly transactional or analytic traffic?
Thank you for the reference. To answer your last question, the system stores relationships between items that represent transactional artifacts stored elsewhere. The large-dataset queries would be for analyses asking for a connectivity graph to know which artifacts participate in some study - such as a Size, Weight, and Power rollup over a vast collection of engineering artifacts.