Iterating over responses

I've got a query akin to this in a python application using gremlin-python:
t = traversal().with_remote(DriverRemoteConnection(url, "g")) \
.V().has("some_vertex", "some_property", "foo").values()
print(t.next())
some_property is indexed by a mixed index served by Elasticsearch behind JanusGraph, with about 1 million entries, at least for the moment. I'm still building up my dataset, so foo currently returns about 100k of the million, but future additions will change that. If I run the query as written above it times out; presumably it's trying to send back all 100k at once? If I add a limit of, say, 100, it seems like I get all 100 at once (after changing t.next() to a for loop to observe all 100). The TinkerPop docs mention a server-side setting, resultIterationBatchSize, with a default of 64. I'd expect the server to send back just the first part of the result set as a batch of 64, of which I print 1 and discard the rest. The Gremlin-Python section explicitly calls out a client-side batch setting:
The following options are allowed on a per-request basis in this fashion: batchSize, requestId, userAgent and evaluationTimeout (formerly scriptEvaluationTimeout which is also supported but now deprecated).
But I'd expect that to just be something you'd use if you want to override the server side's default of 64? Ultimately what I want to do is request some large result set but only hold it incrementally in batches on the client side, without having to hold the entire result set in memory on the client.
spmallette, 12mo ago
sorry - i don't think anyone noticed this question for some reason. unfortunately, the drivers don't work the way you're hoping. the server will continue to stream results according to batch size. it doesn't hold and wait for the client to ask for the next batch. i don't like to recommend this, but with JanusGraph, I guess you could use a session. you'd send a script to the server like:
t = g.V();t.next(64)
you'd process those results. then on the next request you could do:
t.next(64)
and keep doing that until you exhaust t. you'd want to do some sort of checking to ensure that your script doesn't end in a NoSuchElementException but you get the idea I hope. i don't like to recommend it because that approach only works if the server is using Groovy to process Gremlin scripts and not all graphs do that so your code loses portability. furthermore, I think we will be leaning even further away from Groovy in coming versions and this approach likely won't be available at some point in the future.
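For anyone who wants to try the session approach from Python, here is a minimal sketch of the paging loop described above. It is not an official recipe: the helper names iter_session_batches and open_session are made up for illustration, the URL in the usage comment is a placeholder, and it assumes a gremlin-python version (3.5 or later) whose Client accepts a session argument, plus a server that evaluates Groovy scripts. The loop stops when a request comes back empty, which relies on t.next(n) returning a shorter or empty list once the traversal is exhausted.

```python
import uuid
from typing import Callable, Iterator, List


def iter_session_batches(submit: Callable[[str], List],
                         traversal: str,
                         batch_size: int = 64) -> Iterator[List]:
    """Yield lists of up to batch_size results, one request per list."""
    # The first request stores the traversal in the session variable `t`
    # and pulls the first batch from it.
    batch = submit(f"t = {traversal}; t.next({batch_size})")
    while batch:
        yield batch
        # Later requests reuse the same server-side `t`.
        batch = submit(f"t.next({batch_size})")


def open_session(url: str):
    """Hypothetical wiring: return (client, submit) bound to one session."""
    # Imported lazily so the paging loop above can be used without the driver.
    from gremlin_python.driver.client import Client

    # Assumption: `session` is supported by the installed gremlin-python.
    client = Client(url, "g", session=str(uuid.uuid4()))

    def submit(script: str) -> List:
        return client.submit(script).all().result()

    return client, submit


# Usage sketch (needs a running server; the URL is a placeholder):
# client, submit = open_session("ws://localhost:8182/gremlin")
# try:
#     query = "g.V().has('some_vertex', 'some_property', 'foo').values()"
#     for batch in iter_session_batches(submit, query):
#         print(len(batch))
# finally:
#     client.close()
```

Only one batch is held on the client at a time, at the cost of one round trip per batch and the portability caveats mentioned above.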
criminosis (OP), 12mo ago
Out of curiosity what's the batch intended to be used for then if not for batch size to the client? The graph provider from its backing data store?
Solution
spmallette, 12mo ago
it is the batch size to the client. it just doesn't wait for the client to tell it to send the next batch. the purpose of the batch was to control the rough size of each response, otherwise you could end up with a situation where the server might be serializing too much in memory or sending responses that exceeded the max content length for a response
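To make the distinction concrete, here is a toy model of what the batch size controls. This is not TinkerPop's server code, and response_batches is a made-up name: it only shows a result stream being cut into response-sized chunks. In the real server those chunks are pushed to the client one after another without waiting for the client to ask, so the batch size bounds the size of each response, not how much the client ends up receiving.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def response_batches(results: Iterable[T], batch_size: int = 64) -> Iterator[List[T]]:
    """Group a result stream into lists of at most batch_size items."""
    batch: List[T] = []
    for item in results:
        batch.append(item)
        if len(batch) == batch_size:
            # A full batch corresponds to one response message.
            yield batch
            batch = []
    if batch:
        # The final, possibly shorter, batch.
        yield batch


# 100 results with the default batch size become two responses.
sizes = [len(b) for b in response_batches(range(100), batch_size=64)]
print(sizes)  # [64, 36]
```

With limit(100), both responses arrive on the client even if only t.next() is called once, which matches what was observed in the question.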
criminosis (OP), 12mo ago
Got it. Thanks @spmallette
