Gremlin Python trying to connect to Neptune WS when it is down
I'm trying to capture and handle the exception when the database is down with Gremlin, but instead the queries or writes wait until the max time limit, in this case the Lambda timeout, before reporting a problem. Does anyone know how to set a time limit for this case, so the connection is closed without using up the whole Lambda timeout?

26 Replies
hello @masterhugo what version of the python driver are you using?
Gremlin python 3.7.1
AWS Neptune 1.3.1.0
Looking at the reference documentation for Python driver configuration, you could try customizing the transport_factory, which has some timeout settings. I'm not sure if the read_timeout would apply to the connection timeout, which is what I believe you are looking for specifically.
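For instance, a hedged sketch (the endpoint is a placeholder) of a transport_factory with shorter read/write timeouts on the driver's aiohttp transport:

```python
# Hedged sketch (endpoint is a placeholder): customize the transport_factory
# so the driver's aiohttp transport uses shorter read/write timeouts.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.aiohttp.transport import AiohttpTransport

conn = DriverRemoteConnection(
    'wss://your-neptune-endpoint:8182/gremlin', 'g',
    transport_factory=lambda: AiohttpTransport(read_timeout=5, write_timeout=5),
)
```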
It seems the Java driver has a connection timeout setting, but I do not see one for the Python driver.
I tried that read_timeout but it doesn't work; it still keeps waiting until the Lambda max time.
And I tried passing different attributes, like the timeout from aiohttp, but the behavior was the same.
Are you referring to manually customizing the aiohttp.ClientSession's timeout configuration? Curious how you set this value? I see examples documented here
I hoped it would change it; I passed a timeout value to the DriverRemoteConnection as a param, but it didn't work 🥹
this is my connection code
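(The actual file is linked later in the thread; as a rough, hedged stand-in for the missing snippet, a typical gremlin-python connection to Neptune looks something like the sketch below, with the endpoint as a placeholder and IAM request signing omitted.)

```python
# Hedged stand-in, not the author's actual code: a typical Neptune connection
# with gremlin-python. Endpoint is a placeholder; IAM request signing omitted.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

def create_graph_traversal_source():
    conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
    return traversal().with_remote(conn), conn
```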
From what I can tell I don't see a way to set this via DriverRemoteConnection and it would require a python driver code change
This configuration made it respond after 15 seconds, but the problem continues because the Lambda still waits even though the connection returns an error.


i'd like to be sure i understand the problem here. is the problem that the driver times out at a particular point but then somehow the lambda continues to wait doing nothing until it hits its timeout period?
That's right, and it only happens after I try to call the Gremlin queries with the database disabled.
Just for clarity, when you say "disabled" is the cluster in a Stopped state? or did you delete all instances from the cluster? Or is the cluster completely deleted? instance being rebooted?
Stopped state
In this case is a Neptune Serverless with one instance
And to further clarify, you mean you actually stopped the cluster using the start/stop API (https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-stop-start.html) not that the Serverless instance had scaled to 1? (Just trying to get a clearer picture of what might be going on here).
I stopped the cluster in the console just to validate this case. In production the behavior is the same, but it only happens when I run multiple upserts and the database freezes due to high demand, so I replicated the same behavior by stopping the database. That's why I stopped the cluster.
curious what your lambda code looks like? is it possible there is some retry mechanism happening to attempt to reconnect when a connection failure is detected? The documented python driver example has such a mechanism.
AWS Lambda function examples for Amazon Neptune - Amazon Neptune
The following example AWS Lambda functions, written in Java, JavaScript and Python, illustrate upserting a single vertex with a randomly generated ID using the fold().coalesce().unfold() idiom.
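For context, the documented examples rely on a reconnect/backoff loop roughly like the hedged sketch below (the names are illustrative, not taken from the docs); if connection errors are swallowed and retried, the function can keep looping until the Lambda timeout:

```python
# Hedged sketch of a retry/backoff pattern similar to the documented Lambda examples;
# run_query is an illustrative callable, not a real helper from the docs.
import time

def with_retries(run_query, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and let the Lambda fail instead of waiting it out
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff keeps the Lambda busy
```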
I removed all the backoff I had, but the problem persists
here is my code
https://github.com/masterhugo/privateCodes/blob/main/NeptuneAdapter.py
i made the repo public 🙈
Regarding the driver timeout config I think I was wrong about not being able to configure the connection timeout specifically - this kind of config might be possible:
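The snippet meant here isn't shown in the thread, but a hedged sketch of that kind of config would be forwarding an aiohttp timeout through the transport's extra kwargs (which the transport passes on to ws_connect); as discussed further down, this didn't end up behaving as hoped in practice:

```python
# Hedged sketch (endpoint is a placeholder): extra kwargs given to AiohttpTransport
# are forwarded to aiohttp's ws_connect, so a timeout can be passed that way.
# Note: the discussion below found this did not actually shorten the connection stall.
from aiohttp import ClientTimeout
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.aiohttp.transport import AiohttpTransport

conn = DriverRemoteConnection(
    'wss://your-neptune-endpoint:8182/gremlin', 'g',
    transport_factory=lambda: AiohttpTransport(
        read_timeout=2,
        write_timeout=2,
        timeout=ClientTimeout(total=2),  # forwarded to ws_connect as an aiohttp kwarg
    ),
)
```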
Regarding the lambda still waiting for full timeout after connection error I am not very familiar with lambda retry/timeout logic but would it help to raise an error if a connection error is detected? something like:
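Something along these lines (a hedged sketch; the helper names are illustrative, not from the actual repo):

```python
# Hedged sketch: surface connection errors immediately so the Lambda fails fast
# instead of idling until its timeout. Helper names are illustrative.
def lambda_handler(event, context):
    try:
        g, conn = create_graph_traversal_source()  # assumed helper building the connection
        try:
            return run_upserts(g, event)           # assumed helper running the queries
        finally:
            conn.close()
    except Exception as e:
        # Re-raising lets Lambda report the failure right away
        raise RuntimeError(f"Neptune connection/query failed: {e}") from e
```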
Also, you can try configuring the Lambda to reduce the number of retries or the maximum event age.
Configuring error handling settings for Lambda asynchronous invocat...
You can use the AWS CLI or the Lambda console to configure how Lambda handles errors and retries for your function when you invoke it asynchronously.
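For instance, a hedged sketch with boto3 (the function name is a placeholder) that disables async retries and caps how long events stay queued:

```python
# Hedged sketch using boto3 (function name is a placeholder): disable async retries
# and cap how long events stay queued, so failures surface sooner.
import boto3

lambda_client = boto3.client('lambda')
lambda_client.put_function_event_invoke_config(
    FunctionName='your-neptune-writer',   # placeholder
    MaximumRetryAttempts=0,               # don't retry failed async invocations
    MaximumEventAgeInSeconds=60,          # drop events older than a minute
)
```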
Hmm, I'm going to try, but the problem appears after I call the query rather than at the creation of the connection; I suppose that's where the connection actually starts.
My problem persists: because of the connection timeout it still waits about 15 seconds even when I set ClientTimeout with low values, like 1 or 2.
And looking at the gremlin-python library, if I send the ClientTimeout, it isn't passed along as a kwarg in the ws_connect parameters.

🥹
The top line should still pass the different args into the aiohttp client though? The subsequent ifs are just mapping the driver specific names to aiohttp specific ones, but shouldn't affect the rest.
But yeah, not entirely sure if there's much else to do in the driver that can help; might just need to add some manual checks to throw errors in the Lambda itself?
Also, @Lyndon would you have any idea around the transport code since you've worked on it?
Hey sorry just saw this. Looking now
The intention is that when connect is called, it passes in the connection options here https://github.com/apache/tinkerpop/blob/3.7.1/gremlin-python/src/main/python/gremlin_python/driver/aiohttp/transport.py#L69 via the aiohttp_kwargs.
Now looking at aiohttp.ClientSession.ws_connect, we see that the ClientSession takes a timeout=ClientTimeout directly, with a stated default of 5 minutes and 30 seconds for the socket timeout. This was added in aiohttp 3.7, and TinkerPop expects 3.8 or greater, so it should be usable. That said, we do not actually pass any args to the ClientSession and instead pass them to ws_connect. ws_connect also takes a timeout, which is actually of aiohttp.ClientWSTimeout type.
So in theory, if you pass in a timeout of that type, it should pass right through. That said, I tried it and it didn't work for me; however, I was consistently getting a timeout of 3 seconds, not 15 like you experienced, so I am not entirely sure what is going on there.
I am not sure why this is not working, or why our environments are getting different default timeouts. Are you trying to connect to a random ip that is invalid or something different? I just tried using my docker network with an invalid ip addr to test myself.
I'm using the DNS of the Neptune cluster, with the basic VPC configuration and IAM-enabled requests. But yeah, it's only after I launch the query, with next() or maybe has_next(), that something different happens, because first I check has_next() and then I use next() to get the query results. And after Andrea's comment I now catch the exception and stop sending requests, to avoid more timeouts on the Lambda.