gremlin-python: connecting to Neptune over WebSocket when the database is down

I'm trying to catch and handle the exception raised when the database is down with Gremlin, but instead the queries or writes wait until the maximum time limit (in this case the Lambda timeout) before reporting a problem. Does anyone know how to set a time limit for this case, so that closing the connection doesn't use up the whole Lambda timeout?
26 Replies
Andrea
Andrea2mo ago
hello @masterhugo what version of the python driver are you using?
masterhugo
masterhugoOP2mo ago
gremlin-python 3.7.1, AWS Neptune 1.3.1.0
Andrea
Andrea2mo ago
Looking at the reference documentation for Python driver configuration, you could try customizing the transport_factory, which has some timeout settings. I'm not sure whether read_timeout would apply to the connection timeout, which is what I believe you are looking for specifically. It seems the Java driver has a connection timeout setting, but I do not see one for the Python driver.
masterhugo
masterhugoOP2mo ago
I tried read_timeout, but it doesn't work; it still waits until the Lambda's maximum time. I also tried passing other attributes, like timeout from aiohttp, but got the same behavior.
Andrea
Andrea2mo ago
Are you referring to manually customizing the aiohttp.ClientSession's timeout configuration? Curious how you set this value. I see examples documented here
masterhugo
masterhugoOP2mo ago
I hoped it would change it: I passed a timeout value to DriverRemoteConnection as a parameter, but it didn't work 🥹 This is my connection code:
driver_remote_connection.DriverRemoteConnection(
    pool_size=1,
    call_from_event_loop=True,
    url=base_url,
    message_serializer=serializer.GraphSONSerializersV3d0(),
    headers=aws_auth_request,
    read_timeout=1,
    timeout=1,
)
Andrea
Andrea2mo ago
From what I can tell, there is no way to set this via DriverRemoteConnection; it would require a change to the Python driver code.
masterhugo
masterhugoOP2mo ago
This configuration made it respond after 15 seconds, but the problem continues because the Lambda still waits even after the connection returns an error.
driver_remote_connection.DriverRemoteConnection(
    pool_size=1,
    url=base_url,
    message_serializer=serializer.GraphSONSerializersV3d0(),
    headers=aws_auth_request,
    transport_factory=lambda: transport.AiohttpTransport(
        read_timeout=5,
        heartbeat=2,
        call_from_event_loop=True,
    ),
)
spmallette
spmallette2mo ago
i'd like to be sure i understand the problem here. is the problem that the driver times out at a particular point but then somehow the lambda continues to wait doing nothing until it hits its timeout period?
masterhugo
masterhugoOP2mo ago
That’s right, and it’s only after I tried to call the gremlin queries with the database disabled
triggan
triggan2mo ago
Just for clarity, when you say "disabled" is the cluster in a Stopped state? or did you delete all instances from the cluster? Or is the cluster completely deleted? instance being rebooted?
masterhugo
masterhugoOP2mo ago
Stopped state. In this case it is a Neptune Serverless cluster with one instance.
triggan
triggan2mo ago
And to further clarify, you mean you actually stopped the cluster using the start/stop API (https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-stop-start.html) not that the Serverless instance had scaled to 1? (Just trying to get a clearer picture of what might be going on here).
masterhugo
masterhugoOP2mo ago
I stopped the cluster in the console just to validate this case. In production the behavior is the same, but it only happens when I run multiple upserts and the database freezes under the high demand. Stopping the cluster was my way of replicating that behavior.
Andrea
Andrea2mo ago
Curious what your Lambda code looks like? Is it possible there is some retry mechanism that attempts to reconnect when a connection failure is detected? The documented Python driver example (in "AWS Lambda function examples for Amazon Neptune") has such a mechanism.
masterhugo
masterhugoOP2mo ago
I removed all the backoff logic I had, but the problem persists. Here is my code: https://github.com/masterhugo/privateCodes/blob/main/NeptuneAdapter.py (I made the repo public 🙈)
Andrea
Andrea2mo ago
Regarding the driver timeout config, I think I was wrong about not being able to configure the connection timeout specifically. This kind of config might be possible:
timeout = ClientTimeout(
    total=5,    # Total timeout for the connection (connect + read)
    connect=2,  # Timeout for establishing the connection
    read=5,     # Timeout for waiting for data after the connection
)
return driver_remote_connection.DriverRemoteConnection(
    pool_size=1,
    url=base_url,
    message_serializer=serializer.GraphSONSerializersV3d0(),
    headers=aws_auth_request,
    transport_factory=lambda: transport.AiohttpTransport(
        timeout=timeout,
        heartbeat=2,
        call_from_event_loop=True,
    ),
)
Andrea
Andrea2mo ago
Regarding the Lambda still waiting for the full timeout after a connection error: I am not very familiar with Lambda retry/timeout logic, but would it help to raise an error when a connection error is detected? Something like:
try:
    NeptuneConnectionManager.create_remote_connection(host, port)
except (client_exceptions.ClientConnectorError, OSError) as e:
    raise RuntimeError(f"Connection to Neptune failed: {e}")
You can also try configuring the Lambda to reduce the number of retries or the maximum event age. The AWS documentation on error handling for asynchronous invocation describes how to set these with the AWS CLI or the Lambda console.
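For asynchronous invocations, the retry and event-age settings mentioned above can also be set from Python. This is only a minimal sketch, assuming boto3 is installed and credentials are configured; `build_invoke_config` and `apply_invoke_config` are illustrative helper names (not part of any library), and `my-neptune-writer` is a hypothetical function name. The ranges checked are the ones Lambda accepts for these two settings.

```python
def build_invoke_config(function_name, max_retries=0, max_event_age_s=60):
    """Build parameters for Lambda's put_function_event_invoke_config call."""
    # Lambda accepts 0 to 2 retry attempts for asynchronous invocations.
    if not 0 <= max_retries <= 2:
        raise ValueError("max_retries must be between 0 and 2")
    # Lambda accepts an event age between 60 and 21600 seconds.
    if not 60 <= max_event_age_s <= 21600:
        raise ValueError("max_event_age_s must be between 60 and 21600")
    return {
        "FunctionName": function_name,
        "MaximumRetryAttempts": max_retries,
        "MaximumEventAgeInSeconds": max_event_age_s,
    }


def apply_invoke_config(params):
    """Send the configuration to Lambda (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("lambda")
    return client.put_function_event_invoke_config(**params)


if __name__ == "__main__":
    # "my-neptune-writer" is a hypothetical function name.
    print(build_invoke_config("my-neptune-writer"))
```

These settings only affect asynchronous invocations; they do not shorten the timeout of a single invocation that is already blocked on a query.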
masterhugo
masterhugoOP2mo ago
Hmm, I'm going to try it, but the problem appears after I call the query rather than when the connection is created; I suppose that is where the connection actually starts. My problem persists: the connection timeout still waits about 15 seconds, even when I set ClientTimeout with low values like 1 or 2.
masterhugo
masterhugoOP2mo ago
And looking at the gremlin-python library, if I pass the ClientTimeout, it isn't passed on to ws_connect as a kwarg.
masterhugo
masterhugoOP2mo ago
🥹
Yang Xia
Yang Xia2mo ago
The top line should still pass the other args into the aiohttp client though? The subsequent ifs just map the driver-specific names to the aiohttp-specific ones, but shouldn't affect the rest. But yeah, I'm not entirely sure there's much else to do in the driver that can help; you might just need to add some manual checks to throw errors in the Lambda itself? Also, @Lyndon, would you have any idea about the transport code, since you've worked on it?
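One way to sketch such a manual check, independent of the driver's own timeout handling, is to run the blocking call in a worker thread and stop waiting after a fixed budget. This is an assumption-laden sketch, not driver functionality: `run_with_deadline`, `query_budget_s`, and `DeadlineExceeded` are hypothetical names, and whether a given driver call is safe to run from a worker thread needs to be verified for your setup.

```python
import concurrent.futures
import time


class DeadlineExceeded(RuntimeError):
    """Raised when the wrapped call does not finish within the budget."""


def run_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run a blocking callable in a worker thread; stop waiting after timeout_s.

    The worker thread is not killed (Python cannot do that). The caller simply
    stops waiting so it can report the failure promptly.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        raise DeadlineExceeded(f"call did not finish within {timeout_s}s") from exc
    finally:
        # Do not block on a stuck worker while shutting the executor down.
        executor.shutdown(wait=False)


def query_budget_s(remaining_ms, reserve_ms=2000, cap_s=5.0):
    """Derive a per-query budget from the time the Lambda has left."""
    return max(0.0, min(cap_s, (remaining_ms - reserve_ms) / 1000.0))


if __name__ == "__main__":
    try:
        # time.sleep stands in for a query that hangs on a stopped database.
        run_with_deadline(time.sleep, 0.1, 1.0)
    except DeadlineExceeded as err:
        print(f"failing fast: {err}")
```

Inside a handler, the budget could come from `context.get_remaining_time_in_millis()`, so the function raises well before the Lambda timeout instead of running into it.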
Lyndon
Lyndon2mo ago
Hey sorry just saw this. Looking now
Lyndon
Lyndon2mo ago
The intention is that when connect is called, it passes in the connection options here https://github.com/apache/tinkerpop/blob/3.7.1/gremlin-python/src/main/python/gremlin_python/driver/aiohttp/transport.py#L69 via the aiohttp_kwargs.

Now, looking at aiohttp.ClientSession.ws_connect, we see that the ClientSession takes a timeout=ClientTimeout directly; the stated default is 5 minutes, with 30 seconds for the socket timeout. This was added in 3.7 and TinkerPop expects 3.8 or greater, so this should be usable. That said, we do not actually pass any args to the ClientSession; we pass them to ws_connect instead. ws_connect also takes a timeout, which is actually of type aiohttp.ClientWSTimeout. So in theory, if you pass in a timeout of that type, it should pass right through.

With that said, I tried it and it didn't work for me. However, I was consistently getting a timeout of 3 seconds, not 15 like you experienced, so I am not entirely sure what is going on there. I am not sure why this is not working, or why our environments are getting different default timeouts. Are you trying to connect to a random IP that is invalid, or something different? I just tested using my Docker network with an invalid IP address.
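Since it is unclear which timeout the driver ends up applying to the connect phase, one workaround is to probe the endpoint with a bounded TCP connect before handing work to the driver. The following is a minimal sketch using only the standard library; `endpoint_reachable` is a hypothetical helper, and it checks TCP reachability only (no TLS handshake, no WebSocket upgrade, no IAM auth).

```python
import asyncio


async def endpoint_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout=timeout_s
        )
    except (asyncio.TimeoutError, OSError):
        # Timed out, refused, unreachable, or the name did not resolve.
        return False
    writer.close()
    await writer.wait_closed()
    return True


if __name__ == "__main__":
    # Probe a local port as a stand-in for the cluster endpoint.
    print(asyncio.run(endpoint_reachable("127.0.0.1", 8182, timeout_s=1.0)))
```

If the probe returns False, the function can raise immediately instead of letting the first query wait on a connection that will not open.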
masterhugo
masterhugoOP2mo ago
I'm using the DNS name of the Neptune cluster with the basic VPC configuration and IAM-enabled requests. But yeah, it only happens on the query, after I call next(); or maybe has_next() does something different, because I check that first and then use next() to get the query results. After Andrea's comment, I now catch the exception and stop sending requests to avoid more timeouts on the Lambda.
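The "catch the exception and stop sending requests" approach can be sketched as a small circuit breaker, so that later calls are skipped while the database is considered down. This is an illustration under assumptions, not the code from the linked repository: `CircuitBreaker` and `guarded` are hypothetical names, and the 30-second cooldown is an arbitrary choice.

```python
import time


class CircuitBreaker:
    """Remember a failure and reject calls until a cooldown has passed."""

    def __init__(self, cooldown_s=30.0, clock=time.monotonic):
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._opened_at = None  # None means requests are allowed

    def allow(self):
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown_s:
            # Cooldown elapsed: let the next request try again.
            self._opened_at = None
            return True
        return False

    def record_failure(self):
        self._opened_at = self._clock()


def guarded(breaker, fn, *args, **kwargs):
    """Call fn unless the breaker is open; open the breaker if fn raises."""
    if not breaker.allow():
        raise RuntimeError("Neptune marked unavailable; request skipped")
    try:
        return fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise
```

A caller would wrap each query function, for example `guarded(breaker, run_query)`, where `run_query` stands for whatever function issues the has_next()/next() calls.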
