Gremlin Python trying to connect to Neptune WS when it is down
I'm trying to capture and handle the exception when the database is down with Gremlin, but instead the queries or writes wait until the max time limit, in this case the Lambda timeout, before reporting a problem. Does anyone know how to set a time limit for this case, so the connection is closed without using up the whole Lambda timeout?

26 Replies
hello @masterhugo what version of the python driver are you using?
Gremlin python 3.7.1
AWS Neptune 1.3.1.0
Looking at the reference documentation for Python driver configuration, you could try customizing the transport_factory, which has some timeout settings. I'm not sure if the read_timeout would apply to the connection timeout, which is what I believe you are looking for specifically.
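For instance, a hedged sketch (the endpoint is a placeholder) of a transport_factory with shorter read/write timeouts on the driver's aiohttp transport:

```python
# Hedged sketch (endpoint is a placeholder): customize the transport_factory
# so the driver's aiohttp transport uses shorter read/write timeouts.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.aiohttp.transport import AiohttpTransport

conn = DriverRemoteConnection(
    'wss://your-neptune-endpoint:8182/gremlin', 'g',
    transport_factory=lambda: AiohttpTransport(read_timeout=5, write_timeout=5),
)
```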
It seems the Java driver has a connection timeout setting, but I do not see one for the Python driver.
I tried that read_timeout but it doesn't work; it still keeps waiting until the Lambda max time.
And I tried passing different attributes, like the timeout from aiohttp, but the behavior was the same.
Are you referring to manually customizing the aiohttp.ClientSession's timeout configuration? Curious how you set this value? I see examples documented here
I hoped it would change it; I passed a timeout value to the DriverRemoteConnection as a param, but it didn't work 🥹
this is my connection code
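(The actual file is linked later in the thread; as a rough, hedged stand-in for the missing snippet, a typical gremlin-python connection to Neptune looks something like the sketch below, with the endpoint as a placeholder and IAM request signing omitted.)

```python
# Hedged stand-in, not the author's actual code: a typical Neptune connection
# with gremlin-python. Endpoint is a placeholder; IAM request signing omitted.
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

def create_graph_traversal_source():
    conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
    return traversal().with_remote(conn), conn
```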
From what I can tell I don't see a way to set this via DriverRemoteConnection and it would require a python driver code change
This configuration made it respond after 15 seconds, but the problem continues because the Lambda still waits even though the connection returns an error.


i'd like to be sure i understand the problem here. is the problem that the driver times out at a particular point but then somehow the lambda continues to wait doing nothing until it hits its timeout period?
That's right, and it only happens after I try to call the Gremlin queries with the database disabled.
Just for clarity, when you say "disabled" is the cluster in a Stopped state? or did you delete all instances from the cluster? Or is the cluster completely deleted? instance being rebooted?
Stopped state
In this case is a Neptune Serverless with one instance
And to further clarify, you mean you actually stopped the cluster using the start/stop API (https://docs.aws.amazon.com/neptune/latest/userguide/manage-console-stop-start.html) not that the Serverless instance had scaled to 1? (Just trying to get a clearer picture of what might be going on here).
I stopped the cluster in the console just to validate this case. In production the behavior is the same, but it only happens when I run multiple upserts and the database freezes due to high demand, so I replicated the same behavior by stopping the database. That's why I stopped the cluster.
curious what your lambda code looks like? is it possible there is some retry mechanism happening to attempt to reconnect when a connection failure is detected? The documented python driver example has such a mechanism.
AWS Lambda function examples for Amazon Neptune - Amazon Neptune
The following example AWS Lambda functions, written in Java, JavaScript and Python, illustrate upserting a single vertex with a randomly generated ID using the fold().coalesce().unfold() idiom.
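For context, the documented examples rely on a reconnect/backoff loop roughly like the hedged sketch below (the names are illustrative, not taken from the docs); if connection errors are swallowed and retried, the function can keep looping until the Lambda timeout:

```python
# Hedged sketch of a retry/backoff pattern similar to the documented Lambda examples;
# run_query is an illustrative callable, not a real helper from the docs.
import time

def with_retries(run_query, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return run_query()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and let the Lambda fail instead of waiting it out
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff keeps the Lambda busy
```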
I removed all the backoff I had, but the problem persists
here is my code
https://github.com/masterhugo/privateCodes/blob/main/NeptuneAdapter.py
i made the repo public 🙈
Regarding the driver timeout config I think I was wrong about not being able to configure the connection timeout specifically - this kind of config might be possible:
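The snippet meant here isn't shown in the thread, but a hedged sketch of that kind of config would be forwarding an aiohttp timeout through the transport's extra kwargs (which the transport passes on to ws_connect); as discussed further down, this didn't end up behaving as hoped in practice:

```python
# Hedged sketch (endpoint is a placeholder): extra kwargs given to AiohttpTransport
# are forwarded to aiohttp's ws_connect, so a timeout can be passed that way.
# Note: the discussion below found this did not actually shorten the connection stall.
from aiohttp import ClientTimeout
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.driver.aiohttp.transport import AiohttpTransport

conn = DriverRemoteConnection(
    'wss://your-neptune-endpoint:8182/gremlin', 'g',
    transport_factory=lambda: AiohttpTransport(
        read_timeout=2,
        write_timeout=2,
        timeout=ClientTimeout(total=2),  # forwarded to ws_connect as an aiohttp kwarg
    ),
)
```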
Regarding the lambda still waiting for full timeout after connection error I am not very familiar with lambda retry/timeout logic but would it help to raise an error if a connection error is detected? something like:
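Something along these lines (a hedged sketch; the helper names are illustrative, not from the actual repo):

```python
# Hedged sketch: surface connection errors immediately so the Lambda fails fast
# instead of idling until its timeout. Helper names are illustrative.
def lambda_handler(event, context):
    try:
        g, conn = create_graph_traversal_source()  # assumed helper building the connection
        try:
            return run_upserts(g, event)           # assumed helper running the queries
        finally:
            conn.close()
    except Exception as e:
        # Re-raising lets Lambda report the failure right away
        raise RuntimeError(f"Neptune connection/query failed: {e}") from e
```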
Also, you can try configuring the Lambda to reduce the number of retries or the maximum event age.
Configuring error handling settings for Lambda asynchronous invocat...
You can use the AWS CLI or the Lambda console to configure how Lambda handles errors and retries for your function when you invoke it asynchronously.
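For instance, a hedged sketch with boto3 (the function name is a placeholder) that disables async retries and caps how long events stay queued:

```python
# Hedged sketch using boto3 (function name is a placeholder): disable async retries
# and cap how long events stay queued, so failures surface sooner.
import boto3

lambda_client = boto3.client('lambda')
lambda_client.put_function_event_invoke_config(
    FunctionName='your-neptune-writer',   # placeholder
    MaximumRetryAttempts=0,               # don't retry failed async invocations
    MaximumEventAgeInSeconds=60,          # drop events older than a minute
)
```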
Hmm, I'm going to try, but the problem appears after I call the query rather than at the creation of the connection; I suppose that's where the connection actually starts.
My problem persists: because of the connection timeout it still waits about 15 seconds even when I set ClientTimeout with low values, like 1 or 2.
And looking at the gremlin-python library, if I send the ClientTimeout, it isn't passed along as a kwarg in the ws_connect parameters.

🥹
The top line should still pass the different args into the aiohttp client though? The subsequent ifs are just mapping the driver specific names to aiohttp specific ones, but shouldn't affect the rest.
But yeah, not entirely sure if there's much else to do in the driver that can help; might just need to add some manual checks to throw errors in the Lambda itself?
Also, @Lyndon would you have any idea around the transport code since you've worked on it?
Hey sorry just saw this. Looking now
The intention is that when connect is called, it passes in the connection options here https://github.com/apache/tinkerpop/blob/3.7.1/gremlin-python/src/main/python/gremlin_python/driver/aiohttp/transport.py#L69 via the aiohttp_kwargs.
Now looking at aiohttp.ClientSession.ws_connect, we see that the ClientSession takes a timeout=ClientTimeout directly, with a stated default of 5 minutes and 30 seconds for the socket timeout. This was added in aiohttp 3.7, and TinkerPop expects 3.8 or greater, so it should be usable. That said, we do not actually pass any args to the ClientSession and instead pass them to ws_connect. ws_connect also takes a timeout, which is actually of aiohttp.ClientWSTimeout type.
So in theory, if you pass in a timeout of that type, it should pass right through. That said, I tried it and it didn't work for me; however, I was consistently getting a timeout of 3 seconds, not 15 like you experienced, so I am not entirely sure what is going on there.
I am not sure why this is not working, or why our environments are getting different default timeouts. Are you trying to connect to a random ip that is invalid or something different? I just tried using my docker network with an invalid ip addr to test myself.
I'm using the DNS of the Neptune cluster, with the basic VPC configuration and IAM-enabled requests. But yeah, it's only after I launch the query, with next() or maybe has_next(), that something different happens, because first I check has_next() and then I use next() to get the query results. And after Andrea's comment I now catch the exception and stop sending requests, to avoid more timeouts on the Lambda.