Railway•12mo ago

Losing connection to redis after migration

Hello, I recently migrated to the new Redis instance and have been getting errors randomly once a day, and I need to restart my container for it to work again. Has anyone had similar issues after migrating?

[ioredis] Unhandled error event: Error: read ECONNRESET

[ioredis] Unhandled error event: Error: read ECONNRESET

55 Replies

EddyOP•12mo ago

c256fc68-1d5c-4c39-8e86-a964d7ff66f5

Brody•12mo ago

@matt - connection reset redis

arus•12mo ago

This happens with the new postgres container as well.

matt•12mo ago

@arus can you share more details? And is there a reliable way to reproduce the issue? ty!

EddyOP•12mo ago

Could you find any reason why it happens in my project @matt ?

angelo•12mo ago

Hey there @Eddy - we had an issue with an #🚨｜incidents, the Railway team is working on a post-mortem.

arus•12mo ago

Only way to reproduce it is to wait ~3-8 hours without anyone calling one of the endpoints. Then I get an EOF detected error. I have to restart the bot container to reconnect to pg again. Hang on, I'll grab the full error. Lol and that isn't even fully true, it's been up 8 hours and I can't reproduce it yet.

Command raised an exception: OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected (Background on this error at: https://sqlalche.me/e/14/e3q8)

Already followed the guidance from sqlalchemy, it's still happening. It would sometimes happen after a month or two with the old postgres container, but now it's like 1-5 times a day.

Brody•12mo ago

are you making sure to close idle connections? seen this a lot where postgres would mark the connection as closed but the client doesn't know the connection was closed

arus•12mo ago

Yep

Brody•12mo ago

well 8 hours so far is good, if there are errors again let us know

arus•12mo ago

Seems to have been stable overnight, hopefully whatever happened yesterday fixed it (I'm in the West region mentioned in incidents)

angelo•12mo ago

Yep it was likely the outage

EddyOP•12mo ago

Yeah same here, happened 2 nights in a row and tonight it was stable

arus•12mo ago

Sounds about right. Also lol that av Angelo. I haven't seen that frog in years. All seems fine now yeah.

Down again @Angelo

angelo•12mo ago

Hmm- are you closing your connections?

arus•12mo ago

Yes. It happens after approx. 20 hours now rather than 1-6 hours. Only started happening after I migrated to the new container. And only if it's idle the entire time. The bot container is not set to sleep but sometimes seems to anyway. Resource allocation in my region maybe?

arus•12mo ago

This is the specific error I get on the client side: https://docs.sqlalchemy.org/en/14/core/pooling.html#pool-disconnects I am using the pessimistic method to recover. Looking at my logs, the main loop sometimes seems to restart while the container is running. The other behavior I notice, if I try to pull before the EOF message, is extremely delayed server responses in the region, especially when building/restarting. I'm going to give null pools a try again though and I'll let you know.
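
For readers following along, the pessimistic approach mentioned here can be sketched roughly as below. This is only an illustration, assuming SQLAlchemy's create_engine() is where the pool is configured; the helper name pooled_engine_options and the 300-second recycle value are made up for the example and do not come from the thread.

```python
def pooled_engine_options(recycle_seconds: int = 300) -> dict:
    """Build keyword arguments intended for sqlalchemy.create_engine().

    The function name and the default value are illustrative assumptions.
    """
    return {
        # Pessimistic disconnect handling: ping each connection when it is
        # checked out of the pool and replace it if the ping fails.
        "pool_pre_ping": True,
        # Do not reuse a pooled connection older than this many seconds;
        # it is replaced the next time it is checked out.
        "pool_recycle": recycle_seconds,
    }


# Intended usage (needs SQLAlchemy installed, so it is left as a comment):
#   from sqlalchemy import create_engine
#   engine = create_engine(database_url, **pooled_engine_options())
```

A short recycle interval trades a few extra reconnects for a lower chance of handing out a connection that the server or a proxy has already dropped.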

Brody•12mo ago

im still thinking that your problem is related to keeping stale connections around. this problem is mentioned in the knexjs docs. it's a javascript package, but the same can apply to any pooled postgres client within a docker environment. https://knexjs.org/guide/#pool

It can result in problems with stale connections

arus•12mo ago

I'll take a look.

arus•12mo ago

Yeah, I'm following this guidance, which won't use a connection without checking it first. https://stackoverflow.com/a/66360789

Stack Overflow

psycopg2.OperationalError: SSL SYSCALL error: EOF detected on Flask...

I have an app that was written with Flask+SQLALchemy+Celery, RabbitMQ as a broker, database is PostgreSQL (PostgreSQL 10.11 (Ubuntu 10.11-1.pgdg16.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubu...

arus•12mo ago

Given this other article though, it does align with the lag theory. The lag could be exceeding the keep alive.
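
If lag exceeding the keepalive is the suspicion, one thing that could be tried is setting TCP keepalive explicitly on the client. The sketch below assumes psycopg2 under SQLAlchemy, where connect_args is forwarded to the driver and on to libpq. The parameter names are libpq's; the helper name and the specific numbers are illustrative guesses, not values recommended in the thread.

```python
def keepalive_connect_args(idle: int = 30, interval: int = 10, count: int = 5) -> dict:
    """libpq keepalive settings, meant to be passed as connect_args.

    Helper name and default values are assumptions for illustration.
    """
    return {
        "keepalives": 1,                  # enable TCP keepalive on the socket
        "keepalives_idle": idle,          # seconds of idle time before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before the link is declared dead
    }


# Intended usage (needs SQLAlchemy and psycopg2, so it is left as a comment):
#   engine = create_engine(
#       database_url,
#       pool_pre_ping=True,
#       connect_args=keepalive_connect_args(),
#   )
```

With these example values a dead link would be noticed after roughly 30 + 10 * 5 = 80 seconds of silence, rather than whenever the next query happens to run.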

Brody•12mo ago

sorry do you mean you're going to use pool_pre_ping=True going forward, or have you already been using it?

arus•12mo ago

I have been already.

Brody•12mo ago

are you using the private url?

arus•12mo ago

https://stackoverflow.com/a/66515677 I don't recall. Let me check.

Stack Overflow

Postgres SSL SYSCALL error: EOF detected with python and psycopg

Using psycopg2 package with python 2.7 I keep getting the titled error: psycopg2.DatabaseError: SSL SYSCALL error: EOF detected It only occurs when I add a WHERE column LIKE ''%X%'' clause to my

arus•12mo ago

roundhouse.proxy.rlwy.net

Same variable as before, but the migration populated a lot of it. I can try the private url.

Brody•12mo ago

can you try using the DATABASE_PRIVATE_URL variable
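
A small sketch of how an app might prefer the private URL when it is present. DATABASE_PRIVATE_URL is the variable named above; DATABASE_URL as the fallback name and the helper pick_database_url are assumptions for illustration, so substitute whatever variable the service actually exposes.

```python
import os


def pick_database_url(env=None) -> str:
    """Return the private URL if set, otherwise fall back to the public one.

    DATABASE_URL is an assumed name for the public variable.
    """
    env = os.environ if env is None else env
    for name in ("DATABASE_PRIVATE_URL", "DATABASE_URL"):
        value = env.get(name)
        if value:
            return value
    raise RuntimeError("neither DATABASE_PRIVATE_URL nor DATABASE_URL is set")
```

Falling back silently to the public proxy hides misconfiguration, so logging which variable was chosen may be worthwhile while debugging.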

arus•12mo ago

Yeah, let me switch off my phone. Alright, deploying. I'll let you know if it stays connected.

Brody•12mo ago

sounds good

arus•12mo ago

Alright, it won't connect to the private url. Says the hostname can't be found.

Brody•12mo ago

building with nixpacks?

arus•12mo ago

could not translate host name "postgres.railway.internal" to address: Name or service not known

yes

Brody•12mo ago

can you try adding a 3 second sleep to the beginning of your start command?

arus•12mo ago

Yeah one sec. nope

Brody•12mo ago

postgres is in the same project right?

arus•12mo ago

Yes

I'm going to try adding sleep to my cog setup functions.

Didn't work either

my nixbuild runs this. docker run -it us-west1.registry.rlwy.net/

let me check if postgres is in the same region.

Bleh, yes, I'm hobby plan too

So I couldn't change it if I wanted to

Brody•12mo ago

does the dns lookup that SQLAlchemy does support ipv6?

arus•12mo ago

Pretty sure it does. Let me check this version real quick. yes, it does.

Brody•12mo ago

does the start command in the build table at the top of the build logs confirm that there is a sleep 3?

arus•12mo ago

No, but that code is wrapped in a script.

Brody•12mo ago

can you change your start command to sleep 3 && <your current start command>

arus•12mo ago

yeah one sec. okay, looks like it didn't explode this time.
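
The fix used here is a fixed sleep 3 in the start command. As an alternative that was not used in the thread, the app could poll until the private hostname resolves before opening the pool. The sketch below uses only the standard library; wait_for_dns and its defaults are hypothetical, and the injectable resolve and sleep arguments exist only to make it easy to test.

```python
import socket
import time


def wait_for_dns(host, attempts=10, delay=1.0,
                 resolve=socket.getaddrinfo, sleep=time.sleep) -> bool:
    """Retry name resolution until it succeeds or attempts run out.

    getaddrinfo is called without a family restriction, so IPv6-only
    names are accepted as well as IPv4 ones.
    """
    for attempt in range(attempts):
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # Name not resolvable yet; pause before the next try,
            # but do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(delay)
    return False


# Intended usage before creating the engine:
#   if not wait_for_dns("postgres.railway.internal"):
#       raise SystemExit("private network DNS never became available")
```

Polling returns as soon as the name resolves and fails loudly if it never does, whereas a fixed sleep either wastes time or is too short.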

Brody•12mo ago

make sure you are using a healthcheck now though https://docs.railway.app/guides/healthchecks-and-restarts

arus•12mo ago

Not entirely sure how I'm going to do that just yet, but i'll look into it.

Brody•12mo ago

do you already have a web framework in place or is this a bot app?

arus•12mo ago

Bot

Brody•12mo ago

ah then dont worry about the health check

arus•12mo ago

I'm thinking of adding some short lived auth endpoints to connect users to their own content though, so I'll probably throw it in when I do that. Alright, gonna let this sucker idle for a couple days and see if the errors are done.

Brody•12mo ago

yep that would be the time to add a healthcheck

sounds good
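
For when that time comes, a bot without a web framework could expose a tiny health endpoint from the standard library. This is a rough sketch that assumes the platform probes an HTTP path and treats a 200 response as healthy, and that the listening port arrives in a PORT environment variable. The /health path, the 8080 fallback, and the function name are all illustrative choices.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep probe requests out of the bot's logs.
        pass


def start_health_server(port=None):
    """Serve /health on a daemon thread and return the server object."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))  # PORT is an assumed variable name
    server = ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling start_health_server() once before the bot's main loop would be enough. This endpoint only reports that the process is alive; having it also run a cheap database query would be needed to catch a dead connection pool.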

arus•12mo ago

Thanks! Down again.

arus•12mo ago

I checked, the client reported connecting but not receiving a response.

EddyOP•12mo ago

Same here

1:M 11 Dec 2023 16:24:55.577 # Possible SECURITY ATTACK detected. It looks like somebody is sending POST or Host: commands to Redis. This is likely due to an attacker attempting to use Cross Protocol Scripting to compromise your Redis instance. Connection from 192.168.16.4:***** aborted.

1:M 11 Dec 2023 16:24:55.577 # Possible SECURITY ATTACK detected. It looks like somebody is sending POST or Host: commands to Redis. This is likely due to an attacker attempting to use Cross Protocol Scripting to compromise your Redis instance. Connection from 192.168.16.4:***** aborted.

devon•10mo ago

This is still happening for me, crashes roughly every 4 days. Did anyone figure out a clear resolution? Otherwise I'm going to need to migrate away from Railway entirely as this is not stable for production. Note I have healthchecks and everything.
