Did something related to private networking fundamentally change in the past couple of days?
We have a simple Django app and use Railway's Redis.
Over the past couple of days, we've been seeing the following happen:
Deploy the server -> In a couple hours, we start getting
ERROR:django_datadog_logger.middleware.error_log:Error -2 connecting to redis.railway.internal:6379. Name or service not known
where all connection attempts to Redis fail.
Our code hadn't changed, and we just wanted to double-check whether anything fundamentally changed with the network layer.
Project ID: 120b5ec5-59d8-4087-84ae-4e0b3d934aa7
does it connect immediately to Redis over Private Networking?
I've found it's needed to add a short 3 second delay before connecting over Private Networking
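something like this works with redis-py, as a sketch (not your exact code, the names are illustrative):
```python
import os
import time

import redis  # redis-py

# Sketch: wait a few seconds after the container starts so the private
# network (and its DNS) is up before the first lookup of redis.railway.internal.
PRIVATE_NETWORK_DELAY_SECONDS = 3

time.sleep(PRIVATE_NETWORK_DELAY_SECONDS)

pool = redis.ConnectionPool.from_url(os.environ["REDIS_URL"])
redis_client = redis.Redis(connection_pool=pool)
```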
Yeah, the first couple of hours after a deploy are fine, and the delay we need at startup isn't the issue either.
I don't believe anything has fundamentally changed with Private networking
if it was previously working it could have been good luck maybe
Didn't you guys have an outage with k8s 503s?
@macwilko We've been running this with quite a lot of users for the past 9 months so I think good luck is a stretch
We haven't touched the code there for over a week which is quite suspicious. Only started happening after the 503 issue
Can we escalate this somehow and get a bit better treatment? We're paying quite a bit to railway after all
@thomas - would you know or could you check to see if there were any changes with the private network or any of the surrounding infrastructure?
my testing indicates zero behavior changes, but who knows
Thanks @Brody!
We're moving over to Azure Redis for now
Do you guys use Azure or AWS? We can use either but we'd prefer to be on the same cloud provider
cc @Brody
railway runs on GCP at the moment
do you see similar problems if you connect over the public url?
I am looking at this now. But the next level of support is direct support.
you are using the private url
@thomas is this not the way to get that? or how should we tap into it
Right we're using the REDIS_PRIVATE_URL. My fault
So at the moment, we have 3 tiers of support, community, team, and direct. They correspond with Hobby, Team, and what I would call enterprise. Direct support you get a private channel with the team on slack or discord.
I remember - when we used the public url, the latency went up by 5~10x when we migrated to v2 from v1 in december, which is why I specifically changed to use the private url. Using the private url brought down the latency
@unstoppablerizz Which service is throwing these logs?
And yes using the private URL is recommended for links between all services.
Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc
Basically, after those spikes happen, the connection to redis doesn't seem to work at all
I was trying to look through the logs but I keep deploying over the old instance so they get truncated
that screenshot is from the Redis instance?
No, it's the screenshot from Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc which is our Django server instance
Maybe it’s that we’re exhausting the max simultaneous connection limit of Redis somehow? Do you guys monitor this
i dont think you are opening 10k conns
That's not a bad path to consider. Do all connections drop or just some? You said that all connection attempts to Redis fail in the first part of this.
Considering the replications you have running and the scale of the CPU you are using it's possible
side question, are you deploying your app with nixpacks, or a dockerfile?
I am bringing this up internally. Brody also is one of most helpful community members.
i'm pretty sure this is char's wheelhouse, he would instantly know if there was any changes to the internal dns resolver
@unstoppablerizz Is there a reason all your logs for this service are piped to stderr? the red wall is not what I would consider normal logging practices and honestly sent a shiver down my spine. (this is unrelated to your issue, but I am honestly curious)
Dockerfile
I have been made to understand that is standard for some libs
is it based on an alpine image?
Are you establishing a new Redis connection for each API request or using a pool of connections?
We have datadog so that’s how we set up monitoring / tracing / alerts, we don’t really read the server logs on Railway because it’s not really usable
Let me double check
We’re using a pool
So you shouldn't be running out of connections unless your pool is over 10000?
Yeah, at your scale, we understand that, and this is one of the things that we will be improving.
When we started seeing the issues, we were using the pool + Django’s default cache backend module which I believe does pool connections
Though even if we don’t use a pool, I don’t think our usage is that much to warrant this issue. And besides, our load didn’t drastically change from last week
The most recent failure (one at 1:40am) could be because I explicitly defined connection pool of 2000 which may have exhausted the # of connections, but this wasn’t the case before last night
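for reference, that explicit pool is roughly this shape with redis-py (a sketch; the 2000 cap is the part in question):
```python
import os

import redis

# Sketch of a capped pool. Railway's Redis was said to allow on the order of
# 10k connections, so a 2000-connection pool per replica adds up across replicas.
pool = redis.ConnectionPool.from_url(
    os.environ["REDIS_URL"],
    max_connections=2000,
)
redis_client = redis.Redis(connection_pool=pool)
```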
were you able to check?
We're using
python:3.11
base image. Checking if that's alpine
that's Debian based
there goes that idea lol
Oh I think I saw the spike again a couple tens of minutes ago
Do you have a graph of the frequency of the error from Datadog?
Yeah the tricky thing with this is that it doesn't get traced properly
We unfortunately only have the wall of red
But since it's pretty fresh, you should be able to see the socket errors happen from that point
starting 10:06am (first bump)
If not, then after 10:34am
I have paged infrastructure on-call, and they are looking into this as well now. We are attempting to determine if this is an incident or not.
We're also not logging everything to stderr - we're only logging errors to stderr (have log level set to ERROR)
Understood
But 400s are also "logged" as errors which can be safely ignored most of the time
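our logging setup is roughly this, as a sketch (not the exact settings file):
```python
# settings.py (sketch) - a single ERROR-level handler writing to stderr.
# Anything the app or middleware logs at ERROR - including handled 400s -
# ends up as the red wall on Railway.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "stderr": {
            # logging.StreamHandler writes to sys.stderr by default
            "class": "logging.StreamHandler",
        },
    },
    "root": {
        "handlers": ["stderr"],
        "level": "ERROR",
    },
}
```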
I see what you mean by intermittent
trying to look at what happens at 10:08:20
Because I think that's when it started again
Just some quick feedback - I don't think our multiline logs are actual individual log entries; it's just the newlines of a single traceback. That's probably what makes me never want to use the logs, because it's impossible to follow a trace
I didn't bother setting a jsonformatter
I do notice something odd - our server seems to restart right before that happens. Not sure why
?: (staticfiles.W004) The directory '/code/static' in the STATICFILES_DIRS setting does not exist.
<- this only happens on server start
Also, pagination on search -> load more is wrong. Likely some issue with cursor-based pagination
at what point in your app's lifecycle is the redis pool created? at start?
Defined in the global scope, yes
ok, that's a good clue
at startup?
Python Redis pool is threadsafe so it should be fine
yeah...perhaps the restarts somehow eat up the connections
The private network takes time to initialize after a server restart.
But the thing is that it should go away then
yeah I don't think that's the issue
my current thinking is that their app is crashing, then re-creating the pool on startup before the network initialization has finished, thus resulting in DNS errors
The thing is our code hadn't changed before this started happening afaik
I'll double check my commits
the spikes are a good sign that the app crashed
That's a counterpoint, but is this load level reaching new heights?
If you search for "System check identified some issues" in the logs, you can see that it's sparingly found (only happens on deploy). Starting Jan 16th, it happens a lot more even when we're not deploying.
No not at all
Barely more, and we've had much more before
Though double-checking
do you have a 3 second sleep before you are creating the redis pool on startup?
We don't, but startup already takes ~10 seconds anyway for Django since we run migrations on server start before we initialize the server, so there's no way that's causing a connection issue
Besides it never was an issue
basically all the random restarts here I think could be the source of the issue. We hadn't done any deploys at that time, so my guess is something about Railway's infra changed at that time
i have historical data and the network initialization times have not changed, still sitting at an average of 1.57 seconds
definitely worth looking into why your app is crashing
It could be something about a k8s setting that you guys are using internally, or how hardware is allocated for the pods
Though I have no idea about the internals
Do you have a timestamp for the time you're referencing?
Exactly around Jan 16, 18:00. Not sure if this is UTC
The only new deployment I see from our side after Jan 13th is on 17th, and I think it was a reaction to the errors we were seeing
We do ask that initial private network connection attempts be retried, since we bring the network up within the first ~100ms: docs
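a minimal retry on the first connection would look something like this with redis-py - just a sketch, the backoff numbers are illustrative rather than anything from the docs:
```python
import time

import redis


def connect_with_retry(url: str, attempts: int = 5, delay: float = 0.1) -> redis.Redis:
    """Retry the initial connection so a private network that isn't ready yet
    doesn't take the worker down on startup."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            client = redis.Redis.from_url(url)
            client.ping()  # forces a real connection and DNS lookup
            return client
        except (redis.exceptions.ConnectionError, OSError) as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_error
```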
side note, railway doesn’t use k8s for any user workloads, and as far as i know they haven't changed the runtime yet
I am looking at our commit PR log to check
I think if you look back and forth 2~3 hours from there, maybe something can be found? But I'll also continue diagnosing
We recommend you change your restart policy to always restart for your API service
We ran into a strange condition where one of your replicas was in an exited state.
ok changed it
Still looking into other things on our side
just something that was noticed
where did you get this date, and what would your local time be if this isn't UTC
The date is from the Railway log itself
so whatever it is
Ok it seems like it's PST based given what I see in the most recent log matching my timezone
so Jan 16, 18:00 PST
i believe logs should be in UTC on Railway
I meant just from my screenshot
I see this in the most recent screenshot, which means the display is based on my current timezone. So, given that the screenshot I was referring to for when the errors started shows ~Jan 16, 18:00, thomas should look at the commits using PST.
I'm sure on the backend they save it as UTC
Yes, but we do convert it to local in the UI!
So that time is in PST
One of our infra engineers asks if you can modify the code to log the IP of redis.railway.internal when you init a connection, so we can confirm whether it's DNS related or not. If you put this in where the connections are started in the pool, that should avoid flooding the logs.
😓 I've done this before but it was a lot of trial and error - do you happen to have the python code for it
https://www.tutorialspoint.com/python_network_programming/python_dns_look_up.htm - you want to look up the AAAA value
You can share your connection code and Brody or I might be able to help
Just double checking - am I supposed to look up redis.railway.internal:6379?
just the hostname
no, that's an internal log sorry
We're only setting the REDIS_URL, which contains everything
redis://<user>:<password>@<host>:<port>
so if that works sometimes, it should always work
@Brody .query is deprecated
perhaps this works?
Though I'm pretty skeptical that it's going to work on the entire redis connection string
please only look up the host
monorail.proxy.rlwy.net
just this, right?
that's the public host
I know, just the form of it
ah, then yes, that's a hostname
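if it helps, the lookup can be done with just the stdlib, called right where the pool is created (a sketch; the logger name and URL parsing are illustrative):
```python
import logging
import socket
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def log_redis_aaaa(redis_url: str) -> None:
    """Log the IPv6 address the Redis hostname resolves to at pool-creation time."""
    host = urlparse(redis_url).hostname  # just the host, no credentials or port
    try:
        results = socket.getaddrinfo(host, None, family=socket.AF_INET6)
        addresses = sorted({item[4][0] for item in results})
        logger.error("Resolved %s -> %s", host, addresses)  # ERROR so it shows in our logs
    except socket.gaierror as exc:
        logger.error("DNS lookup for %s failed: %s", host, exc)
```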
@unstoppablerizz can you tag me when this happens again
we have an idea we would like to try but we need it to be happening
sure
I have a larger update coming, we did find something. Just confirming best next steps with the team
@Conductor This user can override rule 5 today and tag me directly.
for complete transparency Thomas, in my testing over a span of 7 days and 600k successful internal DNS lookups, I get about 6 to 7 failed DNS lookups. Though this doesn't explain their issues, and my test doesn't cover all the deployed resolvers, I just thought I'd mention it
To reproduce this you would have to crash your container then use the internal network immediately.
It is likely that your test is so stable that it doesn't reproduce the rare race condition we are looking at
my thoughts exactly, but 6 or 7 errors is something, though that's only, what? 0.0010% error rate I'm seeing
@Brody redis.railway.internal has AAAA record fd12:f325:1eed::32:ffc6:b2dc
that's the correct value
you've made that lookup in code at the exact same time the pool is created right?
Yes, this confirms the DNS is fine and our update is more relevant, we are still working on it sorry
yes
@thomas Would love to hear what the postmortem was just out of curiosity once it's settled
we are hammering it out currently
Though I'm still not sure 100% that there's nothing I can improve from our side - looking to see how we can manage things better
super long shot, in node there's a redis package called ioredis and it has a setting called family where you set it to either use ipv6 or ipv4, maybe your redis client in python has something similar? where you could default it to ipv6
I've had the ipv4 ipv6 issue with datadog before but I don't think that's the root cause here
We have checked the DNS logs for the host that was running the application which exhibited the "Name or service not known" errors and found no issues there. We've asked you to log the DNS resolution in your app to also help confirm whether it is DNS related.
However, given that the issue occurred only after a restart we suspect this might be related to a very rare race condition in private networking that we have recently discovered and you are one of the first customers to also mention it. A patch was in the works already but we have our engineers prioritizing this to get it out right away.
As the patch is being deployed we also have a platform engineer on stand-by throughout this process to troubleshoot more and provide support to myself and you. We have some mitigations you can try for the interim:
We recommend you delay connecting to the private network for 3 seconds after your service starts. The race condition comes from the private networking interfaces being modified by the kernel when they receive traffic before they have finished initializing their routing tables. We'd like this change to be removed afterwards, since it's not ideal to wait 3 seconds for your app to start, but it should mitigate the issue entirely.
If you have an application that gets into this state please let us know as we have some tooling for us to go in and check if the issue is a result of the race condition we suspect.
Just adding that if you don't implement the 3 second delay, that's your call because this fix should be rolling out today. We just wanted to put that there to give you options
Probably not this but I honestly couldn't tell you if it affects this race condition or not.
smol question will this fix mean that I won't need the 3 second delay before connecting to private Redis in future?
I have a Go api and I'd very much like to remove the 3 second delay I added
Wait I didn't read this. You should never have needed that long of a delay. We try to bring the private network up in 100ms.
Generally, we recommend that you just retry your connections for a little while after the server starts.
This race condition is specifically for services that restart under a lot of load then flood the private network, which then breaks in a way that doesn't resolve until the service restarts
Also Brody said this, which means we might be missing our 100ms mark, but still not 3s
It's going to depend on a bunch of things
3 seconds is a good bet (data from 2 days)
brody's figure is more accurate, I wish it was 100ms...
it will be 0, but that's off topic
Just to double check - was it a regression introduced around the range that we found (Jan 16th, 18:00 PST)? Or did we walk into it somehow
@thomas out of curiosity
We are still unsure; it's hard to eliminate any change to the kernel or our host OS in general. But we didn't find a smoking gun
Heya - so the issue here seems to be a very narrow race between when there's a high volume of traffic while we are initialising privnets. There was a gap between two syscalls in that init process where a packet of traffic could get the IPv6 neighbour table into a bad state.
We only spotted this yesterday cos' we built a new chaos monkey system to stress test with frequent deploys and monitor network connectivity. The race has existed since we launched the feature, and no users previously reported it. It's unlikely it was a regression; more probably you ran into it by accident - most likely because the containers crashed and restarted into a high volume of traffic. Redeploys etc. get a gradual traffic ramp, so they're very unlikely to trigger it.
We're working on a privnets v2 - rewrite of the BPF programs underneath, more control over injection, and a bunch of other improvements to the overall runtime. It's been a slow process to rollout because it's complex and has a wide impact, it's not available to users atm - we're slowly landing it in parts - I'll ping back here once that's launched (don't want to commit to timelines until I get it out to Beta), but I'm actively on that 100% of my time.
Solution
@unstoppablerizz
The fix for this race condition has been merged and deployed. This should not happen to you again. If you want to remove the 3 second delay (assuming you added one), you can, but as Brody showed us, that will generally just result in errors. However, if Django already takes long enough to start, you might not need an additional delay.
If this happens again, please reopen this thread.
@thomas The error is actually ongoing right now
since 16:14
what time zone?
Good Morning my time.
I did a quick check of the logs and noticed that it did happen again starting around 1600 and ending around 2000. It doesn't seem to have happened again after that
is my filter not showing this properly or do you observe that as well?
I also would like to note there were some memory spikes the day after, on the 20th, around redeploys, but overall your memory usage has smoothed out. I can't tell if this is from replicas not restarting or if you just experience a smoother load over weekends
We are still interested to hear how this issue might be impacting your services, so please let us know if it is persisting. I know you mentioned the error is still ongoing, but Thomas mentioned he couldn't verify that. We know for sure that the race condition we suspected you were hitting is in fact resolved; now it's just confirming whether that was the issue you ran into.
The issue is still persisting and we ended up just adding a sleep
There's a separate service for which we're experiencing the same issue of the internal host not being found
Yeah we're trying to connect to the private URL of the new redis server for the migration and that's not being found:
2024-01-26 20:16:35 [line-worker] error: Line worker error: Error.getaddrinfo ENOTFOUND redis.railway.internal. Error: getaddrinfo ENOTFOUND redis.railway.internal
cc @Brody @Mig @thomas
do you have a project with that error right now ?
if so I can quickly confirm if it's the issue we expect
yeah i had to roll back though
understandable
Deployment ID: b2888f70-317a-404c-8910-8702eb63bc36
Do you have a delay in the start of your app to avoid the race condition ?
I thought that was solved? Should I still add it back in the code?
Maybe it was a miscommunication
We thought that the 3s delay was no longer necessary
no you understood correctly. We did deploy a patch for that but if you're still experiencing the issue it might be something else or the patch was not right (highly doubt this, we saw the bug, did patch, no longer could reproduce)
I can try to quickly redeploy one of our dummy services for testing
with a 3s delay
if there isn't too much impact by having the 3 second delay I would add it for now to try and mitigate the error.
from my data, a 3 second delay is still necessary as the internal dns server will not respond to any dns lookups until after about 2-3 seconds
ah ack! thanks brody
Brody - the patch prevents unrecoverable state but I believe you're right that DNS could (seems often) fail for the first 3 seconds.
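one alternative to a blind sleep - just a sketch, not something Railway prescribes - is to poll until the resolver answers before creating the client:
```python
import socket
import time


def wait_for_dns(host: str, timeout: float = 10.0, interval: float = 0.25) -> None:
    """Block until the internal resolver answers for `host`, instead of a fixed sleep."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            socket.getaddrinfo(host, None, family=socket.AF_INET6)
            return
        except socket.gaierror:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# e.g. wait_for_dns("redis.railway.internal") before constructing the Redis client
```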
but actually
let me check this
our worker will attempt to reconnect forever
even if it fails at first shouldn't it eventually connect
it was failing for like a minute
over 7 days the average internal dns resolver ready time is 1.56s with some very intermittent spikes to 5+ seconds
but if it's ready after 2s, shouldn't the reconnection work after 2s
or 5s
yeah, 100% it should
is this because it's using the original
new Redis(REDIS_URL)
client?
if it does not, that's a significant bug. The 3 second startup is a caveat that can be worked around, but DNS resolution failing after that is not acceptable.
i'm wondering if it's a detail of how ioredis on node handles this
oh wait, you're using ioredis? I thought this was python
if the retry mechanism re-uses the broken config I could see how future attempts just continue to error
ya @Brody
should i specify ?family=6?
do you have family set to 6?
?
ahhhhh i see
absolutely
otherwise ioredis will only use ipv4
i see
this explains everything
i'm sorry
then
that's not your fault
it's ioredis
I could have sworn this was a python app we were talking about so ioredis didn't cross my mind
haha that is our other service
my bad
is that service still experiencing problems?
checkout the caveats on private networking https://docs.railway.app/reference/private-networking#caveats
setting ioredis to use ipv6 is not obvious from our caveats page, but it might be helpful to know all the known issues
@Brody i think we had to add the sleep
for the rizz service
@Mig thank you!
Melissa has added a thing about ioredis elsewhere in the docs
ohhh you're rizzgpt. I saw that project and thought it was an interesting name. :p
it's a very cool product too!
haha thank you!
the node issue is for heycami.com
which is even higher usage
try the ipv6 setting to ensure the correct dns resolution is used and report back
on it!
I'm really trying to fix this issue for y'all
Clarification: the patch we deployed prevents an unrecoverable state for the private networking routes. There is still currently a ~3 second startup delay where DNS resolution may fail. If the 3 second delay is not too impactful, I'd recommend leaving it in. We are working on removing that startup delay this quarter because it is definitely unreasonable to make you work around it. Just so I understand the impact on you, has this 3 second delay mitigated the issue?
DNS resolution should work after the first few seconds.
For rizzgpt, yes. We'll see if the ipv6 and the delay fixes things for cami as well
the 3 second delay won't have any impact as long as you are using a readiness healthcheck!
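for the Django service, a readiness view that pings Redis is enough for the healthcheck to gate traffic - a sketch, the import path and URL wiring are hypothetical:
```python
# health.py (sketch) - point Railway's healthcheck path at this view so a
# replica only receives traffic once it can actually reach Redis.
from django.http import HttpResponse, JsonResponse

from myapp.redis_client import redis_client  # hypothetical module exposing the pool


def readiness(request):
    try:
        redis_client.ping()
    except Exception as exc:  # report any failure as not-ready
        return JsonResponse({"status": "unavailable", "error": str(exc)}, status=503)
    return HttpResponse("ok")
```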
ok ty
@ me if y'all want. I'm going back to working on our new edge proxy 🙂
using family 6 did the job!
(forgot to mention earlier)
awesome
thanks Brody 🐐