Did something related to private networking fundamentally change in the past couple of days?
We have a simple Django app and use Railway's Redis.
Over the past couple of days, we've been seeing the following happen:
Deploy the server -> In a couple hours, we start getting
ERROR:django_datadog_logger.middleware.error_log:Error -2 connecting to redis.railway.internal:6379. Name or service not known
where all connection attempts to Redis fail.
Our code hadn't changed, and we just wanted to double-check whether anything fundamentally changed with the network layer.
Project ID: 120b5ec5-59d8-4087-84ae-4e0b3d934aa7
does it connect immediately to Redis over Private Networking?
I've found it's needed to add a short 3 second delay before connecting over Private Networking
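something like this works with redis-py, as a sketch (not your exact code, the names are illustrative):
```python
import os
import time

import redis  # redis-py

# Sketch: wait a few seconds after the container starts so the private
# network (and its DNS) is up before the first lookup of redis.railway.internal.
PRIVATE_NETWORK_DELAY_SECONDS = 3

time.sleep(PRIVATE_NETWORK_DELAY_SECONDS)

pool = redis.ConnectionPool.from_url(os.environ["REDIS_URL"])
redis_client = redis.Redis(connection_pool=pool)
```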
Yeah, the first couple of hours after a deploy are fine, and the delay we need at startup isn't the issue either.
I don't believe anything has fundamentally changed with Private networking
if it was previously working it could have been good luck maybe
Didn't you guys have an outage with k8s 503s?
@macwilko We've been running this with quite a lot of users for the past 9 months so I think good luck is a stretch
We haven't touched the code there for over a week which is quite suspicious. Only started happening after the 503 issue
Can we escalate this somehow and get a bit better treatment? We're paying quite a bit to railway after all
@thomas - would you know or could you check to see if there were any changes with the private network or any of the surrounding infrastructure?
my testing indicates zero behavior changes, but who knows
Thanks @Brody!
We're moving over to Azure Redis for now
Do you guys use Azure or AWS? We can use either but we'd prefer to be on the same cloud provider
cc @Brody
railway runs on GCP at the moment
do you see similar problems if you connect over the public url?
I am looking at this now. But the next level of support is direct support.
you are using the private url
@thomas is this not the way to get that? or how should we tap into it
Right we're using the REDIS_PRIVATE_URL. My fault
So at the moment, we have 3 tiers of support, community, team, and direct. They correspond with Hobby, Team, and what I would call enterprise. Direct support you get a private channel with the team on slack or discord.
I remember - when we used the public url, the latency went up by 5~10x when we migrated to v2 from v1 in december, which is why I specifically changed to use the private url. Using the private url brought down the latency
@unstoppablerizz Which service is throwing these logs?
And yes using the private URL is recommended for links between all services.
Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc
Basically, after those spikes happen, the connection to redis doesn't seem to work at all
I was trying to look through the logs but I keep deploying over the old instance so they get truncated
that screenshot is from the Redis instance?
No, it's the screenshot from Service ID d89b8d57-7c05-4bab-bf8a-27bc11f78cbc which is our Django server instance
Maybe it’s that we’re exhausting the max simultaneous connection limit of Redis somehow? Do you guys monitor this
i dont think you are opening 10k conns
That's not a bad path to consider. Do all connections drop or just some? You said that all connection attempts to Redis fail in the first part of this.
Considering the replications you have running and the scale of the CPU you are using it's possible
side question, are you deploying your app with nixpacks, or a dockerfile?
I am bringing this up internally. Brody also is one of most helpful community members.
i'm pretty sure this is char's wheelhouse, he would instantly know if there was any changes to the internal dns resolver
@unstoppablerizz Is there a reason all your logs for this service are piped to stderr? the red wall is not what I would consider normal logging practices and honestly sent a shiver down my spine. (this is unrelated to your issue, but I am honestly curious)
Dockerfile
I have been made to understand that is standard for some libs
is it based on an alpine image?
Are you establishing a new Redis connection for each API request or using a pool of connections?
We have datadog so that’s how we set up monitoring / tracing / alerts, we don’t really read the server logs on Railway because it’s not really usable
Let me double check
We’re using a pool
So you shouldn't be running out of connections unless your pool is over 10000?
Yeah, at your scale, we understand that, and this is one of the things that we will be improving.
When we started seeing the issues, we were using the pool + Django’s default cache backend module which I believe does pool connections
Though even if we don’t use a pool, I don’t think our usage is that much to warrant this issue. And besides, our load didn’t drastically change from last week
The most recent failure (one at 1:40am) could be because I explicitly defined connection pool of 2000 which may have exhausted the # of connections, but this wasn’t the case before last night
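for reference, that explicit pool is roughly this shape with redis-py (a sketch; the 2000 cap is the part in question):
```python
import os

import redis

# Sketch of a capped pool. Railway's Redis was said to allow on the order of
# 10k connections, so a 2000-connection pool per replica adds up across replicas.
pool = redis.ConnectionPool.from_url(
    os.environ["REDIS_URL"],
    max_connections=2000,
)
redis_client = redis.Redis(connection_pool=pool)
```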
were you able to check?
We're using
python:3.11
base image. Checking if that's alpine
that's Debian based
there goes that idea lol
Oh I think I saw the spike again a couple tens of minutes ago
Do you have a graph of the frequency of the error from Datadog?
Yeah the tricky thing with this is that it doesn't get traced properly
We unfortunately only have the wall of red
But since it's pretty fresh, you should be able to see the socket errors happen from that point
starting 10:06am (first bump)
If not, then after 10:34am
I have paged infrastructure on-call, and they are looking into this as well now. We are attempting to determine if this is an incident or not.
We're also not logging everything to stderr - we're only logging errors to stderr (have log level set to ERROR)
Understood
But 400s are also "logged" as errors which can be safely ignored most of the time
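our logging setup is roughly this, as a sketch (not the exact settings file):
```python
# settings.py (sketch) - a single ERROR-level handler writing to stderr.
# Anything the app or middleware logs at ERROR - including handled 400s -
# ends up as the red wall on Railway.
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "stderr": {
            # logging.StreamHandler writes to sys.stderr by default
            "class": "logging.StreamHandler",
        },
    },
    "root": {
        "handlers": ["stderr"],
        "level": "ERROR",
    },
}
```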
I see what you mean by intermittent
trying to look at what happens at 10:08:20
Because I think that's when it started again
Just some quick feedback - I don't think our multiline logs are actual individual log entries; it's just the newlines of a single traceback. That's probably what makes me never want to use the logs, because it's impossible to follow a trace
I didn't bother setting a jsonformatter
I do notice something odd - our server seems to restart right before that happens. Not sure why
?: (staticfiles.W004) The directory '/code/static' in the STATICFILES_DIRS setting does not exist.
<- this only happens on server start
Also, pagination on search -> load more is wrong. Likely some issue with cursor-based pagination
at what point in your app's lifecycle is the redis pool created? at start?
Defined in the global scope, yes
ok, that's a good clue
at startup?
Python Redis pool is threadsafe so it should be fine
yeah...perhaps the restarts somehow eat up the connections
The private network takes time to initialize after a server restart.
But the thing is that it should go away then
yeah I don't think that's the issue
my current thinking is that their app is crashing, then re-creating the pool on startup before the network initialization has finished, thus resulting in DNS errors
The thing is our code hadn't changed before this started happening afaik
I'll double check my commits
the spikes are a good sign that the app crashed
That's a counterpoint, but is this load level reaching new heights?
If you search for "System check identified some issues" in the logs, you can see that it's sparingly found (only happens on deploy). Starting Jan 16th, it happens a lot more even when we're not deploying.
No not at all
Barely more, and we've had much more before
Though double-checking
do you have a 3 second sleep before you are creating the redis pool on startup?
We don't, but startup already takes ~10 seconds anyway for Django since we run migrations on server start before we initialize the server, so there's no way that's causing a connection issue
Besides it never was an issue
basically all the random restarts here I think could be the source of the issue. We hadn't done any deploys at that time, so my guess is something about Railway's infra changed at that time
i have historical data and the network initialization times have not changed, still sitting at an average of 1.57 seconds
definitely worth looking into why your app is crashing
It could be something about a k8s setting that you guys are using internally, or how hardware is allocated for the pods
Though I have no idea about the internals
Do you have a timestamp for the time you're referencing?
Exactly around Jan 16, 18:00. Not sure if this is UTC
The only new deployment I see from our side after Jan 13th is on 17th, and I think it was a reaction to the errors we were seeing
We do ask that initial private network connection attempts be retried, since we bring the network up within the first ~100ms: docs
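a minimal retry on the first connection would look something like this with redis-py - just a sketch, the backoff numbers are illustrative rather than anything from the docs:
```python
import time

import redis


def connect_with_retry(url: str, attempts: int = 5, delay: float = 0.1) -> redis.Redis:
    """Retry the initial connection so a private network that isn't ready yet
    doesn't take the worker down on startup."""
    last_error: Exception | None = None
    for attempt in range(attempts):
        try:
            client = redis.Redis.from_url(url)
            client.ping()  # forces a real connection and DNS lookup
            return client
        except (redis.exceptions.ConnectionError, OSError) as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_error
```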
side note, railway doesn’t use k8s for any user workloads, and as far as i know they haven't changed the runtime yet
I am looking at our commit PR log to check
I think if you look back and forth 2~3 hours from there, maybe something can be found? But I'll also continue diagnosing
We recommend you change your restart policy to always restart for your API service
We ran into a strange condition where one of your replicas was in an exited state.
ok changed it
Still looking into other things on our side
just something that was noticed
where did you get this date, and what would your local time be if this isn't UTC
The date is from the Railway log itself
so whatever it is
Ok it seems like it's PST based given what I see in the most recent log matching my timezone
so Jan 16, 18:00 PST
i believe logs should be in UTC on Railway
I meant just from my screenshot
I see this in the most recent screenshot, which means the display is based on my current timezone. So, given that the screenshot I was referring to for when the errors started shows ~Jan 16, 18:00, thomas should look at the commits using PST.
I'm sure on the backend they save it as UTC
Yes, but we do convert it to local in the UI!
So that time is in PST
One of our infra engineers asks if you can modify the code to log the IP of redis.railway.internal when you init a connection, so we can confirm whether it's DNS related or not. If you put this in where the connections are started in the pool, that should avoid flooding the logs.
😓 I've done this before but it was a lot of trial and error - do you happen to have the python code for it
https://www.tutorialspoint.com/python_network_programming/python_dns_look_up.htm - you want to look up the AAAA value
You can share your connection code and Brody or I might be able to help
Just double checking - am I supposed to look up redis.railway.internal:6379?
just the hostname
no, that's an internal log sorry
We're only setting the REDIS_URL, which contains everything
redis://<user>:<password>@<host>:<port>
so if that works sometimes, it should always work
@Brody .query is deprecated
perhaps this works?
Though I'm pretty skeptical that it's going to work on the entire redis connection string
please only look up the host
monorail.proxy.rlwy.net
just this, right?
that's the public host
I know, just the form of it
ah, then yes, that's a hostname
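if it helps, the lookup can be done with just the stdlib, called right where the pool is created (a sketch; the logger name and URL parsing are illustrative):
```python
import logging
import socket
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


def log_redis_aaaa(redis_url: str) -> None:
    """Log the IPv6 address the Redis hostname resolves to at pool-creation time."""
    host = urlparse(redis_url).hostname  # just the host, no credentials or port
    try:
        results = socket.getaddrinfo(host, None, family=socket.AF_INET6)
        addresses = sorted({item[4][0] for item in results})
        logger.error("Resolved %s -> %s", host, addresses)  # ERROR so it shows in our logs
    except socket.gaierror as exc:
        logger.error("DNS lookup for %s failed: %s", host, exc)
```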
@unstoppablerizz can you tag me when this happens again
we have an idea we would like to try but we need it to be happening
sure
I have a larger update coming, we did find something. Just confirming best next steps with the team
@Conductor This user can override rule 5 today and tag me directly.
for complete transparency Thomas, in my testing over a span of 7 days and 600k successful internal DNS lookups, I get about 6 to 7 failed DNS lookups. Though this doesn't explain their issues, and my test doesn't cover all the deployed resolvers, I just thought I'd mention it
To reproduce this you would have to crash your container then use the internal network immediately.
It is likely that your test is so stable that it doesn't reproduce the rare race condition we are looking at
my thoughts exactly, but 6 or 7 errors is something, though that's only, what? 0.0010% error rate I'm seeing
@Brody redis.railway.internal has AAAA record fd12:f325:1eed::32:ffc6:b2dc
that's the correct value
you've made that lookup in code at the exact same time the pool is created right?
Yes, this confirms the DNS is fine and our update is more relevant, we are still working on it sorry
yes
@thomas Would love to hear what the postmortem was just out of curiosity once it's settled
we are hammering it out currently
Though I'm still not sure 100% that there's nothing I can improve from our side - looking to see how we can manage things better
super long shot, in node there's a redis package called ioredis and it has a setting called family where you set it to either use ipv6 or ipv4, maybe your redis client in python has something similar? where you could default it to ipv6
I've had the ipv4 ipv6 issue with datadog before but I don't think that's the root cause here
We have checked the DNS logs for the host that was running the application which exhibited the "Name or service not known" errors and found no issues there. We've asked you to log the DNS resolution in your app to also help confirm whether it is DNS related.
However, given that the issue occurred only after a restart we suspect this might be related to a very rare race condition in private networking that we have recently discovered and you are one of the first customers to also mention it. A patch was in the works already but we have our engineers prioritizing this to get it out right away.
As the patch is being deployed we also have a platform engineer on stand-by throughout this process to troubleshoot more and provide support to myself and you. We have some mitigations you can try for the interim:
We recommend you delay connecting to the private network for 3 seconds after your service starts. The race condition comes from the private networking interfaces being modified by the kernel when they receive traffic before they have finished initializing their routing tables. We'd like this change to be removed afterwards, since it's not ideal to wait 3 seconds for your app to start, but it should mitigate the issue entirely.
If you have an application that gets into this state please let us know as we have some tooling for us to go in and check if the issue is a result of the race condition we suspect.
Just adding that if you don't implement the 3 second delay, that's your call because this fix should be rolling out today. We just wanted to put that there to give you options
Probably not this but I honestly couldn't tell you if it affects this race condition or not.
smol question will this fix mean that I won't need the 3 second delay before connecting to private Redis in future?
I have a Go api and I'd very much like to remove the 3 second delay I added
Wait I didn't read this. You should never have needed that long of a delay. We try to bring the private network up in 100ms.
Generally, we recommend that you just retry your connections for a little while after the server starts.
This race condition is specifically for services that restart under a lot of load then flood the private network, which then breaks in a way that doesn't resolve until the service restarts
Also Brody said this, which means we might be missing our 100ms mark, but still not 3s
It's going to depend on a bunch of things
3 seconds is a good bet (data from 2 days)
brody's figure is more accurate, I wish it was 100ms...
it will be 0, but that's off topic
Just to double check - was it a regression introduced around the range that we found (Jan 16th, 18:00 PST)? Or did we walk into it somehow
@thomas out of curiosity
We are still unsure; it's hard to eliminate any change to the kernel or our host OS in general. But we didn't find a smoking gun
Heya - so the issue here seems to be a very narrow race between when there's a high volume of traffic while we are initialising privnets. There was a gap between two syscalls in that init process where a packet of traffic could get the IPv6 neighbour table into a bad state.
We only spotted this yesterday cos' we built a new chaos monkey system to stress test with frequent deploys and monitor network connectivity. The race has existed since we launched the feature, and no users previously reported it. It's unlikely it was a regression; more probably you ran into it by accident - most likely because the containers crashed and restarted into a high volume of traffic. Redeploys etc. get a gradual traffic ramp, so they're very unlikely to trigger it.
We're working on a privnets v2 - rewrite of the BPF programs underneath, more control over injection, and a bunch of other improvements to the overall runtime. It's been a slow process to rollout because it's complex and has a wide impact, it's not available to users atm - we're slowly landing it in parts - I'll ping back here once that's launched (don't want to commit to timelines until I get it out to Beta), but I'm actively on that 100% of my time.
Solution
@unstoppablerizz
The fix for this race condition has been merged and deployed. This should not happen to you again. If you want to remove the 3 second delay (assuming you added one), you can, but as Brody showed us, that will generally just result in errors. However, if Django already takes long enough to start, you might not need an additional delay.
If this happens again, please reopen this thread.
@thomas The error is actually ongoing right now
since 16:14
what time zone?
Good Morning my time.
I did a quick check of the logs and noticed that it did happen again starting around 1600 and ending around 2000. It doesn't seem to have happened again after that
is my filter not showing this properly or do you observe that as well?
I also would like to note there were some memory spikes the day after, on the 20th, around redeploys, but overall your memory usage has smoothed out. I can't tell if this is from replicas not restarting or if you just experience a smoother load over weekends
We are still interested to hear how this issue might be impacting your services, so please let us know if it is persisting. I know you mentioned the error is still ongoing, but Thomas mentioned he couldn't verify that. We know for sure that the race condition we suspected you were hitting is in fact resolved; now it's just confirming whether that was the issue you ran into.
The issue is still persisting and we ended up just adding a sleep
There's a separate service for which we're experiencing the same issue of the internal host not being found
Yeah we're trying to connect to the private URL of the new redis server for the migration and that's not being found:
2024-01-26 20:16:35 [line-worker] error: Line worker error: Error.getaddrinfo ENOTFOUND redis.railway.internal. Error: getaddrinfo ENOTFOUND redis.railway.internal
cc @Brody @Mig @thomas
do you have a project with that error right now ?
if so I can quickly confirm if it's the issue we expect
yeah i had to roll back though
understandable
Deployment ID: b2888f70-317a-404c-8910-8702eb63bc36
Do you have a delay in the start of your app to avoid the race condition ?
I thought that was solved? Should I still add it back in the code?
Maybe it was a miscommunication
We thought that the 3s delay was no longer necessary
no you understood correctly. We did deploy a patch for that but if you're still experiencing the issue it might be something else or the patch was not right (highly doubt this, we saw the bug, did patch, no longer could reproduce)
I can try to quickly redeploy one of our dummy services for testing
with a 3s delay
if there isn't too much impact by having the 3 second delay I would add it for now to try and mitigate the error.
from my data, a 3 second delay is still necessary as the internal dns server will not respond to any dns lookups until after about 2-3 seconds
ah ack! thanks brody
Brody - the patch prevents unrecoverable state but I believe you're right that DNS could (seems often) fail for the first 3 seconds.
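one alternative to a blind sleep - just a sketch, not something Railway prescribes - is to poll until the resolver answers before creating the client:
```python
import socket
import time


def wait_for_dns(host: str, timeout: float = 10.0, interval: float = 0.25) -> None:
    """Block until the internal resolver answers for `host`, instead of a fixed sleep."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            socket.getaddrinfo(host, None, family=socket.AF_INET6)
            return
        except socket.gaierror:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# e.g. wait_for_dns("redis.railway.internal") before constructing the Redis client
```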
but actually
let me check this
our worker will attempt to reconnect forever
even if it fails at first shouldn't it eventually connect
it was failing for like a minute
over 7 days the average internal dns resolver ready time is 1.56s with some very intermittent spikes to 5+ seconds
but if it's ready after 2s, shouldn't the reconnection work after 2s
or 5s
yeah, 100% it should
is this because it's using the original
new Redis(REDIS_URL)
client?
if it does not, that's a significant bug. The 3 second startup is a caveat that can be worked around, but DNS resolution failing after that is not acceptable.
i'm wondering if it's a detail of how ioredis on node handles this
oh wait, you're using ioredis? I thought this was python
if the retry mechanism re-uses the broken config I could see how future attempts just continue to error
ya @Brody
should i specify ?family=6?
do you have family set to 6?
?
ahhhhh i see
absolutely
otherwise ioredis will only use ipv4
i see
this explains everything
i'm sorry
then
that's not your fault
it's ioredis
I could have sworn this was a python app we were talking about so ioredis didn't cross my mind
haha that is our other service
my bad
is that service still experiencing problems?
checkout the caveats on private networking https://docs.railway.app/reference/private-networking#caveats
setting ioredis to use ipv6 is not obvious from our caveats page, but it might be helpful to know all the known issues
@Brody i think we had to add the sleep
for the rizz service
@Mig thank you!
Melissa has added a thing about ioredis elsewhere in the docs
ohhh you're rizzgpt. I saw that project and thought it was an interesting name. :p
it's a very cool product too!
haha thank you!
the node issue is for heycami.com
which is even higher usage
try the ipv6 setting to ensure the correct dns resolution is used and report back
on it!
I'm really trying to fix this issue for y'all
Clarification: the patch we deployed prevents an unrecoverable state for the private networking routes. There is still currently a ~3 second startup delay where DNS resolution may fail. If the 3 second delay is not too impactful, I'd recommend leaving it in. We are working on removing that startup delay this quarter because it is definitely unreasonable to make you work around it. Just so I understand the impact on you, has this 3 second delay mitigated the issue?
DNS resolution should work after the first few seconds.
For rizzgpt, yes. We'll see if the ipv6 and the delay fixes things for cami as well
the 3 second delay won't have any impact as long as you are using a readiness healthcheck!
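for the Django service, a readiness view that pings Redis is enough for the healthcheck to gate traffic - a sketch, the import path and URL wiring are hypothetical:
```python
# health.py (sketch) - point Railway's healthcheck path at this view so a
# replica only receives traffic once it can actually reach Redis.
from django.http import HttpResponse, JsonResponse

from myapp.redis_client import redis_client  # hypothetical module exposing the pool


def readiness(request):
    try:
        redis_client.ping()
    except Exception as exc:  # report any failure as not-ready
        return JsonResponse({"status": "unavailable", "error": str(exc)}, status=503)
    return HttpResponse("ok")
```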
ok ty
@ me if y'all want. I'm going back to working on our new edge proxy 🙂
using family 6 did the job!
(forgot to mention earlier)
awesome
thanks Brody 🐐