Railway • 7mo ago
sina

Service Returns Sporadic 503s, due to DDOS protection?

My API service periodically returns the Railway error HTML page with a 503 status code. I believe this issue is on Railway's side, as I don't see it crop up locally in similar testing scenarios. In my case I'm testing how frequently I can call my API; are there inherent rate limits that the Railway platform enforces? I poked around in the docs and couldn't find anything related to this. I'm testing with roughly hundreds of requests per second.
76 Replies
Percy • 7mo ago
Project ID: 21b7eb49-e84c-40c9-9278-73e17ff7b318
sina (OP) • 7mo ago
21b7eb49-e84c-40c9-9278-73e17ff7b318
Solution
Brody • 7mo ago
For context, railway does ddos protection for themselves, not for an individual user's sake. a 503 simply means your app failed to respond properly to the request.
sina (OP) • 7mo ago
Yup, that's what I expect. The only issue is I'm not seeing any indication of an issue on my server :thinkies: It's a pretty standard express server, it's doing some stuff but I'd expect those things to have some indication when they fail. My suspicion is the request isn't even getting to my server.
Brody • 7mo ago
can you reproduce this with something better than express?
sina (OP) • 7mo ago
Debugging a prod issue so a smaller repro may be on the backburner. Understand if this means this help thread is lower prio 🫡
Brody • 7mo ago
ticket?
sina (OP) • 7mo ago
Meant this thread 🫡
Brody • 7mo ago
there's unfortunately not much I can do for you here, as hobby users are only eligible for community support
sina (OP) • 7mo ago
Gm, sorry to revive an old thread but we're still seeing this issue, across multiple services we run on railway. Tl;dr: very sporadic 503s. They only seem to repro under load, so it's hard to get a consistent repro. But we've received enough reports from different external parties across many of our services to suspect it's related to railway, or that we're doing something else wrong. I'm seeing this issue across apps that are running on both express and next.js.
I did some digging in past help threads and it looks like there are a decent number of threads with this type of issue:
- https://discord.com/channels/713503345364697088/1148157220777885757
- https://help.railway.app/questions/occasionally-getting-503-responses-expr-57b386b5
- https://help.railway.app/questions/deployed-app-responds-randomly-with-503s-07e53c87
In one of the threads there's strong confidence that the 502 at least seems to be a payload that something on railway's side is returning, and in the last one it's suggested that users run their own proxy next to externally exposed services. Given that I'm not seeing any error logs across my services, I wanted to see if there are any other suggested workarounds?
sina (OP) • 7mo ago
Also, I'm on the "Pro Plan", I just have my discord linked to my personal railway account vs my "work" one. Happy to redo that link if it's useful. 😄 Appreciate any advice on this nefarious issue :salute:
[Attached image, no description]
Brody • 7mo ago
do you currently utilize cloudflare?
sina (OP) • 7mo ago
Yes
Brody • 7mo ago
have you tried switching to cloudflared?
sina (OP) • 7mo ago
> have you tried switching to cloudflared?
No I haven't, from a quick look it seems relevant, thanks for the pointer. Have you used it in the past with railway with success?
I made a dead-simple hono web server and am observing the same behavior. If I ramp up my "load test" to ~200 reqs per second, I get pretty consistent failures after not too long, with no indication of errors on the server side. One hypothesis I have is that there's an issue with the server scaling up quickly to handle load; the metrics on my server show it barely raising its consumption, despite the start of the 503 errors. I'd expect the memory/cpu to creep up and hit some limit before the 503s creep in.
Brody • 7mo ago
if i gave you an endpoint to test, would you be able to test it?
sina (OP) • 7mo ago
My test script is very simple:
const honoUrl = "https://hono-test.up.railway.app/";

// ~200 reqs per second (100 per batch, one batch every 500 ms, ignoring request time).
const concurrency = 100;
const delayMs = 500;

let errors = 0;
let numReqs = 0;

const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
const makeTestRequests = async () => {
  try {
    const response = await fetch(honoUrl);
    // Parsing throws, and is counted as an error, when the body is not JSON (e.g. an HTML error page).
    await response.json();
  } catch {
    errors++;
  }
  numReqs++;
};

while (true) {
  const start = Date.now();
  // Make a bunch of requests concurrently.
  await Promise.all(
    Array.from({ length: concurrency }, () => makeTestRequests()),
  );
  const now = Date.now();
  const elapsed = now - start;
  console.table({ errors, numReqs, elapsed, now });
  // Wait before going again.
  await delay(delayMs);
}

export type {};
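One limitation of the script above is that it lumps every failure into a single `errors` counter, so a 503 from the proxy, a 429 from a rate limiter, and a dropped connection all look the same. A minimal sketch of tallying outcomes per status code instead, assuming a runtime with a global `fetch` (Node 18+ or Bun); `record`, `classify`, and `probe` are hypothetical helper names, not part of the original script:

```typescript
// Hypothetical helpers (not from the original script): count outcomes per HTTP status.
type Tally = Map<string, number>;

// Increment the counter stored under `key`.
const record = (tally: Tally, key: string): void => {
  tally.set(key, (tally.get(key) ?? 0) + 1);
};

// Collapse every 2xx response into "ok"; keep any other status code as its own bucket.
const classify = (status: number): string =>
  status >= 200 && status < 300 ? "ok" : String(status);

// Make one request and record how it ended.
const probe = async (url: string, tally: Tally): Promise<void> => {
  try {
    const response = await fetch(url);
    await response.text(); // Read the body fully before counting the request.
    record(tally, classify(response.status));
  } catch {
    // fetch rejects on network-level failures, not on HTTP error statuses.
    record(tally, "network-error");
  }
};
```

Swapping `makeTestRequests()` for `probe(honoUrl, tally)` and printing `Object.fromEntries(tally)` after each batch would show whether the failures are 503s, 429s, or connection errors.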
Brody • 7mo ago
> Have you used it in the past with railway with success?
yep! set it up for a user a while ago and they have yet to tell me about any issues, but it wasn't like they were having issues with railway to begin with
sina (OP) • 7mo ago
My hono server is dead-simple too:
import { Hono } from "hono";

const app = new Hono();

app.get("/", (c) => {
  return c.json({ now: Date.now() });
});

export default {
  fetch: app.fetch,
  port: process.env.PORT || 3000,
};
Brody • 7mo ago
okay, one minute, i'll have an endpoint for you to test
sina (OP) • 7mo ago
I can also try standing up the cloudflared if you have a quick tldr guide 👀
Brody • 7mo ago
i don't, but i can walk you through it after my dinner
https://utilities.up.railway.app/now
sina (OP) • 7mo ago
Just stepped afk, but assuming this will be up later, I'll give it a test tonight :salute:
Brody • 7mo ago
it wasn't up when I sent it?
sina (OP) • 7mo ago
It was up, I just stepped afk briefly
I'm running my script against it now, I get a pretty high number of failures :thinkies:
Brody • 7mo ago
oh i'm very dumb
sina (OP) • 7mo ago
For reference, here's my output for my endpoint (still higher error rate than I'd want)
Brody • 7mo ago
there's a rate limiter
sina (OP) • 7mo ago
Oh lol, lmk if you can disable and I can try again
Btw one Q for your cloudflared suggestion, would this mean I have to pay for egress to both railway and cloudflare? I haven't used cloudflare that deeply aside from just managing my domain/dns
Brody • 7mo ago
you would have to pay egress for the cloudflared service you deploy to railway, and i don't think cloudflare charges for egress?
it's a rate limiter for every endpoint in the app, so i'll have to do research on how to disable it for a single endpoint
figured it out, building now
i guess i'm not building
okay changes are live now
feel free to run the test again
how did the test go?
@sina sorry for the ping, but the new edge proxy just went into beta and i'd like to see if enabling that magically fixes the errors you see when doing your test
it will take railway's old and slow envoy proxy out of the mix and replace it with a fully custom in house built proxy
sina (OP) • 7mo ago
Amazing, sorry ser I'm still tracking this, just afk for the weekend. Took a peep at my settings though and it doesn't let me switch the toggle on?
Brody • 7mo ago
haha it's a little unresponsive, i had to hover it for several seconds before it let me interact
sina (OP) • 7mo ago
Nvm the toggle is bugged but looks like it worked!
Yup
Switched it on, will monitor. Was planning to run more tests next week
Appreciate you ser
Brody • 7mo ago
you did run more tests on my endpoint after i removed the rate limit on the now route, did you get any errors?
sina (OP) • 7mo ago
No errors from yours but only did limited testing, want to do some more.
Played with the new edge proxy and it seems my local testing quickly hits a 429: Too many requests error! Curious if this is documented anywhere so I can understand how it works (guessing an IP-based sanity rate limit, would be good to understand though)
Brody • 7mo ago
okay so it's not strictly an issue with railway
since it's in beta, there's no docs around it yet, i will tag in mig (the person who spent a year developing it) here monday for extra insight
sina (OP) • 7mo ago
I have a vanilla express service that also returns just the simple date and it was having pretty frequent 503 errors crop up. I wanted to do more testing before drawing any conclusions 🫡
Glad to get to beta test it, I'm very optimistic
Brody • 7mo ago
how many req/s did you say your test does?
sina (OP) • 7mo ago
This config errors out pretty quickly. It's not perfect reqs per sec because I'm using my own hacky script, but basically I fire off 100 requests and after they finish wait an additional 200ms before going again. So not accounting for the request time, this is ~500 req/sec
Was considering spending more time min/maxing it, but may wait until Monday if there's a chance we can just get shortcut'd to the right answer via the inside word
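The ~500 req/sec figure is an upper bound: each cycle sends one batch and then sleeps, so the real rate also depends on how long the batch itself takes to finish. A rough sketch of that arithmetic; `effectiveRate` is a hypothetical helper, and the 300 ms batch latency is an assumed number for illustration, not a measurement from this thread:

```typescript
// Hypothetical helper: requests per second for a "fire a batch, wait, repeat" loop.
// batchLatencyMs is how long the slowest request in a batch takes to complete.
const effectiveRate = (
  batchSize: number,
  delayMs: number,
  batchLatencyMs: number,
): number => (batchSize * 1000) / (delayMs + batchLatencyMs);

// Upper bound from the message above: 100 requests per batch, 200 ms pause.
console.log(effectiveRate(100, 200, 0)); // 500

// With an assumed 300 ms per batch, the same config only reaches 200 req/s.
console.log(effectiveRate(100, 200, 300)); // 200
```

Logging the `elapsed` value the test script already computes and feeding it in as `batchLatencyMs` would give the rate actually achieved.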
Brody • 7mo ago
500 req/s is a lot, but it's a bit too soon for railway to be returning 429 imo
will see what mig has to say
sina (OP) • 7mo ago
I just tried again at 100 req/sec and errored quickly again
Yup, I'm not too concerned :salute:
Brody • 7mo ago
okay we'll come back to this monday
tuesday (monday was a holiday)
sina (OP) • 7mo ago
Gmgm, wanted to ping this thread, eager to hear more about the new edge proxy
Brody • 7mo ago
new load balancer? the changelog only mentioned that as a down the road possibility, nothing has been implemented yet
sina (OP) • 7mo ago
I corrected my message to edge proxy*, sorry for the mix up, early morning 😆
Brody • 7mo ago
ah gotcha
I have already told mig about it, and linked this thread to him, waiting to hear back
sina (OP) • 7mo ago
Appreciate it ser
Hitting 429 rate limit responses @ <20 reqs/sec as well (which seems quite low...)
Brody • 7mo ago
still waiting to hear back from mig
hey @pro sorry for the delay, some changes were made and you should now be able to do a max of 1k RPS on the new proxy
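For clients that may still run into a 429 like the ones discussed above, a common generic pattern is to retry with capped exponential backoff. This is a sketch of that general technique, not Railway guidance; `backoffMs` and `fetchWithRetry` are hypothetical names, and the base delay, cap, and attempt count are arbitrary assumptions:

```typescript
// Hypothetical helper: delay before retry number `attempt` (0-based),
// doubling each time and never exceeding capMs.
const backoffMs = (attempt: number, baseMs = 100, capMs = 5000): number =>
  Math.min(capMs, baseMs * 2 ** attempt);

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry a GET while the server answers 429, making at most maxAttempts requests.
// The last response is returned even if it is still a 429.
const fetchWithRetry = async (url: string, maxAttempts = 5): Promise<Response> => {
  let response = await fetch(url);
  for (let attempt = 0; response.status === 429 && attempt < maxAttempts - 1; attempt++) {
    await response.text(); // Discard the rate-limited body before retrying.
    await sleep(backoffMs(attempt));
    response = await fetch(url);
  }
  return response;
};
```

A load test would normally not retry, since the point is to observe the raw error rate; this shape is more relevant to production callers.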
sina (OP) • 6mo ago
Amazing, will continue my testing then, thanks
Had the following occur today:
- service has two URLs assigned to it: one up.railway.app one and one via cloudflare
- it redeployed via github push
- newly deployed service didn't serve on the cloudflare URL, but showed up fine on the up.railway.app one
- attempted to clear cloudflare cache and redeploy; finally seems to be fixed after I disabled the new edge proxy
Going to play with flipping the edge proxy back on tomorrow, but wanted to flag. Here's a failed request ID for one that was failing to load via the cloudflare URL while the up.railway URL was working: GAhb00FYS6SLQefqUm3Aeg_1861343781
Brody • 6mo ago
what was the reason for the failure?
sina (OP) • 6mo ago
No clue, that's why I pinged the thread 😅
Brody • 6mo ago
the page would have told you
sina (OP) • 6mo ago
Played with re-enabling the new edge proxy and looks like it makes the services return 404, so seems like that suddenly broke for some of my services
sina (OP) • 6mo ago
[Attached image, no description]
sina (OP) • 6mo ago
Disable edge proxy -> domain loads fine (after waiting a few min)
Brody • 6mo ago
please be more specific when you say "didn't serve"
sina (OP) • 6mo ago
The above 404 screen is served^, and I don't see any corresponding error log or crash on my service. Let me know if I can be more specific
After disabling the edge proxy, looks like the up.railway.app domain also continues to serve via the edge proxy (can tell because it's showing me the edge proxy's 404 page, the old proxy's is different). The custom non-railway domain works fine and serves my normal app though.
Brody • 6mo ago
DNS would take a bit to update, I'll see if I can reproduce this
sina (OP) • 6mo ago
I'm not seeing this issue for other services, but also now I'm a bit worried to redeploy and play with it in my other services 😅 Let me know if I can provide project id or anything for help debugging, seems like an edge proxy thing to me.
sina (OP) • 6mo ago
Fwiw, here's the settings; the railway.app URL 404s (showing the edge proxy's error screen) and the cloudflare one works fine. This is a dev instance so happy to flip the settings around if it's helpful.
[Attached image, no description]
Brody • 6mo ago
sorry but this thread is only going to be for issues relating to the edge proxy but you do not currently have it enabled
sina (OP) • 6mo ago
My apologies if I'm being confusing, but I'm trying to describe general strange behavior with the edge proxy. In the case of the screenshot, I'm still getting the edge proxy's 404 page despite disabling it ~20 minutes ago. But enabling it also breaks my custom domain from working, I think that's the primary issue. So when I enable it, the railway.app domain works but the cloudflare one breaks. Maybe I just need to give more time for ~dns things to happen; will enable it and check back in the morning.
Brody • 6mo ago
please keep the edge proxy on
sina (OP) • 6mo ago
Err, I have it off for essential services since they aren't working with it on 😅
Btw check-in for this morning, the dev service that I flipped it on for is still in this state
Brody • 6mo ago
please be more specific when you say "aren't working"
sina (OP) • 6mo ago
Hey, I reply-messaged the earlier context that should specify. If there is something specific that I'm not providing that would be useful please let me know, but asking me to re-supply the same context over and over again makes the thread pretty noisy. It's the same error screen I've been mentioning since my first ping yesterday
[Attached image, no description]
sina (OP) • 6mo ago
This is with the edge proxy ON since last night. Again, the railway.app URL works, the custom cloudflare one does not.
[Attached image, no description]
sina (OP) • 6mo ago
I have a duplicate of the same service which has its custom domain working with the edge proxy disabled, so between that and the 404 page being the edge proxy's (again, the screenshot of the not found page!!) I'm assuming the issue has to do with the edge proxy
Brody • 6mo ago
please please don't worry about making the thread noisy, adding a tad bit of context in place of "won't serve" or "doesn't work" would be extremely helpful so that I know if you are referring to a previous issue or a new issue
sina (OP) • 6mo ago
Ok, well please see the above for hopefully a clear issue I'm experiencing with the new edge proxy.
Brody • 6mo ago
will try to reproduce
sina (OP) • 6mo ago
I'm seeing services in another project with the edge proxy on that still have their custom domain working, so the issue may be flaky
I just sanity-check flipped the edge proxy off, confirmed that the custom domain now works, then flipped it back on, and confirmed it broke again after ~a minute.
Is there not an easy way to debug with the request ID? I thought that was one of the benefits of the new edge proxy 🤔
Brody • 6mo ago
we would need to get a team member involved, and i will, i would just like to repro first
cc @Mig can't reproduce but likely worth looking into
Mig • 6mo ago
hey all, thanks very much for the help in improving the new proxy. I'll see what I can find given the request IDs.
as of right now I could curl both domains and I get a 200. I know the new proxy is used because the server: railway-edge header is present for the up.railway.app domain. Cloudflare removes this.
the logs for the request id just confirm what we see in the HTML response. The domain given does not have an application. It is very weird for the up domain to work while the custom domain behind cloudflare does not. Both domains are working right now. This may be related to something happening after a deploy, because maybe you re-deployed and that fixed it? or did just toggling the proxy fix it?
Can you @ me if/when this comes up again and I'll look at it right away. If I can see the app in this 404 state I can query the system more to see which part thinks there is no app. the same systems power the up and custom domain routes so it's weird that one would work and not the other unless Cloudflare is sending unexpected information.
toggling the proxy should only take a few minutes (the DNS ttl is 1 minute).
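The header check Mig describes can also be scripted. A minimal sketch, assuming a runtime with global `fetch` and `Headers` (Node 18+ or Bun); `servedByEdgeProxy` and `checkProxy` are hypothetical names, and the only detail taken from the thread is the `server: railway-edge` header value:

```typescript
// Hypothetical helper: true when the response headers carry `server: railway-edge`.
const servedByEdgeProxy = (headers: Headers): boolean =>
  headers.get("server")?.toLowerCase() === "railway-edge";

// Send a HEAD request to a caller-supplied URL and report whether the header is present.
const checkProxy = async (url: string): Promise<boolean> => {
  const response = await fetch(url, { method: "HEAD" });
  return servedByEdgeProxy(response.headers);
};
```

Because Cloudflare removes this header, a `false` result on a domain proxied through Cloudflare does not say which proxy handled the request; the check is only meaningful against the up.railway.app domain.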
sina (OP) • 6mo ago
I can confirm that it's working for me now too. I've also enabled it for my prod service with success. Not sure what fixed it between playing with the settings, redeploying, and "just waiting longer", though as you can see above it was definitely broken a bunch earlier. Will definitely hard-ping you in this thread if I see any issues crop up again. Really appreciate the attention and support from both of you, big thanks!
Mig • 6mo ago
The redeploy fixed it for sure. There are some really rare (1%) bugs that happen after a deploy. Also, really old (before August 2023) deploys need to be backfilled in our network router. I'm doing this now.