Should we expect an increase in bandwidth usage with V2 runtime?
Recently (amongst many other changes..!) swapped our API gateway to V2. And we saw a large increase in bandwidth. Is this expected? Swapping back to legacy "fixed" it. I can investigate further and try to produce a "clean" repro (turn on V2, redeploy, run for a few hours, and revert) if this is not expected behaviour. My other services showed similar bandwidth increases, however the data there is much noisier due to all the other changes I was making at the time.
Project ID
4c3b4b0e-006a-407e-90c7-9c3031cd622f
The image shows the window of time we had V2 runtime enabled
I would very much appreciate if you could come up with a clean way to reproduce this
im running a test today, will get back with result. i've just swapped our gateway over to v2 again. ill run it for a bit and show a comparison
im not too sure if an in-use app is a very good reproducible example
i dont have time to put in any more effort, unfortunately. I am the only programmer at my studio and i am spread too thin. thats why we use platforms like railway!
so ALL I changed was the runtime to V2 on our app. nothing else was changed. I clicked V2, got prompted to deploy the changes, I hit ok and thats all
our bandwidth use just doubles when we enable v2 runtime, along with the estimated bill for the month etc
i am now reverting the change ($$$$) and can report back if and when the bandwidth drops back in line
project id is
4c3b4b0e-006a-407e-90c7-9c3031cd622f
and the service in question is 3545427b-d98c-42ec-b5ac-f9cc4326e3c4
if any railway dev wants to poke around and investigate
i guess its more than double..! almost triple
i created an example project with 3 services: 2 services that download a file on a loop with a fixed download size and download speed, and another service that serves the file. one of the download services used the legacy runtime, and the other used the v2 runtime.
i am unable to reproduce, in fact the v2 runtime uses a tiny bit less network
thanks for trying to reproduce it!
ignore the large bumps, i was dialing in the settings as to not rack up a massive bill
i dont know what it might be. but from what i understand legacy will eventually be disabled and we will be pushed onto v2. and v2 is supposed to 'just work' with no changes right? Its not something we need to concern ourselves with?
nice yeah, do you have any idea what it might be?
perhaps regions are involved? we host on US East
i dont think its just the v2 runtime, with your app there are many other factors at play
maybe private traffic is being counted incorrectly in v2 runtime when your region is not the default
though again, with this one change, it will more than triple our bandwidth bill, and if its something thats just supposed to work then i think its something that railway may want to investigate before pushing it to their users?
private traffic shouldnt be counted at all, regardless of region
i know
but good idea
well v2 is the default for all new services
yeah I noticed - i recently split the responsibilities of a DIFFERENT service (in the same project) in 2. the service was doing 2 jobs at once, essentially. A rest API and a socketio/realtime comms service (chat, etc). I basically added a switch to make the service act as one or the other, because i wanted to get a good idea how much of our bandwidth bill was coming from the socketio/realtime stuff vs rest api external database queries
so anyway, that service was using LEGACY (its been around for a while)
i split it into two, made the existing service into Rest API only, and the NEW service I made into the socketio/realtime service..
the NEW service was automatically v2 runtime
bandwidth usage was HUGE
again, like a 3x jump in normal usage
i eventually figured out the v2 switch was '''''to blame'''''
set it to Legacy
and now old service + new service bandwidth = old combined service bandwidth, as expected
didnt you say that websocket connections failed on the v2 runtime, or was that the edge proxy?
no the websocket connections failed in 'edge proxy'
(Though I may have messed up my words, sorry - i was knee deep in a bunch of problems when i was debugging all that, as you can tell)
i ran my test with the edge proxy on, im going to disable that and try again
the gateway has edge proxy enabled! (the one from the test today)
my current config is:
Gateway: Edge Proxy ON, Runtime: Legacy
Rest Api: Edge Proxy ON, Runtime: Legacy
SocketIO: Edge Proxy OFF, Runtime: Legacy
what service are these graphs from?
the graphs are from API Gateway
here is the moments before i SPLIT my restapi+socket io service into TWO the other day
the purple lines are Socketio/rest api services (you can see where I split it into two (two purple lines) and enabled v2 runtime)
did you ever get any errors from the socketio service when you said the edge proxy wasnt working for you?
and the YELLOW is my api gateway where i ALSO enabled v2 runtime at the same time
then swiftly reverted v2 -> legacy, and you can see my traffic back to expected levels -- where API gateway (yellow) looks usual, and the two purple lines 'add up' to approx what the traffic was before the split
i have not investigated that yet, i am not sure when i will have a chance to at the moment, i will open a separate help thread for that if i can confirm that Edge Proxy -> ON just breaks my Socket IO functionality
just trying to think of what else is 'strange' about my setup, but, mm, the region being different is the only thing i can think of that isn't "stock standard". the nodejs app is just a nestjs app. especially the api gateway one is VERY straightforward and simple. it just proxies requests to one of 3 (dev/stg/prd) servers (using internal url) based on some headers in the request. it exposes a health check endpoint. it also exposes an endpoint to query info about the 3 servers. finally, it has a redis client connection to receive updates about changes to those 3 servers (rare occurrence. once a week or two when I push out an update to the game)
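For anyone wanting to cross-check the platform's bandwidth graph against what the app itself sends, here is a rough sketch of a header-routed gateway that counts proxied response bytes. It is written in Go rather than NestJS, and it is not the actual gateway from this thread: the `X-Env` header name, the `gateway` type, and `runDemo` are all made up for illustration, and the counter only covers response bodies, not headers or TLS overhead.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// countingWriter tallies response body bytes written through it.
type countingWriter struct {
	http.ResponseWriter
	n *int64
}

func (c countingWriter) Write(p []byte) (int, error) {
	n, err := c.ResponseWriter.Write(p)
	atomic.AddInt64(c.n, int64(n))
	return n, err
}

// gateway picks an upstream from the (hypothetical) X-Env request header.
type gateway struct {
	proxies      map[string]*httputil.ReverseProxy
	bodyBytesOut int64 // proxied response body bytes, updated atomically
}

func newGateway(upstreams map[string]string) (*gateway, error) {
	g := &gateway{proxies: map[string]*httputil.ReverseProxy{}}
	for env, raw := range upstreams {
		u, err := url.Parse(raw)
		if err != nil {
			return nil, err
		}
		g.proxies[env] = httputil.NewSingleHostReverseProxy(u)
	}
	return g, nil
}

func (g *gateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	p, ok := g.proxies[r.Header.Get("X-Env")]
	if !ok {
		http.Error(w, "unknown environment", http.StatusBadRequest)
		return
	}
	p.ServeHTTP(countingWriter{w, &g.bodyBytesOut}, r)
}

// runDemo wires a fake "prd" upstream behind the gateway on localhost and
// returns the bytes the gateway counted and the bytes the client received.
func runDemo(payload string) (counted int64, received int, err error) {
	upstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, payload)
	}))
	defer upstream.Close()

	g, err := newGateway(map[string]string{"prd": upstream.URL})
	if err != nil {
		return 0, 0, err
	}
	front := httptest.NewServer(g)
	defer front.Close() // Close is safe to call more than once

	req, err := http.NewRequest(http.MethodGet, front.URL, nil)
	if err != nil {
		return 0, 0, err
	}
	req.Header.Set("X-Env", "prd")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, 0, err
	}
	front.Close() // wait for the handler to finish before reading the counter
	return atomic.LoadInt64(&g.bodyBytesOut), len(body), nil
}

func main() {
	counted, received, err := runDemo("hello from prd")
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("gateway counted:", counted, "client received:", received)
}
```

If an app-level counter like this stays flat between legacy and v2 while the platform graph triples, that would point at how traffic is metered rather than at the app sending more data.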
API gateway, just now, 20 mins after reverting back to Legacy:
and just to be clear, the service works just fine on the v2 runtime right?
yeah!
OH
also
just remembered we also have an OTEL collector that my server is reporting its data to (again, internal)
so the start command for the api gateway is actually
node --require '@opentelemetry/auto-instrumentations-node/register' dist/apps/ssr-api-gateway/main
to do all the opentel auto instrument stuff
i can test with it disabled + v2 maybe
are you sure there arent errors anywhere, and something is going into a retry loop and bloating the bandwidth?
as a side note, for the window of time we were running on V2 runtime, the API gateway's response time was great and so stable
possible? I can never discount it i guess? but it would have to be in response to a client request. no one else pokes this server. just requests from clients in the game. and the request is then proxied and the response sent back
but logs are clean, and I DO get errors when proxies fail in other cases
then that rules that out
just double checking logs
yeah nothing suspicious
no errors in the last 5 hours, except for when i reverted back to legacy and redeployed
ill try v2 runtime w/o otel instrumentation, just in case
:3HC_Shrug:
hypothetically, what would happen if railway isnt able to determine the cause of your increased network?
(its still the weekend, i cant bring anyone in yet anyway)
i would stick to legacy, and if legacy is going to be removed assuming the bandwidth costs dont come down by that time then we will have to leave! I know bare metal is around the corner though and I am in talks with some lovely folk at RW about trying it out for some of our bandwidth heavy services. (they've been lovely to deal with) we also have some major bandwidth optimisations coming soon so that will help bring the cost down too!
but to give you an idea, if our bandwidth use just tripled then we would be paying about 1500 USD for bandwidth which is, ugh, a LOT for us. It would be worth investing effort to port somewhere else with cheaper costs at that point.
1500usd is nothing
haha what are YOU doing!
tests to try and reproduce your issue
ah, network test
oh boy, i hope railway slashes that bill for you for helping people out
conductors get a 100% off coupon
oh well choo choo
choo choo indeed
i will be bringing in char (who i think has the most to do with the v2 runtime) as soon as i feel he's available
no dice with disabling otel, still high bandwidth
sweet thanks, but no rush since we can just stick to legacy for now!
how curious hey
by the way, if i wanted to put together as minimal of a repo as i could, whats an easy way to load test?
spin up two services and get them to talk to each other via public url. my repro service will be a stripped down nestjs rest api that """proxies""" messages
i just have this
the service in the middle serves an infinite file (in the sense that the response is null bytes) and the downloader services request a 1gb file from it on a loop and download it at a fixed 5MB/s
super controlled environment with no variables other than v2 or legacy
did you just whip up the code for that downloader service yourself?
yeah
sweet, cool
all go services
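The actual test code was not shared in the thread, so the following is only a guess at what such a setup could look like in Go: one handler that streams null bytes, and one downloader that reads at a capped rate. The `?size=` query parameter and all function names are invented for this sketch, and the demo in `main` runs both ends on localhost with a small file instead of 1 GB at 5 MB/s.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strconv"
	"time"
)

// zeroReader is an endless source of null bytes (the "infinite file").
type zeroReader struct{}

func (zeroReader) Read(p []byte) (int, error) {
	for i := range p {
		p[i] = 0
	}
	return len(p), nil
}

// serveZeros streams exactly ?size= null bytes (hypothetical parameter).
func serveZeros(w http.ResponseWriter, r *http.Request) {
	size, err := strconv.ParseInt(r.URL.Query().Get("size"), 10, 64)
	if err != nil || size < 0 {
		http.Error(w, "bad size", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Length", strconv.FormatInt(size, 10))
	_, _ = io.CopyN(w, zeroReader{}, size) // an error here means the client went away
}

// throttledDownload reads the response body in small chunks and sleeps
// between chunks so the average rate stays at or below bytesPerSec.
// It returns the number of body bytes read.
func throttledDownload(url string, bytesPerSec int64) (int64, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	chunk := bytesPerSec / 10 // roughly a tenth of a second of data per read
	if chunk < 1 {
		chunk = 1
	}
	var total int64
	start := time.Now()
	for {
		n, err := io.CopyN(io.Discard, resp.Body, chunk)
		total += n
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
		// Sleep until the elapsed time matches the target rate for `total` bytes.
		want := time.Duration(float64(total) / float64(bytesPerSec) * float64(time.Second))
		if d := want - time.Since(start); d > 0 {
			time.Sleep(d)
		}
	}
}

// runLocalTrial serves and downloads `size` bytes on localhost.
func runLocalTrial(size, bytesPerSec int64) (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(serveZeros))
	defer srv.Close()
	return throttledDownload(fmt.Sprintf("%s/?size=%d", srv.URL, size), bytesPerSec)
}

func main() {
	// 1 MiB at 8 MiB/s keeps the demo short.
	n, err := runLocalTrial(1<<20, 8<<20)
	if err != nil {
		fmt.Println("trial failed:", err)
		return
	}
	fmt.Println("downloaded bytes:", n)
}
```

For a real comparison the server and the two downloaders would be deployed as separate services, with the downloaders calling `throttledDownload` in a loop against the server's public URL, one on legacy and one on v2.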
nice, i want to learn go. it seems nice, light, powerful
indeed it is
ill try to talk to char about this when hes in tmr
Which one? A through to Z. You've got a lot to pick from?
I'll see myself out
"this" being the topic of the title for this thread
I just meant because you were going to speak to "char"... You know what never mind. It was a terrible joke haha
ohhhhh I see what you mean