Railway•2y ago

urgent: horizontal scaling limited at 5 replicas

Hi! We're on the teams plan. We're trying to scale as fast as we can, but the scaling is limited at 5 replicas. is there anything we can do to get past that?

70 Replies

Percy•2y ago

Project ID: 120b5ec5-59d8-4087-84ae-4e0b3d934aa7

theodorOP•2y ago

120b5ec5-59d8-4087-84ae-4e0b3d934aa7

Brody•2y ago

now hold on, you have access to 32 vCPU and 32 GB of ram, and you are still hitting those limits?

theodorOP•2y ago

yes

Brody•2y ago

do you know how expensive the bill is gonna be? I mean absolutely no offense at all when I say this, but I think you may be running into inefficiencies in your code and are trying to throw compute at the problem

theodorOP•2y ago

do you work for railway? i don'

Brody•2y ago

I do not

theodorOP•2y ago

then that's not useful

Brody•2y ago

okay okay fair

theodorOP•2y ago

we can add multi threading later but rn i can't exactly rewrite shit

Brody•2y ago

@Angelo - pulling you in

theodorOP•2y ago

thank you

angelo•2y ago

Indeed you can! Real quick- why? What are you hosting?

theodorOP•2y ago

rizzgpt.app

angelo•2y ago

OH SICK congrats ok- jumping on and scaling

Brody•2y ago

wow that is sick, that's come a long way since you first showed it off

theodorOP•2y ago

tysm

Brody•2y ago

Angelo, I thought the replica limit was increased to 10? was the frontend ui not updated to allow that yet?

angelo•2y ago

Yea, I am confused about that I just grabbed lunch- going to work through this

theodorOP•2y ago

were you able to manually scale it?

angelo•2y ago

well- I am more concerned that you can't set it past five but will knock that out for you too

angelo•2y ago

https://railway.app/project/120b5ec5-59d8-4087-84ae-4e0b3d934aa7/service/2686d71b-8276-4e98-8988-c3bf7faffdb2/metrics

angelo•2y ago

so it seems that you aren't hitting your limits?
ahhh nvm L
https://railway.app/project/120b5ec5-59d8-4087-84ae-4e0b3d934aa7/service/d89b8d57-7c05-4bab-bf8a-27bc11f78cbc/metrics this right? tagging to confirm @theodor
So looking at your logs, you seem to be processing a lot of requests that don't need to be processed? It seems that you are hitting /refresh a crazy number of times when it shouldn't
anyway, I digress

theodorOP•2y ago

@Angelo that's right! sorry
i just fixed a bug that should make things much better
we may be able to stick with 5
/refresh token you mean? maybe there's something funky we're doing

angelo•2y ago

yea either way! it's a bug on our end, we are fixing the cap

theodorOP•2y ago

tysm! but rn it's 10?

angelo•2y ago

deploying new fix, should be 20

theodorOP•2y ago

ok we can ask to reduce it later once we don't need it

angelo•2y ago

gotta scale

theodorOP•2y ago

yeah

angelo•2y ago

you can lower the number (ideally this would be based on load)

theodorOP•2y ago

yeah thanks so much for helping us here btw!

angelo•2y ago

np! hit us up in this thread if you run into some challenges (also if you shout us out I will retweet hehe)

theodorOP•2y ago

thank you! oh yeah definitely let me do it
so far so good, we also fixed a bug on our end
@Angelo Can you bump us to 15? We're soon going to deploy some parallelization changes but it seems things are creeping

Brody•2y ago

is it appropriate to say, suffering from success?

theodorOP•2y ago

a little bit haha

angelo•2y ago

the UI should be updated, can you type in 15 replicas and see if that works?

theodorOP•2y ago

let me check! thanks
it worked! redeploying

johns•2y ago

@Angelo @Brody 🙏 QQ - we're trying to add multiprocessing, but we need to use a Dockerfile. How does the port allocation work in this case for having replicas, since I know Railway injected PORT during buildtime?
Or I guess a better question is - how do the replicas work behind the scenes?

Brody•2y ago

you don't need to do anything different, that part works the exact same
it's just your service being duplicated your chosen amount, and then an incoming request is proxied to one of the services at a time, with (i think) round robin
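
A minimal sketch of the round-robin idea Brody describes, assuming strict rotation over a fixed list. The replica names are hypothetical, and Railway's actual proxy may balance differently (Brody himself hedges with "i think").

```python
from itertools import cycle


def make_round_robin(replicas):
    """Return a picker that hands out replicas in strict rotation."""
    rotation = cycle(replicas)
    return lambda: next(rotation)


# Hypothetical replica names, for illustration only.
pick = make_round_robin(["replica-0", "replica-1", "replica-2"])

# Seven incoming requests are spread evenly, wrapping around after the last replica.
assigned = [pick() for _ in range(7)]
print(assigned)
```

Because each request may land on a different replica, anything kept in one process's memory (sessions, caches) will not be visible to the others.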

johns•2y ago

are the replicas in separate "physical" instances? like do they have independent ports

Brody•2y ago

since the PORT variable is auto generated, they would have different ports, yes

Adam•2y ago

If you define a PORT variable they should(?) have the same port

Brody•2y ago

even if you set a specific PORT in the service variables, it doesn't make a difference, it would work the same
if your next question is "can the replicas share data between each other" the answer is no, not natively

johns•2y ago

haha that wasn't the question I had but one sec, need to look into something
Do you assign some sort of unique identifier that the replica would know? Like a REPLICA_ID env variable that's unique to the replica
There's a workaround but just wanted to ask

Brody•2y ago

indeed RAILWAY_REPLICA_ID
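
A small sketch of how a service might read that variable, for example to tag log lines per replica. Only the name RAILWAY_REPLICA_ID comes from this thread; the helper names and the "local" fallback are assumptions for illustration.

```python
import os


def replica_id(env=None, fallback="local"):
    """Return the replica's ID, or a fallback when the variable is unset or empty."""
    env = os.environ if env is None else env
    return env.get("RAILWAY_REPLICA_ID") or fallback


def log_prefix(env=None):
    """Build a prefix so log lines can be traced back to one replica (hypothetical helper)."""
    return f"[replica={replica_id(env)}]"


print(log_prefix(), "service starting")
```

The fallback keeps the same code usable on a laptop, where the variable is presumably not set.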

johns•2y ago

sick

Brody•2y ago

very sick

angelo•2y ago

same port, but we proxy it on your behalf
we are going to open up the internal networking stuff we do on your behalf to make this more custom
cough private networks cough
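
Whichever way the ports are assigned, the practical takeaway from this exchange is that the app only needs to listen on whatever PORT it is handed. A minimal sketch, assuming the value arrives as an environment variable; the default of 8000 is an arbitrary choice for local runs, not something Railway specifies.

```python
import os


def listen_port(env=None, default=8000):
    """Return the port to bind to: the injected PORT if present, else a local default."""
    env = os.environ if env is None else env
    return int(env.get("PORT", default))


print(f"binding to 0.0.0.0:{listen_port()}")
```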

johns•2y ago

I see I see

johns•2y ago

So I tried increasing the number of uvicorn workers and the vCPU is looking mad crazy - would love to understand why 👀

johns•2y ago

ah nevermind, it just came down. Seems to be that it was lagging

Brody•2y ago

what have you increased workers to? you should be able to go up to 65 with 32 vCPUs
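
The 65 here matches the rule of thumb from the Gunicorn docs, (2 x cores) + 1. A quick sketch of that arithmetic; it is a starting point to tune against real metrics, not a guarantee that 65 workers is right for this app.

```python
def suggested_workers(cores: int) -> int:
    """Gunicorn's suggested starting point for worker count: (2 x cores) + 1."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 2 * cores + 1


print(suggested_workers(32))  # 65
```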

johns•2y ago

Just trying to gauge whether increasing workers on gunicorn is better than increasing replicas on Railway's side
Right now, I've configured it to be 3 replicas and 4 workers

Brody•2y ago

depends, what do your cpu metrics look like

johns•2y ago

After 1:30pm is 2 replicas and 4 workers - before 1:00pm is just 7 replicas
I think it makes sense and I was just confused. Will let you know if I see any other issues
fyi this is the script we're using on start
service datadog-agent start && python manage.py migrate && ddtrace-run gunicorn backend.asgi:application -k uvicorn.workers.UvicornWorker --workers=4
Actually, on a second look, it does seem like the gunicorn workers are costing significantly more vCPUs - with only replicas (7 replicas), it seems like vCPU usage was 6~8, but with gunicorn it does seem a lot spikier
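
One way to compare these setups is to count worker processes in total: replicas times workers per replica. A rough sketch, assuming the replicas-only setup ran one server process per replica (the thread does not say so explicitly) and ignoring the gunicorn master process.

```python
def total_processes(replicas: int, workers_per_replica: int = 1) -> int:
    """Count worker processes across all replicas."""
    return replicas * workers_per_replica


# Configurations mentioned in the thread; "1 process each" is an assumption.
setups = {
    "7 replicas, 1 process each": total_processes(7),
    "3 replicas x 4 workers": total_processes(3, 4),
    "2 replicas x 4 workers": total_processes(2, 4),
}

for label, count in setups.items():
    print(f"{label}: {count} worker processes")
```

The counts alone do not explain the spikier vCPU graph, but they show the setups are not like-for-like, so a difference in usage is not surprising.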

Brody•2y ago

ah growing pains

johns•2y ago

@Angelo so underneath the hood, is the replica more like spinning up k8s pods? Because if that’s the case we shouldn’t use any workers on gunicorn
From the speed in which the replicas get spun up I would assume that this is the case

Brody•2y ago

I'm pretty sure railway just builds the image once, runs the image your chosen amount of times, then load balances incoming requests out to the replica set

angelo•2y ago

yep- nothing crazy, just num containers, kinda like how docker compose would do it
hmm I wonder why it's like that?

theodorOP•2y ago

https://twitter.com/theomarcu/status/1664319413854650379?s=20 Here's a tweet giving you a shoutout!

Theodor Marcu (@theomarcu)

Honestly we wouldn't have been able to sustain 10x traffic to @RizzGPT_ over the past few days without the help of the @Railway team

Twitter

theodorOP•2y ago

Thanks again for the help @Angelo and @Brody Things are more manageable now!

angelo•2y ago

wait- were you at Princeton reunions?

theodorOP•2y ago

haha not this year! i wish

angelo•2y ago

I didn't go there but went and it was insane

theodorOP•2y ago

haha that's awesome hope you had a lot of fun!

angelo•2y ago

I did! ty for the shoutout, David (our Twitter guy) appreciates it!

theodorOP•2y ago

haha anytime! we also appreciate the RT
and glad you did! we were sadly too swamped with all the craziness to be there

angelo•2y ago

I get that
