Railway•14mo ago
theodor

urgent: horizontal scaling limited at 5 replicas

Hi! We're on the teams plan. We're trying to scale as fast as we can, but the scaling is limited at 5 replicas. is there anything we can do to get past that?
70 Replies
Percy
Percy•14mo ago
Project ID: 120b5ec5-59d8-4087-84ae-4e0b3d934aa7
theodor
theodor•14mo ago
120b5ec5-59d8-4087-84ae-4e0b3d934aa7
Brody
Brody•14mo ago
now hold on, you have access to 32 vCPU and 32 GB of ram, and you are still hitting those limits?
theodor
theodor•14mo ago
yes
Brody
Brody•14mo ago
do you know how expensive the bill is gonna be? I mean absolutely no offense at all when I say this, but I think you may be running into inefficiencies in your code and are trying to throw compute at the problem
theodor
theodor•14mo ago
do you work for railway? i don'
Brody
Brody•14mo ago
I do not
theodor
theodor•14mo ago
then that's not useful
Brody
Brody•14mo ago
okay okay fair
theodor
theodor•14mo ago
we can add multi threading later but rn i can't exactly rewrite shit
Brody
Brody•14mo ago
@Angelo - pulling you in
theodor
theodor•14mo ago
thank you
angelo
angelo•14mo ago
Indeed you can! Real quick- why? What are you hosting?
theodor
theodor•14mo ago
rizzgpt.app
angelo
angelo•14mo ago
OH SICK congrats ok- jumping on and scaling
Brody
Brody•14mo ago
wow that is sick, that's come a long way since you first showed it off
theodor
theodor•14mo ago
tysm
Brody
Brody•14mo ago
Angelo, I thought the replica limit was increased to 10? was the frontend ui not updated to allow that yet?
angelo
angelo•14mo ago
Yea, I am confused about that. I just grabbed lunch- going to work through this
theodor
theodor•14mo ago
were you able to manually scale it?
angelo
angelo•14mo ago
well- I am more concerned that you can't set it past five but will knock that out for you too
angelo
angelo•14mo ago
so it seems that you aren't hitting your limits? ahhh nvm L
https://railway.app/project/120b5ec5-59d8-4087-84ae-4e0b3d934aa7/service/d89b8d57-7c05-4bab-bf8a-27bc11f78cbc/metrics this right? tagging to confirm @theodor
So looking at your logs, you seem to be processing a lot of requests that you don't need to? It seems that you are hitting /refresh a crazy number of times when you shouldn't. anyway, I digress
theodor
theodor•14mo ago
@Angelo that's right! sorry
i just fixed a bug that should make things much better
we may be able to stick with 5
/refresh token you mean? maybe there's something funky we're doing
angelo
angelo•14mo ago
yea either way! it's a bug on our end, we are fixing the cap
theodor
theodor•14mo ago
tysm! but rn it's 10?
angelo
angelo•14mo ago
deploying new fix, should be 20
theodor
theodor•14mo ago
ok we can ask to reduce it later once we don't need it
angelo
angelo•14mo ago
gotta scale
theodor
theodor•14mo ago
yeah
angelo
angelo•14mo ago
you can lower the number (ideally this would be based on load)
theodor
theodor•14mo ago
yeah thanks so much for helping us here btw!
angelo
angelo•14mo ago
np! hit us up in this thread if you run into some challenges (also if you shout us out I will retweet hehe)
theodor
theodor•14mo ago
thank you! oh yeah definitely, let me do it
so far so good, we also fixed a bug on our end
@Angelo Can you bump us to 15? We're soon going to deploy some parallelization changes but it seems things are creeping
Brody
Brody•14mo ago
is it appropriate to say, suffering from success?
theodor
theodor•14mo ago
a little bit haha
angelo
angelo•14mo ago
the UI should be updated, can you type in 15 replicas and see if that works?
theodor
theodor•14mo ago
let me check! thanks, it worked! redeploying
johns
johns•14mo ago
@Angelo @Brody 🙏 QQ - we're trying to add multiprocessing, but we need to use a Dockerfile. How does the port allocation work in this case for having replicas, since I know Railway injected PORT during buildtime?
Or I guess a better question is - how do the replicas work behind the scenes?
Brody
Brody•14mo ago
you don't need to do anything different, that part works the exact same. it's just your service being duplicated your chosen number of times, and then each incoming request is proxied to one of the replicas at a time, with (i think) round robin
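For readers unfamiliar with the term, round robin just means handing each new request to the next replica in a fixed, repeating order. The sketch below only illustrates that idea; it is not Railway's proxy code, Brody himself was unsure round robin is what Railway uses, and the replica names are hypothetical.

```python
from itertools import cycle


def round_robin(replicas):
    """Return an iterator that yields replicas in a fixed, repeating order."""
    return cycle(replicas)


# Hypothetical replica names, for illustration only.
replicas = ["replica-0", "replica-1", "replica-2"]
picker = round_robin(replicas)

# Five incoming requests are spread across three replicas in turn.
assigned = [next(picker) for _ in range(5)]
print(assigned)
# ['replica-0', 'replica-1', 'replica-2', 'replica-0', 'replica-1']
```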
johns
johns•14mo ago
are the replicas in separate "physical" instances? like do they have independent ports
Brody
Brody•14mo ago
since the PORT variable is auto generated, they would have different ports, yes
Adam
Adam•14mo ago
If you define a PORT variable they should(?) have the same port
Brody
Brody•14mo ago
even if you set a specific PORT in the service variables, it doesn't make a difference, it would work the same
if your next question is "can the replicas share data between each other" the answer is no, not natively
johns
johns•14mo ago
haha that wasn't the question I had but one sec, need to look into something
Do you assign some sort of unique identifier that the replica would know? Like a REPLICA_ID env variable that's unique to the replica
There's a workaround but just wanted to ask
Brody
Brody•14mo ago
indeed RAILWAY_REPLICA_ID
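A minimal sketch of how a service might read that variable, for example to tag log lines per replica. Only the name RAILWAY_REPLICA_ID comes from the thread; the `replica_label` helper and the "local" fallback for runs outside Railway are assumptions for illustration.

```python
import os


def replica_label(env=os.environ):
    """Return the replica's ID, or "local" when the variable is not set.

    RAILWAY_REPLICA_ID is the variable named in the thread; the "local"
    fallback is an assumption for running outside Railway.
    """
    return env.get("RAILWAY_REPLICA_ID", "local")


# Example: prefix log output so lines from different replicas can be told apart.
print(f"[replica={replica_label()}] worker started")
```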
johns
johns•14mo ago
sick
Brody
Brody•14mo ago
very sick
angelo
angelo•14mo ago
same port, but we proxy it on your behalf. we are going to open up the internal networking stuff we do on your behalf to make this more custom (cough, private networks, cough)
johns
johns•14mo ago
I see I see
johns
johns•14mo ago
So I tried increasing the number of uvicorn workers and the vCPU is looking mad crazy - would love to understand why 👀
johns
johns•14mo ago
ah nevermind, it just came down. Seems to be that it was lagging
Brody
Brody•14mo ago
what have you increased workers to? you should be able to go up to 65 with 32 vCPUs
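The figure of 65 lines up with the rule of thumb in the gunicorn documentation, (2 x cores) + 1. A small sketch of that arithmetic follows; the function name is hypothetical, and the result is a starting point to tune against real load rather than a guarantee.

```python
def suggested_workers(vcpus: int) -> int:
    """Gunicorn's documented rule of thumb: (2 x cores) + 1."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return 2 * vcpus + 1


print(suggested_workers(32))  # 65
```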
johns
johns•14mo ago
Just trying to gauge whether increasing workers on gunicorn is better than increasing replicas on Railway's side
Right now, I've configured it to be 3 replicas and 4 workers
Brody
Brody•14mo ago
depends, what do your cpu metrics look like
johns
johns•14mo ago
After 1:30pm is 2 replicas and 4 workers - before 1:00pm is just 7 replicas
I think it makes sense and I was just confused. Will let you know if I see any other issues
fyi this is the script we're using on start:
service datadog-agent start && python manage.py migrate && ddtrace-run gunicorn backend.asgi:application -k uvicorn.workers.UvicornWorker --workers=4
Actually, on a second look, it does seem like the gunicorn workers are costing significantly more vCPUs - with only replicas (7 replicas), it seems like vCPU usage was 6~8, but with gunicorn it does seem a lot spikier
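When comparing worker counts against replica counts, it can help to make the worker count adjustable without editing the start command. The following is a sketch of a `gunicorn.conf.py` that mirrors the flags in the command above (`-k uvicorn.workers.UvicornWorker`, `--workers=4`). Reading `WEB_CONCURRENCY` and `PORT` from the environment, the default port of 8000, and the helper names are assumptions, not something the thread's authors describe using.

```python
# gunicorn.conf.py (sketch): same worker class and default worker count
# as the start command in the thread, but adjustable via the environment.
import os


def worker_count(env, default=4):
    """Number of gunicorn workers; WEB_CONCURRENCY overrides the default of 4."""
    return int(env.get("WEB_CONCURRENCY", default))


def bind_address(env, default_port="8000"):
    """Listen on all interfaces, on PORT if the platform provides one."""
    return f"0.0.0.0:{env.get('PORT', default_port)}"


# Module-level names are the settings gunicorn reads from a config file.
workers = worker_count(os.environ)
worker_class = "uvicorn.workers.UvicornWorker"
bind = bind_address(os.environ)
```

With this file in place, the last step of the start command could become `gunicorn backend.asgi:application -c gunicorn.conf.py`, and a worker-count experiment becomes a variable change rather than a redeploy of a new command.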
Brody
Brody•14mo ago
ah growing pains
johns
johns•14mo ago
@Angelo so underneath the hood, is the replica more like spinning up k8s pods? Because if that's the case we shouldn't use any workers on gunicorn
From the speed at which the replicas get spun up I would assume that this is the case
Brody
Brody•14mo ago
I'm pretty sure railway just builds the image once, runs the image your chosen amount of times, then load balances incoming requests out to the replica set
angelo
angelo•14mo ago
yep- nothing crazy, just num containers, kinda like how docker compose would do it
hmm I wonder why it's like that?
theodor
theodor•14mo ago
Theodor Marcu (@theomarcu)
Honestly we wouldn't have been able to sustain 10x traffic to @RizzGPT_ over the past few days without the help of the @Railway team
Twitter
theodor
theodor•14mo ago
Thanks again for the help @Angelo and @Brody Things are more manageable now!
angelo
angelo•14mo ago
wait- were you at Princeton reunions?
theodor
theodor•14mo ago
haha not this year! i wish
angelo
angelo•14mo ago
I didn't go to Princeton but I went to reunions and it was insane
theodor
theodor•14mo ago
haha that's awesome hope you had a lot of fun!
angelo
angelo•14mo ago
I did! ty for the shoutout, David (our twitter guy) appreciates it!
theodor
theodor•14mo ago
haha anytime! we also appreciate the RT and glad you did! we were sadly too swamped with all the craziness to be there
angelo
angelo•14mo ago
I get that