Reading ~30GB sequentially from the volume makes the RAM usage go to ~30GB
ofc that doesn't happen locally. I'm test-scaling a new compute-heavy feature of my app before shipping and this is a blocker for me. This is a problem because I don't want to be billed for 30GB for something I discard a few seconds later. I'm just looking into whether this might be a bug and/or misuse before I start looking at an alternative host (for this particular workload; I'm still happy for the rest).
Project ID:
8bccf693-4059-4ef3-9dd0-55493979fdb7
I should note that it does get stuck there and will not go down again
are you loading the file into memory?
and, reading a 30gb file from the disk to where?
my Rust backend needs to read all 180k files I have on the volume attached to the service. They're about 170KB each
locally it goes fine and doesn't fill up my ram
right but you are reading these files off of disk? where are they going beside into ram?
I'm not sure I understand the question
I'm literally reading something like /cache/something/somefile.mspack, deserializing it, doing some compute, and then dropping it
/cache is where my volume is mounted at
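For context, a minimal sketch of that loop (with hypothetical deserialize / do_compute stand-ins for the real msgpack decoding and the compute step) would be something like:

```rust
use std::fs;
use std::path::Path;

// Hypothetical stand-ins for the real decode and compute steps.
fn deserialize(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }
fn do_compute(_value: &[u8]) {}

fn process_cache(dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("mspack") {
            continue;
        }
        let bytes = fs::read(&path)?;    // ~170KB per file
        let value = deserialize(&bytes); // decode
        do_compute(&value);              // use it
        drop(value); // both buffers are freed at the end of the iteration anyway;
        drop(bytes); // the explicit drops just make the intent obvious
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    process_cache(Path::new("/cache"))
}
```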
so these files are loaded into memory then, I'm not sure why you are surprised that memory has increased?
once I have read and made use of the information, how can I discard it and let the unused RAM go back?
locally it is discarded right after each read, so my RAM never goes up. I can run this program with 1GB of RAM or less
that wouldn't be a platform specific question
Well if that doesn't sound like a bug/misuse to you, then I got my answer I guess 😄
yes unfortunately this would be a code issue
Yep, you will need to deallocate that in Rust
sounds like you aren't resolving your lifetimes afaik
I tried all I could think of really 😕 even with an explicit
drop(variable_with_deserialized_data)
at the end of each loop. Even after my job is done and cleaned up, it doesn't drop. I should also point out that locally the same job never goes above a few MB of RAM, on the same dataset.
nixpacks or dockerfile?
not sure that matters tho, also I do my builds on GitHub Actions because they're not trivial, and I do a final
railway up -e {env} -s {service_id}
to upload
that's going to cause railway to run another build
are you sure you're not mixing 2 problems? 😄 I was on something unrelated to deployment, I don't have a deployment problem
i know, im getting side tracked, just wanted to point it out
but when in doubt write a dockerfile that uses alpine instead of nixpacks
have seen alpine based dockerfiles help with strange memory issues before plenty of times
I could give a try
cant hurt
unfortunately I did almost everything, but for alpine I need some special target compilation and I can't get it to work easily with my rust code. got too many deps
I could try a docker image on ubuntu or something, but at that point we're back to what nixpacks does
ah yes the joys of rust, compiling
so I did a small experiment, the last peak is me running that heavy task (limited to 20k items so it doesn't go to 30GB again). I have changed the start command to
while true; do free -h; sleep 5; done & ./api
and here are the results. We can clearly see that the used column stays at 251GB, before and after the task (why it shows me the host machine's specs instead of my container, no clue). But the buff/cache column is billed to me and indeed grew by a few GB.
buffer / cache is indeed included in the results of docker stats
I had some hope for a moment with O_DIRECT (https://linux.die.net/man/2/open), it works locally and doesn't bump the buff/cache, but on railway I get an error, I guess the filesystem doesn't accept this custom flag :sossa:
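For reference, that O_DIRECT attempt looks roughly like this in Rust (assuming Linux and the libc crate; the length handling is simplified). O_DIRECT bypasses the page cache, but it needs the buffer, offset, and length aligned to the block size, which is also a common source of EINVAL:

```rust
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

const ALIGN: usize = 4096; // typical logical block size; may be 512 on some disks

fn read_direct(path: &str, len: usize) -> std::io::Result<Vec<u8>> {
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT) // bypass the kernel page cache
        .open(path)?;

    // Round the request up to a multiple of ALIGN and make sure the buffer
    // start is ALIGN-aligned by over-allocating and slicing at the offset.
    let padded = (len + ALIGN - 1) / ALIGN * ALIGN;
    let mut backing = vec![0u8; padded + ALIGN];
    let offset = backing.as_ptr().align_offset(ALIGN);

    // Single read for brevity; real code would loop until EOF.
    let n = file.read(&mut backing[offset..offset + padded])?;
    Ok(backing[offset..offset + n].to_vec())
}

fn main() -> std::io::Result<()> {
    let bytes = read_direct("/cache/something/somefile.mspack", 170 * 1024)?;
    println!("read {} bytes", bytes.len());
    Ok(())
}
```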
this is pain
I wonder if the v2 runtime would also gather metrics that include the buffer / cache, but you would need to find a way to run your tests without a volume since any service with a volume defaults back to the legacy runtime despite the selector saying v2
Have you tried copying the volume’s contents into the container’s disk on startup?
In that case you may be able to do local file system file streaming in rust and it shouldn’t increase the memory
Local file system is a little different than docker mounted volumes i think
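A rough sketch of that idea (paths are made up): copy everything off the volume onto the container's own disk once at startup, then stream each file in fixed-size chunks so only one chunk sits in memory at a time:

```rust
use std::fs::{self, File};
use std::io::{BufReader, Read};
use std::path::Path;

// Copy a flat directory of files from the mounted volume to local disk.
fn copy_volume_to_local(src: &Path, dst: &Path) -> std::io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        fs::copy(entry.path(), dst.join(entry.file_name()))?;
    }
    Ok(())
}

// Read a file in 64KiB chunks instead of pulling it into memory whole.
fn stream_file(path: &Path) -> std::io::Result<()> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut chunk = [0u8; 64 * 1024];
    loop {
        let n = reader.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        // ...feed chunk[..n] into an incremental decoder here
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let local = Path::new("/tmp/cache-copy"); // illustrative local path
    copy_volume_to_local(Path::new("/cache"), local)?;
    for entry in fs::read_dir(local)? {
        stream_file(&entry?.path())?;
    }
    Ok(())
}
```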
volumes are ext4 mounted zvols
I forgot railway doesn’t actually use docker
legacy runtime does
Man I don’t even use railway
🗣️
v2 uses podman
Man portable air defense?
Oh wait that’s manpad
Since I could not find a solution with railway for this particular thing, I bought a server somewhere else, although the disk IO isn't as good as railway's and this adds some complexity for deployment and monitoring :( But yeah, I can't afford adding $200+ to my monthly billing just to read a few files sometimes. I still use railway for a lot of other stuff and I'm happy with that. I just figured out that I should use railway where it is helping me instead of trying to fight it. No hate, I can understand why buff/cache is counted. Just wanted to give an update for future people searching this thread.
Solution
i wrote a benchmark test (in Go) to write 30000 1MiB files (for a total of 30GiB) to disk at 250 files concurrently (a rough Rust equivalent is sketched after this message).
running this locally of course there was no wild increase in memory.
running this program on railway with the legacy runtime had my memory reach ~23GB.
i then switched to the v2 runtime and ran the program again and the memory never increased above ~45MB, it also wrote the data a bit faster.
i then re-ran the test on the v2 runtime but this time configured it to write 50GiB worth of files and still saw no wild increase in memory.
and just to be sure one last time, i switched back to the legacy runtime and ran the test to write the 50GiB worth of files (same concurrency), and the memory this time peaked at 32GB.
tl;dr this issue is fixed in the v2 runtime but only the legacy runtime has support for volumes (even if you select the v2 runtime and you have a volume it will run with the legacy runtime)
here's a memory graph that backs up these statements
yes I know I'm writing files instead of reading them, but the same issue is being surfaced
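The original benchmark was a Go program, but a rough equivalent of what it does (30,000 files of 1MiB each, 250 writers at a time; the target directory is made up) would look something like this in Rust:

```rust
use std::fs;
use std::io::Write;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

const TOTAL_FILES: usize = 30_000;
const FILE_SIZE: usize = 1024 * 1024; // 1MiB per file, ~30GiB total
const WORKERS: usize = 250;           // files written concurrently

fn main() -> std::io::Result<()> {
    let dir = "/cache/bench"; // illustrative path on the mounted volume
    fs::create_dir_all(dir)?;

    let next = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::with_capacity(WORKERS);

    for _ in 0..WORKERS {
        let next = Arc::clone(&next);
        let dir = dir.to_string();
        handles.push(thread::spawn(move || -> std::io::Result<()> {
            let payload = vec![0u8; FILE_SIZE]; // one reusable 1MiB buffer per worker
            loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= TOTAL_FILES {
                    return Ok(());
                }
                let mut f = fs::File::create(format!("{dir}/file-{i:05}.bin"))?;
                f.write_all(&payload)?;
            }
        }));
    }

    for h in handles {
        h.join().expect("worker panicked")?;
    }
    Ok(())
}
```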
interesting
I need a sticker "V2 fixes it"
v2 builder / runtime / proxy has just simply fixed a lot of issues thus far
when is v2 going ga ?
asap once the bugs are fixed
nice
from the sounds of it the legacy runtime will not be running on bare metal
it sounds like bare metal will be v2 runtime exclusive
oh nice
We are taking our time btw
We aren’t going to V2 cutover until everything is polished
we aren't going to remove the legacy option*
Ofc we are moving fast- but migrating running workloads over will take time
It's likely that we'll have a one-way migration and will remove the option, like we did for BuildPacks to NixPacks
thankfully I wasn't around for that 😆
It was very painful
But we did it
And we’ll accomplish this as well
I just wanna know if we will see the V2 runtime support volumes before bare metal, it's what's stopping alaanor from running this project on railway
Ofc, this will be added before metal
Completely stateless workloads aren’t very useful
char said otherwise, unless I was misunderstanding him
Metal workloads will mostly be Trial and experimental workloads until we flight Railway Metal that requires volumes
If the servers are plugged in, why not serve from them
But we won’t stop the metal rollout until we have vol. support
right but people could benefit from volume support on the v2 runtime with the current gcp hosts, like OP, or everyone running uptime kuma that are getting EHOSTUNREACH
Yea, heard, we are speedrunning fixes for all the shortcomings when we can
Never enough hands on boards
But it's not an OR, it's an AND
Metal is happening on a different timeframe than V2 cutover
It just so happens that Metal will be V2 only, no sense in extending the lifetime of Legacy
ay at least i was right in that regard
either way, would you say it would be safe to mark this thread as solved, and would I be correct in saying this won't be getting fixed on the legacy runtime?
Yep
This is really cool, appreciate the finding a lot. Thanks 👍 I'll be checking the railway changelog for v2 with volume frequently and hopefully one day I can be fully back on railway :)
hopefully!
@Brody @Angelo I saw v2 volumes are now a thing, so I spent the day setting things up to try on railway again, but unfortunately it's still stuck the same way. Not to complain or anything, just wanted to share that it did not magically fix that, as we thought it perhaps would.
you might not be on the v2 runtime
at least the UI said I was on it. I remember you said that it might be a lie from the UI because it would not work with volumes, but now we got volumes on v2 so I thought I could maybe trust that UI
not all of railway's hosts support volumes on the v2 runtime, a surefire way to be sure you are on the v2 runtime would be to check for container event logs like "starting container"
Just wanted to say that I have finally moved this particular service back to railway and it's working great now 👍 thanks again for all the help
that's awesome!