Reading ~30GB sequentially from the volume makes the RAM usage go to ~30GB

Of course that doesn't happen locally. I'm testing how a new compute-heavy feature of my app scales before shipping, and this is a blocker for me. It's a problem because I don't want to be billed for 30GB of RAM for something I discard a few seconds later. I'm just trying to find out whether this might be a bug and/or misuse on my part before I start looking at an alternative host (for this particular workload; I'm still happy with Railway for the rest).
61 Replies
Percy
Percy3w ago
Project ID: 8bccf693-4059-4ef3-9dd0-55493979fdb7
Alaanor
Alaanor3w ago
8bccf693-4059-4ef3-9dd0-55493979fdb7
Alaanor
Alaanor3w ago
[image attachment]
Alaanor
Alaanor3w ago
I should note that it does get stuck there and will not go down again
Brody
Brody3w ago
are you loading the file into memory? and, reading a 30gb file from the disk to where?
Alaanor
Alaanor3w ago
my Rust backend needs to read all 180k files I have on the volume attached to the service. They're about 170KB each. Locally it goes fine and doesn't fill up my RAM
Brody
Brody3w ago
right, but you are reading these files off of disk? where are they going besides into RAM?
Alaanor
Alaanor3w ago
I'm not sure I understand the question. I'm reading literally from /cache/something/somefile.mspack, deserializing it, doing some compute, and then dropping it. /cache is where my volume is mounted
Brody
Brody3w ago
so these files are loaded into memory, then. I'm not sure why you are surprised that memory has increased?
Alaanor
Alaanor3w ago
once I have read and made use of the information, how can I discard it and let the unused RAM go back? Locally it is discarded right after each read, so my RAM never goes up. I can run this program with 1GB of RAM or less
Brody
Brody3w ago
that wouldn't be a platform specific question
Alaanor
Alaanor3w ago
Well if that doesn't sound like a bug/misuse to you, then I got my answer I guess 😄
Brody
Brody3w ago
yes unfortunately this would be a code issue
angelo
angelo3w ago
Yep, you will need to deallocate that in Rust. Sounds like you aren't resolving your lifetimes, afaik
Alaanor
Alaanor3w ago
I tried everything I could think of, really 😕 even an explicit drop(variable_with_deserialized_data) at the end of each loop. Even after my job is done and cleaned up, the memory doesn't drop. I want to stress again that locally the same job never goes above a few MB of RAM, on the same dataset.
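(For context, a rough sketch of the kind of loop being described, with a placeholder directory and a placeholder processing step rather than the actual backend code. Each iteration's buffer is freed when it goes out of scope, or via the explicit drop, so the process heap itself stays small.)
```rust
// Minimal sketch, not the actual backend: read each file under the volume,
// use the bytes, and drop them before the next iteration.
use std::fs;

fn main() -> std::io::Result<()> {
    for entry in fs::read_dir("/cache/something")? {
        let path = entry?.path();
        let bytes = fs::read(&path)?; // ~170 KB per file
        process(&bytes);              // deserialize + compute (placeholder)
        drop(bytes);                  // explicit drop, as tried above
    }
    Ok(())
}

fn process(bytes: &[u8]) {
    // placeholder for the real deserialization/compute step
    let _ = bytes.len();
}
```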
Brody
Brody3w ago
nixpacks or dockerfile?
Alaanor
Alaanor3w ago
[build]
builder = "NIXPACKS"

[deploy]
startCommand = "./api"
healthcheckPath = "/health"
healthcheckTimeout = 100

[phases.setup]
nixpkgsArchive = 'a459b363de387c078080b719b30c54d8a79b4a3e'
nixPkgs = ["...", "ffmpeg"]
nixLibs = ["...", "dav1d"]
not sure that matters tho; also I do my builds on GitHub Actions because they're not trivial, and then do a final railway up -e {env} -s {service_id} to upload
Brody
Brody3w ago
that's going to cause Railway to run another build
Alaanor
Alaanor3w ago
are you sure you're not mixing up 2 problems 😄 I was asking about something unrelated to deployment, I don't have a deployment problem
Brody
Brody3w ago
I know, I'm getting sidetracked, just wanted to point it out. But when in doubt, write a Dockerfile that uses Alpine instead of Nixpacks; I have seen Alpine-based Dockerfiles help with strange memory issues plenty of times before
Alaanor
Alaanor3w ago
I could give it a try
Brody
Brody3w ago
can't hurt
Alaanor
Alaanor3w ago
unfortunately I tried almost everything, but for Alpine I need a special compilation target and I can't get it to work easily with my Rust code, got too many deps. I could try a Docker image on Ubuntu or something, but at that point we're back to what Nixpacks does
Brody
Brody3w ago
ah yes the joys of rust, compiling
Alaanor
Alaanor3w ago
so I did a small experiment; the last peak is me running that heavy task (limited to 20k items so it doesn't go to 30GB again). I changed the start command to while true; do free -h; sleep 5; done & ./api and here are the results. We can clearly see that the used column stays at 251GB, before and after the task (why it shows me the host machine's specs instead of my container, no clue). But the buff/cache column, which is billed to me, did indeed grow by a few GB.
[screenshots: free -h output before/after the task]
Brody
Brody3w ago
buffer / cache is indeed included in the results of docker stats
Alaanor
Alaanor3w ago
I had some hope for a moment with O_DIRECT (https://linux.die.net/man/2/open); it works locally and doesn't bump the buff/cache, but on Railway I get an error. I guess the filesystem doesn't accept this custom flag :sossa: this is pain
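(For reference, a hedged sketch of what that attempt could look like in Rust, assuming the `libc` crate and an illustrative block size, not the actual code. O_DIRECT bypasses the page cache, but it imposes alignment requirements and not every filesystem/mount accepts the flag, which would match the error seen on the volume.)
```rust
// Sketch of an O_DIRECT read (assumption: libc crate, Linux, 4 KiB block size).
// The buffer address, file offset, and read length generally have to be
// aligned to the filesystem's logical block size, and some mounts reject
// the flag entirely.
use std::fs::OpenOptions;
use std::io::Read;
use std::os::unix::fs::OpenOptionsExt;

fn read_direct(path: &str) -> std::io::Result<Vec<u8>> {
    const ALIGN: usize = 4096; // assumed block size
    let mut file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open(path)?;

    // O_DIRECT needs an aligned buffer; a plain Vec is not guaranteed to be.
    let layout = std::alloc::Layout::from_size_align(ALIGN, ALIGN).unwrap();
    let ptr = unsafe { std::alloc::alloc_zeroed(layout) };
    assert!(!ptr.is_null());

    let mut out = Vec::new();
    loop {
        let buf = unsafe { std::slice::from_raw_parts_mut(ptr, ALIGN) };
        let n = file.read(buf)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&buf[..n]);
    }
    unsafe { std::alloc::dealloc(ptr, layout) };
    Ok(out)
}
```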
Brody
Brody3w ago
I wonder if the v2 runtime would also gather metrics that include the buffer/cache, but you would need to find a way to run your tests without a volume, since any service with a volume defaults back to the legacy runtime despite the selector saying v2
alex
alex3w ago
Have you tried copying the volume's contents onto the container's own disk on startup? In that case you may be able to stream from the local file system in Rust and it shouldn't increase the memory. The local file system is a little different from Docker-mounted volumes, I think
Brody
Brody3w ago
volumes are ext4 mounted zvols
alex
alex3w ago
I forgot railway doesn’t actually use docker
Brody
Brody3w ago
legacy runtime does
alex
alex3w ago
Man I don’t even use railway 🗣️
Brody
Brody3w ago
v2 uses podman
alex
alex3w ago
Man portable air defense? Oh wait that’s manpad
Alaanor
Alaanor3w ago
Since I could not find a solution with Railway for this particular thing, I bought a server somewhere else, although the disk IO isn't as good as Railway's and it adds some complexity for deployment and monitoring :( But yeah, I can't afford adding $200+ to my monthly bill just to read a few files sometimes. I still use Railway for a lot of other stuff and I'm happy with that. I figured I should use Railway where it is helping me instead of trying to fight it. No hate, I can understand why buff/cache is counted. Just wanted to give an update for future people finding this thread.
Solution
Brody
Brody3w ago
I wrote a benchmark test (in Go) to write 30000 1MiB files (for a total of 30GiB) to disk, 250 files concurrently. Running this locally, of course, there was no wild increase in memory. Running it on Railway with the legacy runtime had my memory reach ~23GB. I then switched to the v2 runtime and ran the program again, and the memory never increased above ~45MB; it also wrote the data a bit faster. I then re-ran the test on the v2 runtime, this time configured to write 50GiB worth of files, and still saw no wild increase in memory. And just to be sure, one last time I switched back to the legacy runtime and ran the test writing the 50GiB worth of files (same concurrency), and the memory this time peaked at 32GB. tl;dr: this issue is fixed in the v2 runtime, but only the legacy runtime has support for volumes (even if you select the v2 runtime, if you have a volume it will run with the legacy runtime)
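(The original benchmark code was in Go and isn't posted here; a rough Rust equivalent, taking only the file count, size, and concurrency from the description above and assuming a target directory, might look like this for anyone wanting to reproduce the behaviour.)
```rust
// Hedged sketch, not the actual Go benchmark: write 30000 x 1 MiB files
// (~30 GiB) with 250 concurrent workers to surface the page-cache growth.
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};
use std::{fs, thread};

fn main() -> std::io::Result<()> {
    const FILES: usize = 30_000; // 30000 files x 1 MiB = ~30 GiB
    const WORKERS: usize = 250;  // concurrency level from the test
    let payload = Arc::new(vec![0u8; 1024 * 1024]); // 1 MiB per file
    let next = Arc::new(AtomicUsize::new(0));
    fs::create_dir_all("/cache/bench")?; // assumed volume mount point

    // Fixed worker pool caps the in-flight writes at 250.
    let handles: Vec<_> = (0..WORKERS)
        .map(|_| {
            let (payload, next) = (payload.clone(), next.clone());
            thread::spawn(move || loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= FILES {
                    break;
                }
                fs::write(format!("/cache/bench/{i}.bin"), &*payload).unwrap();
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    Ok(())
}
```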
Brody
Brody3w ago
heres a memory graph that backs up these statements
[memory graph screenshot]
Brody
Brody3w ago
yes I know I'm writing files instead of reading them, but the same issue is being surfaced
alex
alex3w ago
interesting
Brody
Brody3w ago
I need a sticker: "V2 fixes it". The v2 builder / runtime / proxy has simply fixed a lot of issues thus far
alex
alex3w ago
when is v2 going GA?
Brody
Brody3w ago
asap once the bugs are fixed
alex
alex3w ago
nice
Brody
Brody3w ago
from the sounds of it, the legacy runtime will not be running on bare metal; it sounds like bare metal will be v2 runtime exclusive
alex
alex3w ago
oh nice
angelo
angelo3w ago
We are taking our time btw. We aren't going to do the V2 cutover until everything is polished
Brody
Brody3w ago
we aren't going to remove the legacy option*
angelo
angelo3w ago
Ofc we are moving fast, but migrating running workloads over will take time. It's likely that we'll have a one-way migration and will remove the option, like we did for BuildPacks to NixPacks
Brody
Brody3w ago
thankfully I wasn't around for that 😆
angelo
angelo3w ago
It was very painful. But we did it. And we'll accomplish this as well
Brody
Brody3w ago
I just wanna know if we will see the V2 runtime support volumes before bare metal; it's what's stopping Alaanor from running this project on Railway
angelo
angelo3w ago
Ofc, this will be added before metal. Completely stateless workloads aren't very useful
Brody
Brody3w ago
char said otherwise, unless I was misunderstanding him
angelo
angelo3w ago
Metal workloads will mostly be trial and experimental workloads until we flight the Railway Metal that requires volumes. If the servers are plugged in, why not serve from them. But we won't stop the metal rollout until we have vol. support
Brody
Brody3w ago
right, but people could benefit from volume support on the v2 runtime with the current GCP hosts, like OP, or everyone running Uptime Kuma who is getting EHOSTUNREACH
angelo
angelo3w ago
Yea, heard. We are speedrunning all shortcoming fixes when we can; never enough hands on board. But it's not an OR, it's an AND. Metal is happening on a different timeframe than the V2 cutover. It just so happens that Metal will be V2 only, no sense in extending the lifetime of Legacy
Brody
Brody3w ago
ay, at least I was right in that regard. either way, would you say it would be safe to mark this thread as solved, and would I be correct in saying this won't be getting fixed on the legacy runtime?
angelo
angelo3w ago
Yep
Alaanor
Alaanor3w ago
This is really cool, I appreciate the findings a lot. Thanks 👍 I'll be checking the Railway changelog frequently for v2 with volumes, and hopefully one day I can be fully back on Railway :)
Brody
Brody3w ago
hopefully!