How should I store/load my data for network storage?
Hi,
I've been keeping my data in an SQL database, which is excruciatingly slow on runpod with network storage.
But I don't see any obvious alternative.. What type of file could my data live in on disk, so that it loads fast in the network storage scenario of runpod?
How much data are u loading?
how are u loading it? why not a CSV / JSON file? Or something? Must it be SQL? Is this millions of rows? etc.
Anything read from network storage is as slow as constipation, network storage is garbage and unusable
hi, my data is not text, it's genomics, so right now i'm saving numpy arrays in an SQL database
it's about 200G
Yeah, the way I see network storage is as a External Hard drive lol
Do u need to load all 200G? at once?
Does the data change?
An external hard drive is MUCH faster than network storage
it does not change, i need to read it once per epoch, ideally in random order
Network storage is like sending stuff via a pigeon, but then again a pigeon will probably even be faster
How much data are u loading per epoch?
all of it eventually
How are u reading it? is ur data indexed properly?
So just to confirm:
You have a SQLite database of about 200GB, and you are randomly selecting slices of data for epoch cycles?
You can maybe try chatting with @JM about getting some kind of custom storage for your use case, but you may have to commit to a certain level of spend or something.
i did a test, a script that just loads all the records from the db. In my system it's doing something like 6000/second in runpod it was doing 2/second
Honestly, you could probably just divide your data straight up into files of N x Epoch_Size in your network storage. Create a hashmap of all the file names. Randomly pick one, remove from hashmap.
Load the file for the next X cycles. And in the background, in a different thread, you can also load the next file for the next epoch cycle into a variable
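something like this maybe (just a rough sketch, assuming the arrays get pre-split into chunk files; the folder and file names are made up):
import os, random
import numpy as np

# rough sketch, not tested: assumes the 200GB was pre-split into files like chunk_00000.npy,
# each holding a batch of samples; the /workspace/chunks path is just a placeholder
data_dir = "/workspace/chunks"
remaining = set(f for f in os.listdir(data_dir) if f.endswith(".npy"))  # the "hashmap" of file names

while remaining:
    fname = random.choice(tuple(remaining))          # pick a chunk at random
    remaining.remove(fname)                          # so it isn't picked again this epoch
    chunk = np.load(os.path.join(data_dir, fname))   # one disk read gives a whole batch of samples
    np.random.shuffle(chunk)                         # shuffle within the chunk too
    for sample in chunk:
        ...                                          # training step goes here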
I shuffle the main index and query the db index by index. Every new epoch is a new shuffle
Not surprising, Network storage is unbearably slow
I feel, that something about this feels wrong...
What does shuffle the main index mean?
But if it was me, I'd do this
u might not even need to load something in the background
But i'd just straight up divide to the pure text format, and avoid the SQLite
SQLite, with all the query time could also be straight up hurting u just due to the way the searching would be working over 200GBs
Both have to load from disk, so I don't see how it will make any difference
essentially
order = np.arange(lenofdatabase) + 1
np.random.shuffle(order)
then
for i in order:
    cur.execute("SELECT ... WHERE id = ?", (i,))
    row = cur.fetchone()
SQLite has to keep a tree of its data, so it has to walk the index for every lookup in a shuffle like that
Yea this is probably bad
but my data is not text!
it looks like it's doing a full table scan
Ah what is the data tho?
its just np arrays right?
numpy arrays he said
u can serialize it into something
yep numpy arrays
Honestly, you could do some sort of optimization around that sql query to do like preloaded batches of randomness
that can't be faster than loading an array from the disk !
this looks like a random selection every time
over the full database
Well, if ur loading from a SQLite database or a text file, it's both from disk, but yeah, i get the point. I think we can go the route of fixing the SQL query
arent databases incredibly fast at fetching record i from a table when i is a proper index? that's what i'm doing, in a specific order
Let me read over this sql query some more and think xD been a bit since ive worked with it
what ur trying to do
Network storage is so slow you are probably better off storing your data in a cloud mongodb or something and fetching from there, I am sure it will be much faster
and as i said, this system is blazing fast on my system
so what if I don't use network storage
but instead spawn a pod with the needed space ?
and just move the data in there in 'runtime' ?
yeah because your system isn't using network storage that is at least 20 times slower than normal disk
Yea that would work too
I guess can do a small sample size for testing
on a CPU Pod or something
I guess the issue I worry about is that with 200GB worth of data, sending off an individual SQL statement per index is going to add up
But as Ashelyk said, maybe mongodb / planetscale etc could be good, but also id worry about the cost ur incurring with that much data
enter my other problem with all this.
It looks like my work is blocking tcp ports but they won't acknowledge it. So i cant scp data from work, I have to do it from home at 10 Mbps
doesn't your work allow normal port 22 though? then you can get a cheap vm from scaleway/digital ocean/linode etc and use it as a jump box from your work to RunPod
I guess my immediately thought is that what you could do is that you could for ex as you already are doing:
1) Create an array of random indexes, so something like:
[x, y, z, a, b, c ...]
2) Do a batch fetch instead of individual indexes, pull like 20 at a time (rough sketch below):
SELECT * FROM your_table WHERE id IN (1, 2, 3, 4, 5);
3) Then in the background during the epoch (I guess this takes time), load another 20 indexes or whatever amount, so they are immediately in memory for when the epoch is done. This could be harder than it sounds haha, cause now u need to do parallel work, but python's standard library has queues for producer / consumer sort of patterns.
I think this is better than what you do now, b/c rn, you are pulling individually in synchronous order I assume
Benefits:
1) You aren't sending like individual index queries over 200GB
2) You are loading in the background during each epoch, so the next is ready
Essentially the background thread, can just add to a queue if the queue is < some X size, so that it is ready in memory and keep just checking if it needs to add to the queue
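rough sketch of what the batched fetch (point 2) could look like, assuming a table called samples with an integer primary key id and a blob column data (those names are made up):
import sqlite3
import numpy as np

# rough sketch, not tested: table/column names (samples, id, data) are placeholders
conn = sqlite3.connect("/workspace/dataset.db")
n_rows = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

order = np.arange(1, n_rows + 1)     # assumes ids run 1..N
np.random.shuffle(order)

batch_size = 256
for start in range(0, len(order), batch_size):
    ids = order[start:start + batch_size].tolist()
    placeholders = ",".join("?" * len(ids))          # one round trip per batch instead of per row
    rows = conn.execute(
        f"SELECT id, data FROM samples WHERE id IN ({placeholders})", ids
    ).fetchall()
    # note: IN (...) doesn't return rows in the order of ids, reorder here if that matters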
Yeah, I think something like this would probably work. Even though SQL is fast, sending 200GB worth of individual SQL queries is also just inefficient. (Even if network storage is slow too haha)
Or at least just the batch processing itself is good enough could be, might be worth testing this by itself
Orrrrrr.. as u said just make something big enough to hold ur dataset on container storage lol. see if it's worth going that route too
hey thanks for all this, i'll get back to it asap, but i need to attend to some scary bureaucracy right now !
Also just btw, about making individual sql queries like that: sql tables under the hood are usually some sort of tree, so it probably takes something like Requests * Log(N) time, which is why it can make sense to multithread it or batch process it (just to reduce the number of round trips over to the network drive, and SQLite can probably optimize when ur batch processing multiple indexes at once vs one by one)
Okay so summary of stuff:
1) SQLite individual queries like that can be slow b/c you are doing (number of indexes) * Log(N) lookups, since SQLite is using a tree under the hood. That's N*Log(N) total, which is slow by itself.
2) You can make it go faster by batch processing, so you get something like
N/BatchSize * LogN requests + also SQL can probably do some optimizations under the hood so you aren't just constantly making tons of round trips back and forth.
3) You can use a producer/consumer pattern, where one thread constantly keeps a thread-safe queue filled to some size X, so your main thread can just keep pulling new data from the queue to run the epoch on; that way the trips to fill the queue happen in parallel with the time ur spending on the epoch itself.
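and a rough sketch of the producer/consumer part (3) with the stdlib queue; fetch_batch(ids) is assumed to be something like the batched SELECT above, and id_batches a shuffled list of id lists:
import queue
import threading

# rough sketch: fetch_batch(ids) and id_batches are assumed from the batched-SELECT idea above
prefetch = queue.Queue(maxsize=4)        # never hold more than 4 batches in memory

def producer():
    for ids in id_batches:
        prefetch.put(fetch_batch(ids))   # blocks while the queue is full
    prefetch.put(None)                   # sentinel so the consumer knows it's done

threading.Thread(target=producer, daemon=True).start()

while True:
    batch = prefetch.get()               # the main thread just consumes batches that are already loaded
    if batch is None:
        break
    # training step on batch here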
Gl gl
I think I had actually tried the batch SELECT at some point in the past and didn't see any big improvement in speedup
Got it, so your best bet is probably still Batch + the producer/consumer pattern then
ChatGPT example
This probably isn't fully correct from what I read, but you would get the idea
yeah i'm familiar with queues in python
Also, I would say actually, that spitting to a text file is faster than a SQL tree now, b/c u need to imagine, that actually, creating a hashmap of ur file names, and reading it, is an O(N) selection time as u randomly choose + remove from the hashmap.
Vs a tree introduces additional overhead to query the SQL table.
Since u are going to read through the entire 200GB anyways, a tree underneath the SQL table isnt necessary
Ik u said it's numpy arrays
but serializing it and reading it prob wont be that bad. u can try on a smaller dataset, but just my two cents, i wouldnt know without testing, but just in theory in my head a SQL database does still have overhead even with indexes bc its usually a B-Tree underneath the hood.
Both in my head still read from disk so really ur cost is serialization + deserialization + query time, and query time is probably ur largest cost right now + also network storage is slow.
but yeah xD sorry for the long text
hopefully some avenues to look down
i could just dump the np arrays as files if i'm reading files from the disk
should still be much faster than deserializing, plus much less space
Yeah, could try on a small dataset vs SQLite, see what the cost is to query over SQLite, vs just keeping a hashmap of all the file names, selecting one at random, removing it from the hashmap + reading that file
but i would also probably run into os problems
i don't think 100k or so files would make the filesystem happy
Yea... haha. could batch it up, or if its on the container storage and not in network volume, could be better.
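if it helps, a rough sketch of "batching it up" so u end up with a few hundred chunk files instead of 100k+ (get_sample(i), n_samples and the output names are placeholders, and it assumes the samples share a shape):
import numpy as np

# rough sketch: pack many samples into one .npy per chunk; get_sample(i) / n_samples are
# placeholders for however the arrays come out of the current database
chunk_size = 4096
chunk, chunk_id = [], 0
for i in range(n_samples):
    chunk.append(get_sample(i))
    if len(chunk) == chunk_size:
        np.save(f"chunk_{chunk_id:05d}.npy", np.stack(chunk))   # np.stack assumes equal shapes
        chunk, chunk_id = [], chunk_id + 1
if chunk:
    np.save(f"chunk_{chunk_id:05d}.npy", np.stack(chunk))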
more like 500k files actually
Could be the answer is just move outside of /workspace lolol
actually with that many files on network drive, ull actually fill it up
yeah maybe i just stuff it in the docker image
*Prob not stuff it in the docker image, cause u cant build one that big, but, u can keep it on network storage, and zip it and copy it over
to outside /workspace
but i cant both have a network volume AND lots of space in /, can i
U can
Ive had like a 400GB container storage + a network storage
before
I had a 250GB dataset, and I wanted to unzip it, so I made a 400GB container storage to move stuff onto by just moving my file under /workspace over to /container
ah the container disk option ?
Yea
Container disk is stuff that will be reset on the start/stop of the pod
or essentially outside the /workspace
oke that's the solution then
But it is directly on the computer too
i spawn with big container, move the data there
Yea
and then optimize from there
in /scratch or something
work there, then move it out
Can probably try on a smaller dataset first
before u go moving the full 200GB
u can also, if u do end up moving it, try to chunk and parallelize the copying over if possible
so u arent just synchronously moving 200GB over
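a rough sketch of the parallel copy with a thread pool (the /workspace/chunks and /scratch paths are just placeholders):
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# rough sketch: copy chunk files from the network volume to container disk in parallel
src = Path("/workspace/chunks")
dst = Path("/scratch")
dst.mkdir(parents=True, exist_ok=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    for f in src.glob("*.npy"):
        pool.submit(shutil.copy2, f, dst / f.name)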
yeap
although that move SHOULD be fast
Guess two avenues then as needed:
Option 1:
Container Disk > Batching > Producer/Consumer pattern
Option 2:
Container Disk > file system files > randomly select
gl gl hopefully works out
Just as an FYI, if u have a bunch of mini files in ur network storage
it can actually eat up network storage space
(so if u ever wonder about creating 100K files in network storage)
network storage has a weird block thing (mentioned in a different thread) where there's some "minimum file size" per file, forgot the exact number, so even if u have lets say 100GB worth of tiny files, if they are below that block size each one eats up a minimum amount of space, meaning u could be eating up 200GB of network storage
(forgot where i read it, but one of the staff said it when ppl complained about network storage eating up more space than the files should)
*Not an issue on container disk tho / if it isnt a network drive
Thanks a lot !!
So yeah, i'm running it now with the data in / and it's going fast so that's the solution
Nice
it's still a little slower on the A100 than on my 3090 for a small model, is that to be expected?
try to double check which GPU the pod actually got
just make sure it's correct haha
sanity check
not sure tho
hehe A100 80GB PCIe
ok yeh, looks good, just i saw a bug yesterday where someone got assigned the wrong one, seemed like a one-off but now im careful if i see an unexpected performance drop lol
but cant say tbh haha. maybe on a small model a100 isnt as efficient utilizing all the power
btw are u fine tuning or training?
but gl gl
maybe u can get away with a weaker gpu haha who knows tho
looks like 3090 has more 'cores'
it makes some sense, one is faster and the other is bigger
I'm pretraining now
Depends on cloud type, region, etc etc
Also there are no cores, its vcpu and a vcpu is a thread not a core
so basically 2 vcpu is the equivalent of 1 cpu core