C#12mo ago
jotalanusse

Processing files as fast as possible

Hi everyone! I have a question that I’m sure you can help me with. I’m making a library that needs to parse files as fast as possible. The files in question are binary files containing a sort of protobuf, but not quite protobuf. When parsing a file I only go forward through the file once, and the data is not modified at all. I want to know what the fastest way to go through the file is. At the moment I’m using MemoryStreams, but when I want to process a part of what I just read I have to allocate a new byte[] and copy the data to it, and I think this can be avoided. 99% of the files I handle won’t ever exceed 150MB, maybe 200MB, so memory isn’t a problem. But the less memory I need to allocate apart from the file I’m reading, the better. The ideal scenario would be for me to only have one copy of the data in memory, and then be able to reference said data by creating read-only slices to process it. Is this even possible? Is there a performance overhead to using MemoryStream instead of byte[]?
27 Replies
Angius
Angius12mo ago
Maybe you could peek into the MemoryStream with Memory<T> or Span<T> to avoid allocating a new byte[]?
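For illustration, a minimal sketch of the kind of "peek" being suggested (the helper name is made up; it assumes the MemoryStream exposes its buffer, which the default expandable MemoryStream does, but one wrapped around a caller-supplied byte[] only does if publiclyVisible was true):
```cs
using System;
using System.IO;

static class MemoryStreamPeek
{
    // Zero-copy view over a MemoryStream's backing buffer, when the stream allows it.
    public static ReadOnlyMemory<byte> PeekAll(MemoryStream stream)
    {
        if (stream.TryGetBuffer(out ArraySegment<byte> buffer))
            return buffer;        // implicit ArraySegment -> ReadOnlyMemory, no copy

        return stream.ToArray();  // fallback copies, which defeats the purpose
    }
}
```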
jotalanusse
jotalanusse12mo ago
What do you mean by peek? I know that I could use Memory<byte> or ReadOnlyMemory<byte> as a way of storing the file data, so I can create references to it without copying it. But can this be done as well with MemoryStreams?
Angius
Angius12mo ago
I'm not sure if it can be done directly on MemoryStreams, but Memory and Span are the only ways to take a peek into the, well, memory without allocating a new byte[] or anything else. At least as far as I know. You can always try asking in #advanced, they might have some ways of utilizing unsafe blocks and what have you
jotalanusse
jotalanusse12mo ago
Okay, I'll look further into what you said, thanks! I'm going to ask for some help on #advanced as well then
JenyaRostov
JenyaRostov12mo ago
you can either load the entire file with File.ReadAllBytes and then process it one chunk at a time with Memory<T>, or you can use FileStream and read n bytes at a time using the Read method with the Span<byte> overload. That span can be allocated on the stack with stackalloc if you want to; that's practically free to do with a constant size parameter
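A rough sketch of both options (the file name, chunk size and slice offsets are placeholder values):
```cs
using System;
using System.IO;

// Option 1: load the whole file once, then slice it with ReadOnlyMemory<byte>
// (the slices are views into the same array, not copies).
byte[] bytes = File.ReadAllBytes("data.bin");
ReadOnlyMemory<byte> all = bytes;
ReadOnlyMemory<byte> header = all.Slice(0, 16);
ReadOnlyMemory<byte> payload = all.Slice(16);

// Option 2: stream the file and read fixed-size chunks into a stackalloc'd span.
using var file = File.OpenRead("data.bin");
Span<byte> chunk = stackalloc byte[4096];   // constant size, lives on the stack
int read;
while ((read = file.Read(chunk)) > 0)
{
    ReadOnlySpan<byte> current = chunk[..read];
    // parse `current` here
}
```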
viceroypenguin
viceroypenguin12mo ago
Generally you don't need MemoryStream at all. Just use the original FileStream, and Read()/ReadAsync() into a buffer that you get from ArrayPool.Shared or a MemoryPool. Once you've read it, it's yours to parse as you will. I would avoid stackalloc for this type of read, due to the size. stackalloc is best for <1KB arrays, and the best size for reading from FileStream varies between 4KB and 64KB.
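Something along these lines is presumably what's meant (the 64KB buffer size and the method name are assumptions):
```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

static class PooledReader
{
    public static async Task ParseFileAsync(string path)
    {
        // Rent once; the pool will hand the same buffer back on later rents.
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
        try
        {
            await using var file = File.OpenRead(path);
            int read;
            while ((read = await file.ReadAsync(buffer.AsMemory(0, 64 * 1024))) > 0)
            {
                // Only the first `read` bytes are valid; the rented array may be larger.
                ReadOnlyMemory<byte> data = buffer.AsMemory(0, read);
                // parse `data` here
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
    }
}
```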
JenyaRostov
JenyaRostov12mo ago
icic
jotalanusse
jotalanusse12mo ago
I'm not familiar with ArrayPool.Shared, but doesn't it allocate new memory every time I request an array?
viceroypenguin
viceroypenguin12mo ago
Depends on how you use it. If you ask for 256MB arrays each time, then yes. If you ask for a 64KB array, then no - it will keep the reference around for a while, so when you ask for it again you'll get the same one back. The biggest question I guess is whether you need the entire file in memory at once or not.
jotalanusse
jotalanusse12mo ago
But wouldn't using Memory<byte> be better as it allows me to have just one copy of the data at all times?
viceroypenguin
viceroypenguin12mo ago
if you need the entire file in memory, then yeah, just File.ReadAllBytes(Async)(). That's... orthogonal
jotalanusse
jotalanusse12mo ago
?
viceroypenguin
viceroypenguin12mo ago
regardless of how you get the array, if you keep it for the length of processing, then you'll have "just one copy of the data at all times"
jotalanusse
jotalanusse12mo ago
I benefit more from having a 100MB file in memory than from having to wait for the I/O every time I do FileStream.Read()
viceroypenguin
viceroypenguin12mo ago
eh... $itdepends
viceroypenguin
viceroypenguin12mo ago
I've seen many algorithms improve by kicking off a read, and doing some processing while the read is executing in the background. Again, depending on how much you need at a time
jotalanusse
jotalanusse12mo ago
Pretty small, the chunks of data I read are on average 10KB (I have not tested this, it's just a guess), but I know I never exceed 100KB, for example
viceroypenguin
viceroypenguin12mo ago
ok, then I would, as a way to test, rent a 128kb array. await reading the first 128kb. then start the task to read the second 128kb. while the task is running, process the 128kb. when you get done processing it, await the reading task, and repeat, until you get to the end. then you're reading in parallel to processing.
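A hedged sketch of that read-ahead pattern, using two rented buffers so one can be filled while the other is processed (the chunk size and the `process` callback are placeholders, and error handling is omitted):
```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

static class ReadAhead
{
    public static async Task ReadAheadAsync(string path, Action<ReadOnlyMemory<byte>> process)
    {
        const int ChunkSize = 128 * 1024;
        byte[] current = ArrayPool<byte>.Shared.Rent(ChunkSize);
        byte[] next = ArrayPool<byte>.Shared.Rent(ChunkSize);

        await using var file = new FileStream(path, FileMode.Open, FileAccess.Read,
            FileShare.Read, bufferSize: 4096, FileOptions.Asynchronous);

        int read = await file.ReadAsync(current.AsMemory(0, ChunkSize));
        while (read > 0)
        {
            // Kick off the next read before touching the current chunk.
            ValueTask<int> pending = file.ReadAsync(next.AsMemory(0, ChunkSize));

            process(current.AsMemory(0, read));   // runs while the read is in flight

            read = await pending;
            (current, next) = (next, current);    // swap buffers and repeat
        }

        ArrayPool<byte>.Shared.Return(current);
        ArrayPool<byte>.Shared.Return(next);
    }
}
```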
jotalanusse
jotalanusse12mo ago
From what I have been testing, right now using FileStream to directly read the data is out of the question; it is by far the slowest method. My problem is that arrays are constantly being sub-divided into smaller ones
viceroypenguin
viceroypenguin12mo ago
oh, then that's your problem. Once you have the array, create an ArraySegment from it. The ArraySegment is just a pointer into the array, but it can be divided up without allocating new smaller arrays.
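For example (the offsets are made up; `ReadOnlyMemory<byte>` slicing gives the same zero-copy behaviour):
```cs
using System;

// One big backing array, sliced without allocating sub-arrays.
byte[] buffer = new byte[100];
var whole = new ArraySegment<byte>(buffer);

ArraySegment<byte> header = whole.Slice(0, 16);   // points into `buffer`
ArraySegment<byte> body   = whole.Slice(16, 64);  // no new byte[] allocated

// ReadOnlyMemory<byte> is an equivalent zero-copy view and is often more convenient:
ReadOnlyMemory<byte> bodyView = buffer.AsMemory(16, 64);
```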
jotalanusse
jotalanusse12mo ago
Got it, well the only thing left to do is to try all these things and find the one that best fits my needs. Thanks!
cap5lut
cap5lut12mo ago
hey, i know im coming late to the party, but what would u think about something like this? https://paste.mod.gg/zwkfpbybcqil/0
its just a draft implementation, so stuff like clean-up and alike arent shown yet (its also currently using spin waits).
the basic idea is to rent an array that can fit the whole file, then wrap it in a memory stream and start copying from file to memory stream asynchronously without awaiting it.
as core functionality u have 2 methods: one that waits until N bytes are available, and another to get a span and advance the internal position.
i added a sync and an async example usage method at the end, which will read a length-prefixed byte array and read it as a utf8 encoded string.
the file-size-backed buffer array is mainly to avoid having to care about (i think its called) tearing the data
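The linked paste isn't reproduced here, but a very rough sketch of the idea as described above (the type and method names are invented, error handling and cleanup are omitted, and it keeps the crude spin-wait mentioned) might look like:
```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// One array big enough for the whole file is rented, a background copy fills it,
// and the parser reads zero-copy slices out of the already-filled prefix.
sealed class FileBackedBuffer
{
    private readonly byte[] _buffer;
    private volatile int _written;   // bytes the background copy has filled so far
    private int _position;           // parser's current read position

    public FileBackedBuffer(string path)
    {
        long length = new FileInfo(path).Length;         // assumes file fits in an int
        _buffer = ArrayPool<byte>.Shared.Rent((int)length);

        // Fire-and-forget background fill.
        _ = Task.Run(async () =>
        {
            await using var file = File.OpenRead(path);
            int read;
            while ((read = await file.ReadAsync(_buffer.AsMemory(_written))) > 0)
                _written += read;
        });
    }

    // Wait (here: crude spin) until `count` more bytes are available.
    public async Task WaitForBytesAsync(int count)
    {
        while (_written - _position < count)
            await Task.Yield();
    }

    // Hand out a zero-copy view and advance the read position.
    public ReadOnlyMemory<byte> Read(int count)
    {
        var slice = _buffer.AsMemory(_position, count);
        _position += count;
        return slice;
    }
}
```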
jotalanusse
jotalanusse12mo ago
Not late at all! Thank you very much for the example implementation, I will check it out
viceroypenguin
viceroypenguin12mo ago
I'd be concerned with await Task.Yield();, as in a low-contention environment this could essentially just turn into a CPU busy-loop, as the task scheduler just reschedules the current task while the reading task is doing its thing. If length is large, this could take some time.
cap5lut
cap5lut12mo ago
yeah, i would add some kind of behaviour for how to await/wait to have some more fine-grained control, or maybe even use events, but for that ya would have to wrap the stream again, which is sorta outside this little draft
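One possible shape for that, purely as an assumption about how the draft could evolve: have the background copy signal waiters through a SemaphoreSlim instead of having them spin on Task.Yield() (type and method names are made up):
```cs
using System.Threading;
using System.Threading.Tasks;

// The writer releases the semaphore after every write; waiters suspend
// asynchronously instead of rescheduling themselves in a loop.
sealed class ByteSignal
{
    private readonly SemaphoreSlim _dataWritten = new(0);
    private volatile int _written;

    public void OnBytesWritten(int count)
    {
        _written += count;
        _dataWritten.Release();               // wake anyone waiting for more data
    }

    public async Task WaitForBytesAsync(int position, int count)
    {
        while (_written - position < count)
            await _dataWritten.WaitAsync();   // suspend until the writer signals again
    }
}
```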
Mayor McCheese
Mayor McCheese12mo ago
Is there a data interleaving concern with leasing arrays, where doing things "wrong" works okay but at scale produces interleaving problems? Certain performance Python libs have interleaving issues at the higher end of scale.