Processing files as fast as possible
Hi everyone! I have a question that I’m sure you can help me with. I’m making a library that needs to parse files as fast as possible. The files in question are binary files containing a sort of protobuf but not quite protobuf.
When parsing a file I only go forward through the file one time, and the data is not modified at all.
I want to know what is the fastest way to go through the file.
At the moment I’m using MemoryStreams, but when I want to process a part of what I just read I have to allocate a new byte[] and copy the data into it, which I think can be avoided.
99% of the files I handle won't ever exceed 150MB, maybe 200MB, so memory isn’t a problem. But the less memory I need to allocate apart from the file I’m reading, the better. The ideal scenario would be to have only one copy of the data in memory, and then be able to reference that data by creating read-only slices to process it.
Is this even possible? Is there a performance overhead to using MemoryStream instead of byte[]?
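A minimal sketch of that "one copy, read-only slices" idea, assuming the whole file fits in a single byte[]; the file name and the 16-byte header split are invented for illustration:

```cs
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // one copy of the data: the array returned by ReadAllBytes
        byte[] data = File.ReadAllBytes("input.bin");

        // read-only views over that same array; Slice never copies bytes
        ReadOnlyMemory<byte> whole = data;
        ReadOnlyMemory<byte> header = whole.Slice(0, 16);   // hypothetical 16-byte header
        ReadOnlyMemory<byte> payload = whole.Slice(16);     // everything after it

        Console.WriteLine($"{header.Length} header bytes, {payload.Length} payload bytes");
    }
}
```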
Maybe you could peek into the MemoryStream with Memory<T> or Span<T> to avoid allocating a new byte[]?
What do you mean by peek? I know that I could use Memory<byte> or ReadOnlyMemory<byte> as a way of storing the file data, so I can create references to it without copying it. But can this be done as well with MemoryStreams?
I'm not sure if it can be done directly on MemoryStreams, but Memory and Span are the only ways to take a peek into the, well, memory without allocating a new byte[] or anything else.
At least as far as I know.
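A rough sketch of what that "peek" could look like, assuming the MemoryStream was created so that it exposes its internal buffer (TryGetBuffer returns false otherwise):

```cs
using System;
using System.IO;

class Peek
{
    static void Main()
    {
        // a MemoryStream created with the default constructor (or with
        // publiclyVisible: true) lets you get at its internal buffer
        var ms = new MemoryStream();
        ms.Write(new byte[] { 1, 2, 3, 4, 5 }, 0, 5);

        if (ms.TryGetBuffer(out ArraySegment<byte> seg))
        {
            // a read-only view over the stream's own buffer; nothing is copied
            ReadOnlySpan<byte> span = seg.AsSpan(0, (int)ms.Length);
            Console.WriteLine(span[2]); // 3
        }
    }
}
```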
You can always try asking in #advanced, they might have some ways of utilizing unsafe blocks and what have you
Okay, I'll look further into what you said, thanks! I'm going to ask for some help on #advanced as well then
you can either load the entire file with File.ReadAllBytes and then process it one chunk at a time with Memory<T>
or you can use FileStream and read n bytes at a time using the Read method with the Span<byte> overload
that span can be allocated on the stack with stackalloc if you want to; that's practically free to do with a constant size parameter
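A sketch of that second option, reading constant-size chunks into a stackalloc'd span; the 256-byte chunk size, the file name, and the Process method are placeholders:

```cs
using System;
using System.IO;

class ChunkedRead
{
    static void Main()
    {
        using var fs = File.OpenRead("input.bin"); // file name is just for illustration

        // constant-size scratch space on the stack; nothing is heap-allocated here
        Span<byte> chunk = stackalloc byte[256];

        int read;
        while ((read = fs.Read(chunk)) > 0)
        {
            // only the first `read` bytes are valid on each iteration
            Process(chunk.Slice(0, read));
        }
    }

    static void Process(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```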
Generally you don't need MemoryStream at all. Just use the original FileStream, and Read()/ReadAsync() into a buffer that you get from ArrayPool.Shared or a MemoryPool. Once you've read it, it's yours to parse as you will.
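Roughly what that pattern could look like, assuming a 64KB buffer rented from the shared pool (buffer size, file name, and the Parse method are placeholders):

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

class PooledRead
{
    static async Task Main()
    {
        // rent from the shared pool; the returned array may be larger than requested
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
        try
        {
            using var fs = File.OpenRead("input.bin");
            int read;
            while ((read = await fs.ReadAsync(buffer.AsMemory(0, 64 * 1024))) > 0)
            {
                Parse(buffer.AsSpan(0, read)); // parse the bytes you just read
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer); // give the buffer back to the pool
        }
    }

    static void Parse(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```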
I would avoid stackalloc for this type of read, due to the size. stackalloc is best for <1KB arrays, and the best size for reading from FileStream varies between 4KB and 64KB.
icic
I'm not familiar with ArrayPool.Shared, but doesn't it allocate new memory every time I pass an array?
Depends on how you use it. If you ask for 256MB arrays each time, then yes. If you ask for a 64KB array, then no - it will keep the reference around for a while, so when you ask for it again you'll get the same one back.
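A tiny illustration of that reuse; with the shared pool the second Rent usually (though it isn't guaranteed to) hands back the very same array:

```cs
using System;
using System.Buffers;

class PoolReuse
{
    static void Main()
    {
        byte[] first = ArrayPool<byte>.Shared.Rent(64 * 1024);
        ArrayPool<byte>.Shared.Return(first);

        byte[] second = ArrayPool<byte>.Shared.Rent(64 * 1024);
        // usually prints True: the pool handed the same buffer back instead of allocating
        Console.WriteLine(ReferenceEquals(first, second));
        ArrayPool<byte>.Shared.Return(second);
    }
}
```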
The biggest question I guess is whether you need the entire file in memory at once or not.
But wouldn't using Memory<byte> be better, as it allows me to have just one copy of the data at all times?
if you need the entire file in memory, then yeah, just File.ReadAllBytes(Async)()
that's... orthogonal?
regardless of how you get the array, if you keep it for the length of processing, then you'll have "just one copy of the data at all times"
I benefit more by having a 100MB file in memory than having to wait for the I/O when I decide to do FileStream.Read()
eh... it depends
I've seen many algorithms improve by kicking off a read and doing some processing while the read is executing in the background.
again, depending on how much you need at a time
Pretty small, the chunks of data I read are on average 10KB (I have not tested this, it's just a guess)
But I know I never exceed 100KB, for example
ok, then I would, as a way to test, rent a 128KB array. await reading the first 128KB, then start the task to read the second 128KB. while the task is running, process the 128KB you already have. when you get done processing it, await the reading task, and repeat until you get to the end.
then you're reading in parallel to processing.
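One way that read-ahead could be wired up, sketched with two rented 128KB buffers swapped each iteration (two buffers are needed so the background read never overwrites the chunk being processed; the file name and Process method are placeholders):

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

class ReadAhead
{
    const int ChunkSize = 128 * 1024;

    static async Task Main()
    {
        byte[] current = ArrayPool<byte>.Shared.Rent(ChunkSize);
        byte[] next = ArrayPool<byte>.Shared.Rent(ChunkSize);
        try
        {
            using var fs = File.OpenRead("input.bin"); // placeholder file name

            // read the first chunk up front
            int currentLength = await fs.ReadAsync(current.AsMemory(0, ChunkSize));
            while (currentLength > 0)
            {
                // kick off the read of the next chunk, but don't await it yet
                ValueTask<int> pending = fs.ReadAsync(next.AsMemory(0, ChunkSize));

                Process(current.AsSpan(0, currentLength)); // runs while the read is in flight

                currentLength = await pending;
                (current, next) = (next, current); // swap buffers for the next iteration
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(current);
            ArrayPool<byte>.Shared.Return(next);
        }
    }

    static void Process(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```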
For what I have been testing, right now using FileStream to directly read the data is out of the question; it is by far the slowest method
My problem is that arrays are constantly being sub-divided into smaller ones
oh, then that's your problem. once you have the array, create an ArraySegment from it.
the ArraySegment is just a pointer into the array, but it can be divided up without allocating new smaller arrays.
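A small sketch of that, with made-up offsets; slicing an ArraySegment only creates a new struct over the same underlying array:

```cs
using System;

class Segments
{
    static void Main()
    {
        byte[] data = new byte[100_000]; // imagine this is the file contents

        // a view over part of the array; no bytes are copied
        var record = new ArraySegment<byte>(data, offset: 1_000, count: 10_000);

        // sub-dividing the view still doesn't allocate a new array
        ArraySegment<byte> field = record.Slice(16, 4);

        Console.WriteLine(field.Count); // 4
    }
}
```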
Got it, well the only thing left to do is to try all these things and find the one that best fits my needs. Thanks!
hey, I know I'm coming late to the party, but what would you think about something like this?
https://paste.mod.gg/zwkfpbybcqil/0
it's just a draft implementation, so stuff like clean-up and the like isn't shown yet (it's also currently using spin waits)
the basic idea is to rent an array that can fit the whole file, then wrap it in a memory stream and start copying from the file to the memory stream asynchronously without awaiting it
as core functionality you have 2 methods: one that waits until N bytes are available, and another that gets a span and advances the internal position
I added a sync and an async example usage method at the end, which will read a length-prefixed byte array and decode it as a UTF-8 string
the file-sized backing buffer array is mainly to avoid having to care about (I think it's called) tearing the data
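Since the paste itself isn't reproduced here, a rough sketch of the shape being described, with invented names (PipelinedFileReader etc.), copying straight into the rented array rather than through a MemoryStream wrapper; like the draft, it spin-waits and skips EOF handling and clean-up details:

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class PipelinedFileReader : IDisposable
{
    private readonly byte[] _buffer;   // rented, large enough for the whole file
    private int _written;              // bytes copied in by the background task
    private int _position;             // bytes already handed out to the consumer
    private readonly Task _copyTask;

    public PipelinedFileReader(string path)
    {
        var length = (int)new FileInfo(path).Length;   // files here are well under 2GB
        _buffer = ArrayPool<byte>.Shared.Rent(length);

        // start copying from the file into the buffer without awaiting it
        _copyTask = Task.Run(async () =>
        {
            using var fs = File.OpenRead(path);
            int read, total = 0;
            while ((read = await fs.ReadAsync(_buffer.AsMemory(total, length - total))) > 0)
            {
                total += read;
                Volatile.Write(ref _written, total); // publish progress to the consumer
            }
        });
    }

    // spin (as in the draft) until at least `count` unread bytes are available
    public void WaitForAvailable(int count)
    {
        var spin = new SpinWait();
        while (Volatile.Read(ref _written) - _position < count)
            spin.SpinOnce();
    }

    // hand out a read-only view and advance the internal position; no copy is made
    public ReadOnlySpan<byte> ReadSpan(int count)
    {
        WaitForAvailable(count);
        var span = new ReadOnlySpan<byte>(_buffer, _position, count);
        _position += count;
        return span;
    }

    public void Dispose()
    {
        _copyTask.Wait();
        ArrayPool<byte>.Shared.Return(_buffer);
    }
}
```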
Not late at all! Thank you very much for the example implementation, I will check it out
I'd be concerned with await Task.Yield();, as in a low-contention environment this could essentially just turn into a CPU busy-loop, as the task scheduler just reschedules the current task while the reading task is doing its thing. if length is large, this could take some time.
yeah, I would add some kind of behaviour for how to await/wait, to have some more fine-grained control
or maybe even with events, but for that you would have to wrap the stream again, which is sorta outside this little draft
Is there a data interleaving concern with leasing arrays where doing things "wrong" works okay, but at scale produces interleaving problems? Certain performance-oriented Python libs have interleaving issues at the higher end of scale.