Processing files as fast as possible
Hi everyone! I have a question that I’m sure you can help me with. I’m making a library that needs to parse files as fast as possible. The files in question are binary files containing a sort of protobuf but not quite protobuf.
When parsing a file I only go forward through the file one time, and the data is not modified at all.
I want to know what is the fastest way to go through the file.
At the moment I’m using MemoryStreams, but when I want to process a part of what I just read I have to allocate a new byte[] and copy the data into it, which I think can be avoided.
99% of the files I handle won't ever exceed 150MB, maybe 200MB, so memory isn’t a problem. But the less memory I need to allocate apart from the file I’m reading, the better. The ideal scenario would be to have only one copy of the data in memory, and then be able to reference that data by creating read-only slices to process it.
Is this even possible? Is there a performance overhead to using MemoryStream instead of byte[]?
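A minimal sketch of that "one copy, read-only slices" idea, assuming the whole file fits in a single byte[]; the file name and the 16-byte header split are invented for illustration:

```cs
using System;
using System.IO;

class Program
{
    static void Main()
    {
        // one copy of the data: the array returned by ReadAllBytes
        byte[] data = File.ReadAllBytes("input.bin");

        // read-only views over that same array; Slice never copies bytes
        ReadOnlyMemory<byte> whole = data;
        ReadOnlyMemory<byte> header = whole.Slice(0, 16);   // hypothetical 16-byte header
        ReadOnlyMemory<byte> payload = whole.Slice(16);     // everything after it

        Console.WriteLine($"{header.Length} header bytes, {payload.Length} payload bytes");
    }
}
```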
Maybe you could peek into the MemoryStream with Memory<T> or Span<T> to avoid allocating a new byte[]?
What do you mean by peek? I know that I could use Memory<byte> or ReadOnlyMemory<byte> as a way of storing the file data, so I can create references to it without copying it. But can this be done as well with MemoryStreams?
I'm not sure if it can be done directly on MemoryStreams, but Memory and Span are the only ways to take a peek into the, well, memory without allocating a new byte[] or anything else.
At least as far as I know.
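A rough sketch of what that "peek" could look like, assuming the MemoryStream was created so that it exposes its internal buffer (TryGetBuffer returns false otherwise):

```cs
using System;
using System.IO;

class Peek
{
    static void Main()
    {
        // a MemoryStream created with the default constructor (or with
        // publiclyVisible: true) lets you get at its internal buffer
        var ms = new MemoryStream();
        ms.Write(new byte[] { 1, 2, 3, 4, 5 }, 0, 5);

        if (ms.TryGetBuffer(out ArraySegment<byte> seg))
        {
            // a read-only view over the stream's own buffer; nothing is copied
            ReadOnlySpan<byte> span = seg.AsSpan(0, (int)ms.Length);
            Console.WriteLine(span[2]); // 3
        }
    }
}
```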
You can always try asking in #advanced, they might have some ways of utilizing unsafe blocks and what have you
Okay, I'll look further into what you said, thanks! I'm going to ask for some help on #advanced as well then
you can either load the entire file with File.ReadAllBytes and then process it one chunk at a time with Memory<T>
or you can use FileStream and read n bytes at a time using the Read method with the Span<byte> overload
that span can be allocated on the stack with stackalloc if you want to; that's practically free to do with a constant size parameter
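A sketch of that second option, reading constant-size chunks into a stackalloc'd span; the 256-byte chunk size, the file name, and the Process method are placeholders:

```cs
using System;
using System.IO;

class ChunkedRead
{
    static void Main()
    {
        using var fs = File.OpenRead("input.bin"); // file name is just for illustration

        // constant-size scratch space on the stack; nothing is heap-allocated here
        Span<byte> chunk = stackalloc byte[256];

        int read;
        while ((read = fs.Read(chunk)) > 0)
        {
            // only the first `read` bytes are valid on each iteration
            Process(chunk.Slice(0, read));
        }
    }

    static void Process(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```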
Generally you don't need MemoryStream at all. Just use the original FileStream, and Read()/ReadAsync() into a buffer that you get from ArrayPool.Shared or a MemoryPool. Once you've read it, it's yours to parse as you will.
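Roughly what that pattern could look like, assuming a 64KB buffer rented from the shared pool (buffer size, file name, and the Parse method are placeholders):

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

class PooledRead
{
    static async Task Main()
    {
        // rent from the shared pool; the returned array may be larger than requested
        byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
        try
        {
            using var fs = File.OpenRead("input.bin");
            int read;
            while ((read = await fs.ReadAsync(buffer.AsMemory(0, 64 * 1024))) > 0)
            {
                Parse(buffer.AsSpan(0, read)); // parse the bytes you just read
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer); // give the buffer back to the pool
        }
    }

    static void Parse(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```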
I would avoid stackalloc for this type of read, due to the size. stackalloc is best for <1KB arrays, and the best size for reading from FileStream varies between 4KB and 64KB.
icic
I'm not familiar with ArrayPool.Shared, but doesn't it allocate new memory every time I pass an array?
Depends on how you use it. If you ask for 256MB arrays each time, then yes. If you ask for a 64KB array, then no - it will keep the reference around for a while, so when you ask for it again you'll get the same one back.
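A tiny illustration of that reuse; with the shared pool the second Rent usually (though it isn't guaranteed to) hands back the very same array:

```cs
using System;
using System.Buffers;

class PoolReuse
{
    static void Main()
    {
        byte[] first = ArrayPool<byte>.Shared.Rent(64 * 1024);
        ArrayPool<byte>.Shared.Return(first);

        byte[] second = ArrayPool<byte>.Shared.Rent(64 * 1024);
        // usually prints True: the pool handed the same buffer back instead of allocating
        Console.WriteLine(ReferenceEquals(first, second));
        ArrayPool<byte>.Shared.Return(second);
    }
}
```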
The biggest question I guess is whether you need the entire file in memory at once or not.
But wouldn't using Memory<byte> be better, as it allows me to have just one copy of the data at all times?
if you need the entire file in memory, then yeah, just File.ReadAllBytes(Async)()
that's... orthogonal?
regardless of how you get the array, if you keep it for the length of processing, then you'll have "just one copy of the data at all times"
I benefit more by having a 100MB file in memory than having to wait for the I/O when I decide to do FileStream.Read()
eh... it depends
I've seen many algorithms improve by kicking off a read and doing some processing while the read is executing in the background.
again, depending on how much you need at a time
Pretty small, the chunks of data I read are on average 10KB (I have not tested this, it's just a guess)
But I know I never exceed 100KB, for example
ok, then I would, as a way to test, rent a 128KB array. await reading the first 128KB, then start the task to read the second 128KB. while the task is running, process the 128KB you already have. when you get done processing it, await the reading task, and repeat until you get to the end.
then you're reading in parallel to processing.
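One way that read-ahead could be wired up, sketched with two rented 128KB buffers swapped each iteration (two buffers are needed so the background read never overwrites the chunk being processed; the file name and Process method are placeholders):

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

class ReadAhead
{
    const int ChunkSize = 128 * 1024;

    static async Task Main()
    {
        byte[] current = ArrayPool<byte>.Shared.Rent(ChunkSize);
        byte[] next = ArrayPool<byte>.Shared.Rent(ChunkSize);
        try
        {
            using var fs = File.OpenRead("input.bin"); // placeholder file name

            // read the first chunk up front
            int currentLength = await fs.ReadAsync(current.AsMemory(0, ChunkSize));
            while (currentLength > 0)
            {
                // kick off the read of the next chunk, but don't await it yet
                ValueTask<int> pending = fs.ReadAsync(next.AsMemory(0, ChunkSize));

                Process(current.AsSpan(0, currentLength)); // runs while the read is in flight

                currentLength = await pending;
                (current, next) = (next, current); // swap buffers for the next iteration
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(current);
            ArrayPool<byte>.Shared.Return(next);
        }
    }

    static void Process(ReadOnlySpan<byte> bytes)
    {
        // placeholder for the real parsing logic
    }
}
```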
For what I have been testing, right now using FileStream to directly read the data is out of the question; it is by far the slowest method
My problem is that arrays are constantly being sub-divided into smaller ones
oh, then that's your problem. once you have the array, create an ArraySegment from it.
the ArraySegment is just a pointer into the array, but it can be divided up without allocating new smaller arrays.
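A small sketch of that, with made-up offsets; slicing an ArraySegment only creates a new struct over the same underlying array:

```cs
using System;

class Segments
{
    static void Main()
    {
        byte[] data = new byte[100_000]; // imagine this is the file contents

        // a view over part of the array; no bytes are copied
        var record = new ArraySegment<byte>(data, offset: 1_000, count: 10_000);

        // sub-dividing the view still doesn't allocate a new array
        ArraySegment<byte> field = record.Slice(16, 4);

        Console.WriteLine(field.Count); // 4
    }
}
```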
Got it, well the only thing left to do is to try all these things and find the one that best fits my needs. Thanks!
hey, I know I'm coming late to the party, but what would you think about something like this?
https://paste.mod.gg/zwkfpbybcqil/0
it's just a draft implementation, so stuff like clean-up and the like isn't shown yet (it's also currently using spin waits)
the basic idea is to rent an array that can fit the whole file, then wrap it in a memory stream and start copying from the file to the memory stream asynchronously without awaiting it
as core functionality you have 2 methods: one that waits until N bytes are available, and another that gets a span and advances the internal position
I added a sync and an async example usage method at the end, which will read a length-prefixed byte array and decode it as a UTF-8 string
the file-sized backing buffer array is mainly to avoid having to care about (I think it's called) tearing the data
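Since the paste itself isn't reproduced here, a rough sketch of the shape being described, with invented names (PipelinedFileReader etc.), copying straight into the rented array rather than through a MemoryStream wrapper; like the draft, it spin-waits and skips EOF handling and clean-up details:

```cs
using System;
using System.Buffers;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class PipelinedFileReader : IDisposable
{
    private readonly byte[] _buffer;   // rented, large enough for the whole file
    private int _written;              // bytes copied in by the background task
    private int _position;             // bytes already handed out to the consumer
    private readonly Task _copyTask;

    public PipelinedFileReader(string path)
    {
        var length = (int)new FileInfo(path).Length;   // files here are well under 2GB
        _buffer = ArrayPool<byte>.Shared.Rent(length);

        // start copying from the file into the buffer without awaiting it
        _copyTask = Task.Run(async () =>
        {
            using var fs = File.OpenRead(path);
            int read, total = 0;
            while ((read = await fs.ReadAsync(_buffer.AsMemory(total, length - total))) > 0)
            {
                total += read;
                Volatile.Write(ref _written, total); // publish progress to the consumer
            }
        });
    }

    // spin (as in the draft) until at least `count` unread bytes are available
    public void WaitForAvailable(int count)
    {
        var spin = new SpinWait();
        while (Volatile.Read(ref _written) - _position < count)
            spin.SpinOnce();
    }

    // hand out a read-only view and advance the internal position; no copy is made
    public ReadOnlySpan<byte> ReadSpan(int count)
    {
        WaitForAvailable(count);
        var span = new ReadOnlySpan<byte>(_buffer, _position, count);
        _position += count;
        return span;
    }

    public void Dispose()
    {
        _copyTask.Wait();
        ArrayPool<byte>.Shared.Return(_buffer);
    }
}
```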
Not late at all! Thank you very much for the example implementation, I will check it out
I'd be concerned with await Task.Yield();, as in a low-contention environment this could essentially just turn into a CPU busy-loop, as the task scheduler just reschedules the current task while the reading task is doing its thing. if length is large, this could take some time.
yeah, I would add some kind of behaviour for how to await/wait, to have some more fine-grained control
or maybe even with events, but for that you would have to wrap the stream again, which is sorta outside this little draft
Is there a data interleaving concern with leasing arrays where doing things "wrong" works okay, but at scale produces interleaving problems? Certain performance-oriented Python libs have interleaving issues at the higher end of scale.