C
C#2mo ago
Jesse

Handling reading large files in C#

How do I go about reducing memory usage to handle large files in x86 C#? My current code is as follows:
FileStream fs = new FileStream(filepath, FileMode.Open, FileAccess.Read);
using (BinaryReader br = new BinaryReader(fs))
{
    byte[] bytes = new byte[0];
    using (MemoryStream test = new MemoryStream())
    {
        fs.CopyTo(test);
        bytes = test.ToArray();
    }

    byte[] searchBytes = Encoding.UTF8.GetBytes("test");
    List<long> positions = new List<long>();

    foreach (long pos in Extensions.SearchStringInBytes(bytes, searchBytes))
    {
        positions.Add(pos - 4);
    }
}
When reading a large file (>500 MB), memory usage skyrockets to 2 GB. As a result it only works in an x64 build; the x86 build throws an OutOfMemoryException at around 1 GB of memory usage. I have thought of reading the file in "chunks", but I'm not sure how. Any other suggestions aside from making the program x64-only?
36 Replies
Pixel
Pixel2mo ago
So from my understanding: you get a file stream, copy it to a MemoryStream (2x memory usage), and then copy it to an array (3x usage). This isn't great. You should be able to read a chunk (there's a function for it; I don't mess around with streams that often, but you can choose how many bytes to read) of, say, 1024 bytes, store that, scan it, and then free the memory (by overwriting the byte[] that stores the chunk).
boiled goose
boiled goose2mo ago
yeah, it depends on how long the thing you are searching for is
Pixel
Pixel2mo ago
In fact, I believe you can just use Read(Span<Byte>): have a byte[] of size 1024 (or whatever chunk size), then call Read(byte_arr). byte_arr will then hold the bytes; check whether it has what you are looking for, and keep going. The only issue is when the text you are looking for falls in between 2 chunks, but that's an easy enough fix.
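A minimal sketch of the chunked search described above, including one way to handle a match that straddles two chunks. It is not code from the thread: the class and method names (`ChunkedSearch`, `FindAll`) and the 64 KB default chunk size are assumptions made for illustration. It uses the `Read(byte[], int, int)` overload, which exists on every .NET version, and carries the last `pattern.Length - 1` bytes of each chunk over to the next read.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ChunkedSearch
{
    // Hypothetical helper: returns the stream offset of every occurrence of
    // `pattern`, holding only one chunk (plus a small overlap) in memory.
    public static List<long> FindAll(Stream stream, byte[] pattern, int chunkSize = 64 * 1024)
    {
        if (pattern.Length == 0) throw new ArgumentException("Pattern must not be empty.", nameof(pattern));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var positions = new List<long>();
        int overlap = pattern.Length - 1;            // bytes carried between chunks
        byte[] buffer = new byte[chunkSize + overlap];
        int carried = 0;                             // bytes kept from the previous read
        long bufferStart = 0;                        // stream offset of buffer[0]

        while (true)
        {
            int read = stream.Read(buffer, carried, chunkSize);
            if (read == 0) break;                    // end of stream
            int valid = carried + read;

            // Scan every start index where the whole pattern fits in the buffer.
            for (int i = 0; i + pattern.Length <= valid; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) positions.Add(bufferStart + i);
            }

            // Keep the tail so a match split across two reads is still found.
            int keep = Math.Min(overlap, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            bufferStart += valid - keep;
            carried = keep;
        }
        return positions;
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("xxtestxxxtesttest");
        byte[] pattern = Encoding.UTF8.GetBytes("test");
        // A chunk size of 3 forces matches to span several reads.
        List<long> hits = FindAll(new MemoryStream(data), pattern, 3);
        Console.WriteLine(string.Join(", ", hits)); // prints 2, 9, 13
    }
}
```

For a real file, the same method could be called with a `FileStream`; memory use then stays near the chunk size regardless of file length.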
Jesse
Jesse2mo ago
@Pixel I don't think I'm understanding exactly what you mean. Also, you mentioned a function, but I have no idea which function.
Foxtrek_64
Foxtrek_642mo ago
I actually just solved this problem at work - let me grab my code
Foxtrek_64
Foxtrek_642mo ago
Ah yes, so the TL;DR is you want to open the file as a MemoryMappedFile
Memory-Mapped Files - .NET
Explore memory-mapped files in .NET, which contain file contents in virtual memory, and allow applications to modify the file by writing directly to the memory.
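A rough sketch of reading a file through a memory-mapped file in fixed-size windows, using only `System.IO.MemoryMappedFiles` from the base class library (not the Tafs.Activities package mentioned in this thread). The helper name `CountByte` and the 2 MB default window are assumptions for illustration. To keep it short it counts a single byte value; searching for a multi-byte pattern would also need overlap handling at the window boundaries.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedWindowDemo
{
    // Hypothetical helper: counts occurrences of `value` by mapping one
    // window of the file at a time instead of loading the whole file.
    public static long CountByte(string path, byte value, int windowSize = 2 * 1024 * 1024)
    {
        long fileLength = new FileInfo(path).Length;
        if (fileLength == 0) return 0;               // an empty file cannot be mapped

        long count = 0;
        byte[] window = new byte[windowSize];

        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileLength; offset += windowSize)
            {
                long size = Math.Min(windowSize, fileLength - offset);
                using (var view = mmf.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                {
                    // Copy the mapped window into a managed buffer and scan it.
                    int read = view.ReadArray(0, window, 0, (int)size);
                    for (int i = 0; i < read; i++)
                    {
                        if (window[i] == value) count++;
                    }
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { 1, 2, 1, 3, 1 });
            Console.WriteLine(CountByte(path, 1, 2)); // prints 3
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Each view only occupies address space for its own window, which is the property that matters in a 32-bit process.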
Foxtrek_64
Foxtrek_642mo ago
GitHub
Tafs.Activities/src/Tafs.Activities.FileChunks at main · TA-RPA/Taf...
A collection of activities and helpers for UiPath. - TA-RPA/Tafs.Activities
Foxtrek_64
Foxtrek_642mo ago
Includes a Chunk, a ChunkIterator, and a ReverseChunkIterator. I'll get this package published to NuGet since I realized it's not there yet. It will be available here in a few minutes: https://www.nuget.org/packages/Tafs.Activities.FileChunks/0.1.0 The way you'll use this is you init a new chunk iterator with the path to the file and the chunk length, then you can iterate through using a foreach loop or LINQ. Do note the chunk iterator is disposable, so do wrap it in a using block/statement.
Jesse
Jesse2mo ago
I only need a way to reduce memory usage on a ±500 MB file, that's all 😅 Also, your code is (A)GPL. I can't work with that, unfortunately.
Foxtrek_64
Foxtrek_642mo ago
It's been published. Also, I don't mind relicensing. LGPL is just the default for Remora projects.
Jesse
Jesse2mo ago
Doesn't work on .NET Framework 4.7.2, which is also an issue
Pixel
Pixel2mo ago
why are you using such old .NET?
Jesse
Jesse2mo ago
Backwards compatibility for other systems
Pixel
Pixel2mo ago
wym, you targeting Windows 2000?
Jesse
Jesse2mo ago
If it were up to me and I didn't have to care about backwards compatibility, I would've switched to an x64 build long ago
Foxtrek_64
Foxtrek_642mo ago
It targets netstandard2.1 so it should in theory be able to use that target. I can add an explicit net461 target if you need (that's the only one I have installed currently)
Jesse
Jesse2mo ago
at least windows 8.1
Foxtrek_64
Foxtrek_642mo ago
Actually, let me see when MemoryMappedFile became available before I offer that. Available since 4.0 flat, so that should be fine.
Jesse
Jesse2mo ago
[image attachment]
Jesse
Jesse2mo ago
yep
Foxtrek_64
Foxtrek_642mo ago
Does the MIT license work for you?
Jesse
Jesse2mo ago
Very very yes
Foxtrek_64
Foxtrek_642mo ago
Sounds good. I'll push a new version shortly with MIT + net461 support, which you should be able to use with net472.
Jesse
Jesse2mo ago
Cool
MarkPflug
MarkPflug2mo ago
Here is your original code annotated to explain why you are seeing so much memory usage:
C#
FileStream fs = new FileStream(filepath, FileMode.Open, FileAccess.Read);
// not sure why you are creating a BinaryReader, as you aren't using it...
using (BinaryReader br = new BinaryReader(fs))
{
    byte[] bytes = new byte[0];
    using (MemoryStream test = new MemoryStream())
    {
        // This will copy the entire file into the memory stream.
        // MemoryStream will dynamically grow, by increasing the internal buffer by 2x
        // every time space is exhausted. This means that you'll ultimately use about 2x the memory
        // when reading the entire file.
        fs.CopyTo(test);
        // This allocates a brand new array of exactly the right size and copies
        // the bytes from the memory stream's internal buffer to the new array.
        bytes = test.ToArray();
    }

    // The rest of this code is opaque to me, but the "positions" list could grow quite large
    // if there are a lot of matches

    byte[] searchBytes = Encoding.UTF8.GetBytes("test");
    List<long> positions = new List<long>();

    foreach (long pos in Extensions.SearchStringInBytes(bytes, searchBytes))
    {
        positions.Add(pos - 4);
    }
}
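The buffer growth described in the comments above can be observed directly. This is a small illustrative sketch, not part of the original answer; the helper name `CapacityAfterWriting` is made up. It writes data into a default-constructed `MemoryStream` and reports `Capacity`, the size of the internal buffer, which can exceed `Length`. `ToArray()` then allocates a further array of exactly `Length` bytes.

```csharp
using System;
using System.IO;

public static class MemoryStreamGrowthDemo
{
    // Hypothetical helper: writes `totalBytes` in 1000-byte pieces and returns
    // the internal buffer size the stream ended up with.
    public static int CapacityAfterWriting(int totalBytes)
    {
        byte[] piece = new byte[1000];
        using (var ms = new MemoryStream())
        {
            int written = 0;
            while (written < totalBytes)
            {
                int n = Math.Min(piece.Length, totalBytes - written);
                ms.Write(piece, 0, n);
                written += n;
            }

            // ToArray() makes a second, exact-size copy of the data.
            byte[] copy = ms.ToArray();
            Console.WriteLine("Length={0}, Capacity={1}, copy={2}",
                ms.Length, ms.Capacity, copy.Length);
            return ms.Capacity;
        }
    }

    public static void Main()
    {
        CapacityAfterWriting(5000);
    }
}
```

If the final size is known up front, passing it to the `MemoryStream(int capacity)` constructor avoids the repeated regrowth, though the `ToArray()` copy remains.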
Foxtrek_64
Foxtrek_642mo ago
With MemoryMappedFile, you get to control the memory usage. By default I have it set to 2 MB, but you can change that to whatever you want using the provided LengthsConstants or any long value representing the number of bytes. (Forgot to hit enter.)
MarkPflug
MarkPflug2mo ago
The API you probably want is:
byte[] bytes = File.ReadAllBytes(filepath);
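A sketch of how that single call could slot into the original search. The source of `Extensions.SearchStringInBytes` was never shown in this thread, so the `FindAll` method below is a hypothetical naive stand-in, and the `pos - 4` adjustment from the question is left out because its purpose is not explained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class ReadAllBytesSearch
{
    // Hypothetical stand-in for Extensions.SearchStringInBytes: yields the
    // index of every occurrence of `needle` in `haystack` (naive scan).
    public static IEnumerable<long> FindAll(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0) yield break;
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j]) j++;
            if (j == needle.Length) yield return i;
        }
    }

    public static List<long> FindInFile(string filepath, string text)
    {
        // One allocation of exactly the file's size; no MemoryStream, no extra copy.
        byte[] bytes = File.ReadAllBytes(filepath);
        byte[] searchBytes = Encoding.UTF8.GetBytes(text);
        return new List<long>(FindAll(bytes, searchBytes));
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "xxtestxxxtesttest", new UTF8Encoding(false));
            Console.WriteLine(string.Join(", ", FindInFile(path, "test"))); // prints 2, 9, 13
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

This still holds the whole file in memory once, so it suits files that comfortably fit in the process's address space.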
Jesse
Jesse2mo ago
yeah, that BinaryReader is there because of other code that is hidden here for readability
MarkPflug
MarkPflug2mo ago
I would stay away from the complexity of MemoryMappedFile, unless you expect the files to exceed 2 GB. Even then, you'd probably be better off adjusting your algorithm to use a streaming/buffered approach.
Jesse
Jesse2mo ago
Also, could you please add cs after the three backticks (```) in the message with the code? Edit: thank you. The file is not expected to be larger than 600 MB.
MarkPflug
MarkPflug2mo ago
How long will your "search string" typically be? Your example uses "test", is that expected to be representative?
Jesse
Jesse2mo ago
That works? I expected it not to work while a FileStream already has the file open, but surprise, surprise: it does. And yes, 4 characters. It works well now; it can run on x86 again.
MarkPflug
MarkPflug2mo ago
FileStream will open with FileShare.Read, so other file handles can be opened to read the file. If you try to write to it however, I'd expect that to fail while the FileStream is open.
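The sharing behaviour can be checked with a small experiment. This is an illustrative sketch with made-up names (`FileShareDemo`, `CanOpenWhileReading`): it holds a read-only `FileStream` open using the same constructor as the question, then tries to open a second handle. On Windows the second read handle is expected to succeed and a write handle to be refused with an `IOException`; other platforms emulate file sharing differently, so the write case may behave differently there.

```csharp
using System;
using System.IO;

public static class FileShareDemo
{
    // Hypothetical helper: reports whether a second handle with the given
    // access can be opened while a read-only FileStream holds the file.
    public static bool CanOpenWhileReading(string path, FileAccess secondAccess)
    {
        // This constructor overload uses FileShare.Read by default.
        using (var first = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            try
            {
                using (var second = new FileStream(path, FileMode.Open, secondAccess, FileShare.Read))
                {
                    return true;
                }
            }
            catch (IOException)
            {
                // Sharing violation: the first handle does not allow this access.
                return false;
            }
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            Console.WriteLine("Second reader allowed: " + CanOpenWhileReading(path, FileAccess.Read));
            Console.WriteLine("Writer allowed:        " + CanOpenWhileReading(path, FileAccess.Write));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```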
Jesse
Jesse2mo ago
:thumbsupsmiley: Thank you for the knowledge
Foxtrek_64
Foxtrek_642mo ago
I went ahead and pushed those changes, but I don't really have a good test platform for the older versions of .NET, so anyone who uses this package on netfx does so at their own risk.
Jesse
Jesse2mo ago
imo if Microsoft promises it works on Framework >4.0, it should