C#4w ago
Tracer

Serialize HashSet into a binary file

Hello. I want to serialize a HashSet into a binary file. I tried using BinaryFormatter, but it's obsolete. I don't want to save it as XML or JSON, but as pure bytes. How could I do it?
enum Test {
    A = 1,
    B = 2,
    C = 3
}

HashSet<Test> set = new HashSet<Test>();

set.Add(Test.A);
set.Add(Test.C);
set.Add(Test.B);

// How to do that?
Serialize(set, "myfile.bin");
128 Replies
FestivalDelGelato
you have to choose a binary format, bson for example
Tracer
TracerOP4w ago
What are the available formats?
FestivalDelGelato
That list is too long to know; there are too many. You could even write your own, or just zip a JSON file.
Tracer
TracerOP4w ago
what about just writing HashSet's data as bytes into a file?
FestivalDelGelato
you can take all the private fields of an object and write them in a file (and then do the reverse procedure when reading the file), but it's risky because if you screw up something then data is unreadable, and also if framework gets updated that data is unusable
Tracer
TracerOP4w ago
how would it be unreadable?
FestivalDelGelato
It's not human-readable, so good luck recovering the structure of the data if there's a bug in the algorithm that writes it. Consider that you have to write the address/name of each field, loops because there will be arrays, possibly type signatures if you used generics, and so on (nested types, whatever).
Tracer
TracerOP4w ago
the algorithm should not change, at least not in this specific case (write raw data to binary file). I come from C++ where serialization is as easy as printing stuff to the console, so I dont know all the quirks of C# here
FestivalDelGelato
if you want to use unsafe context, get a pointer, and then write the size of the object, i believe you can if you really wish to do so (to a limited extent) (keeping in mind this should not be used in a production environment, BinaryFormatter has been removed for a reason)
Tracer
TracerOP4w ago
Well, that won't be an issue, as I'm not writing any production code at the moment. But how could I get BinaryFormatter to actually work even though it's obsolete? Just disabling warnings doesn't work, as the debugger errors on the construction of the BinaryFormatter.
becquerel
becquerel4w ago
It might be helpful if you explain the root problem you are trying to solve here. Trying to get BinaryFormatter to work isn't a good idea; it's just not worth it. If you google 'binary serialization .net core nuget' you can find multiple packages which work on modern .NET. I believe my work uses some MessagePack library.
FestivalDelGelato
(simplest way would be compiling for .net 7, although i wouldn't call it a 'solution')
becquerel
becquerel4w ago
this is true, but not really the right approach morally
Tracer
TracerOP4w ago
My root problem is to serialize a HashSet, to view its raw bytes and compare them with another HashSet. Hey, as long as it works... By the way, why was BinaryFormatter even deprecated? What's so dangerous about it?
FestivalDelGelato
A bunch of reasons: security, for example, people shooting themselves in the foot, and compatibility between versions, I believe. Also, wouldn't it make more sense to compare JSON, for example? I don't know which data you have, but still...
Tracer
TracerOP4w ago
The HashSet im comparing my HashSet with is only available as raw bytes, so cant use json or xml here
FestivalDelGelato
so it's not structured data
Tracer
TracerOP4w ago
If by structured data you mean not raw bytes then yes, its not
SteveTheMadLad
For the security aspect, you might find this GH issue enlightening: https://github.com/dotnet/runtime/issues/98245 As for the problem at hand, if the HashSet contains primitive objects, you could think of using BinaryWriter this way:
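// someStream is any writable Stream, e.g. File.Create("myfile.bin").
// The encoding only affects Write(string)/Write(char), so any choice works for ints.
using var writer = new BinaryWriter(someStream, Encoding.UTF8);

foreach (var v in hashSet)
    writer.Write((int)v); // 4 bytes per value, little-endian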
That's very barebones but gets you most of the way there, the result is a binary file that you can then read with BinaryReader (or just simply read the thing into a byte[] and compare byte-by-byte). Reading your latest messages as I was typing this up, it appears you already have a binary file, but then it depends on how that file was written, and if it's a proper binary format with more than just the hashset entries, you're well and truly screwed if you don't know the specs. Lastly, I did a cast to int because that's the default underlying type for enums if you leave it unspecified, but it could be any of the integer types. EDIT to add: do note that you can do a binary serialization even if it's something other than primitive types, it just takes a bit more design when deciding how to lay stuff out.
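To make the BinaryWriter suggestion above concrete, here is a minimal round-trip sketch. The class name `HashSetFile` and the 4-byte count prefix are assumptions made for illustration, not something proposed in the thread, and the file format is only valid for files this code wrote itself.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum Test { A = 1, B = 2, C = 3 }

// Hypothetical helper, named for this sketch only.
public static class HashSetFile
{
    // Writes a 4-byte count (an assumption of this sketch), then each
    // value as a 4-byte little-endian int.
    public static void Serialize(HashSet<Test> set, string path)
    {
        using var stream = File.Create(path);
        using var writer = new BinaryWriter(stream);
        writer.Write(set.Count);
        foreach (var value in set)
            writer.Write((int)value);
    }

    // Reads the count prefix back and rebuilds the set by adding each value.
    public static HashSet<Test> Deserialize(string path)
    {
        using var stream = File.OpenRead(path);
        using var reader = new BinaryReader(stream);
        int count = reader.ReadInt32();
        var set = new HashSet<Test>();
        for (int i = 0; i < count; i++)
            set.Add((Test)reader.ReadInt32());
        return set;
    }
}
```

Usage would be `HashSetFile.Serialize(set, "myfile.bin")`. Note that HashSet enumeration order is not guaranteed, so two equal sets may produce different bytes; sorting the values before writing would be needed if the files are to be compared byte-by-byte.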
GitHub
Announcement: BinaryFormatter is being removed in .NET 9 · Issue #9...
Ever since .NET Core 1.0, we in .NET Security have been trying to lay BinaryFormatter to rest. It’s long been known by security practitioners that any deserializer, binary or text, which allows its...
Tracer
TracerOP4w ago
BinaryWriter is not obsolete, right? :blobsweat:
SteveTheMadLad
I would certainly hope it isn't, otherwise a few heads would end up on a pike! Here's a link to the docs in case they help: https://learn.microsoft.com/en-us/dotnet/standard/io/how-to-read-and-write-to-a-newly-created-data-file
How to: Read and write to a newly created data file - .NET
Learn how to read and write to a newly created data file in .NET using the System.IO.BinaryReader and System.IO.BinaryWriter classes.
SteveTheMadLad
But just to reiterate, if you don't have control over the binary file you already have or don't know how it's laid out, there isn't much you can do. Reverse engineering a binary format takes skill, luck, a lot of time and patience and dedication.
Tracer
TracerOP4w ago
Well, I know how the memory is laid out in the file, I have the type of an object that gets dumped into memory and its not untrusted either (it kinda is but not at the same level as accepting whatever data is available)
SteveTheMadLad
Oh okay, then if you know exactly the format you can produce your own binary file laid out identically, and then if you just care about whether they're different and don't need to do anything more sophisticated, a simple byte-by-byte comparison will do it (and even maybe optimize for the case where the respective lengths are different). Hell, given the two byte arrays you could even pass them to a SHA-family hash function and compare hashes for minimum effort. Would be interesting to benchmark vs writing your own for loop, actually.
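The two comparison approaches mentioned above could look roughly like this. It is a sketch assuming .NET 5 or later (for `SHA256.HashData`); the class and method names are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical helper, named for this sketch only.
public static class ByteCompare
{
    // Direct comparison; SequenceEqual already returns false when lengths differ.
    public static bool SameBytes(byte[] a, byte[] b)
        => a.AsSpan().SequenceEqual(b);

    // Hash-based comparison: hashes both arrays and compares the digests.
    public static bool SameHash(byte[] a, byte[] b)
        => SHA256.HashData(a).AsSpan().SequenceEqual(SHA256.HashData(b));
}
```

Hashing has to read every byte of both inputs, so for a one-off comparison the direct version is likely the simpler choice; the hash is mostly useful when one side is stored or transmitted only as a digest.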
Tracer
TracerOP4w ago
Does HashSet by itself introduce some member variables that are shown in memory (ram)? If so, thats also important for me as I want to capture the entire class
SteveTheMadLad
Well, it certainly has some internal state (for sure Capacity and a Comparer). But if you know the layout of the file, you also know whether it contains just the items in the set or something else too, no?
Tracer
TracerOP4w ago
Ehh, not really. I mean, sure, I know the layout, but it's packed with system types which have their own size, internals and so on. That also ends up in the file.
Okay, maybe I'm going XY. Let me rephrase the issue. I have a dump of the memory of some app. I have a disassembly and the related class that contains the data I seek. One of its members is a HashSet of enums (above). I would assume the internals show up in memory alongside the actual values. For trivial types such as int, float and double it's not an issue, as their byte representation is just the value of the type. I need to dump my own HashSet in order to compare it with the dumped one from the other app and see what values it has. I know the possible values, but I don't know which values it actually stores at the moment I dump its memory.
I wanted to write it to a file, as that would be the easiest way to compare these bytes. If it's not possible to dump the whole class but only the public members, then maybe there is a way to dump the value of the pointer (the one pointing to the HashSet) and either display it in the console or write it to a file or whatever.
SteveTheMadLad
Ah, then it really depends on what the tool doing the dump is throwing in there :/ I suspect this may be a losing battle, but I honestly don’t know much at all about memory dumps.
Tracer
TracerOP4w ago
Lets just assume we analyze the memory itself. In my case it only dumps raw bytes from region X to region Y (i.e. Dump everything from 0x1000 to 0x2000) So we just look at the object in memory
SteveTheMadLad
Yeah but do you know exactly how it's serialized? Like, if it's a set of strings but for some reason they're encoded in UTF-16, they're taking up 2x the length of a string, just as an example. Another question would be: why do you need to do this? If you can say. I'm asking purely because someone more experienced might be able to suggest a different approach altogether. Like, I'm not sure whether this applies to you, but if you paste the following prompt into ChatGPT, it does come up with a few suggestions:
if i have a .net memory dump containing a hashset among other things, is there a way I can build a hashset that is perfectly identical?
Whether it's hallucinating or not, is up to you to figure out I'm afraid :KEKW:
Tracer
TracerOP4w ago
Well, in memory it doesnt matter how its encoded as there are no encodings in ram but I understand why you're asking. Its UTF8 with little endianess saved (as nowadays most desktops use LE for some reason) There is sizeof in C#, right? Could I use that + pointer to HashSet to get its value from memory? (ptr + sizeof(data) would give me end iterator) From there i could just look at the mem region, dump bytes by hand and be done
SteveTheMadLad
C# does have sizeof, yes. sizeof(Test) taking the enum from your example (or any enum whose underlying type is int) will return 4.
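A small sketch of what `sizeof` reports, for reference. The `Small` enum and the `Sizes` class are hypothetical names added for illustration; `Unsafe.SizeOf<T>()` lives in `System.Runtime.CompilerServices`.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

public enum Test { A = 1, B = 2, C = 3 }

// Hypothetical enum with an explicit one-byte underlying type.
public enum Small : byte { X = 1 }

// Hypothetical helper, named for this sketch only.
public static class Sizes
{
    // sizeof on an enum type name does not need an unsafe context.
    public static int OfTest() => sizeof(Test);
    public static int OfSmall() => sizeof(Small);

    // For a reference type such as HashSet<Test>, this yields the size of
    // the reference (pointer size), not the size of the object on the heap.
    public static int Of<T>() => Unsafe.SizeOf<T>();
}
```

So `sizeof` helps with the element type, but it does not answer how many bytes the HashSet object itself occupies on the managed heap.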
Tracer
TracerOP4w ago
Thanks all. I will test it and get back to you. Hmm, I was searching for ways to get the size of a variable and I encountered Marshal.SizeOf. Sadly, it can't calculate the size of managed objects. So... how do I actually get that, other than by looking at raw memory and essentially guessing where the object's memory ends?
Anton
Anton4w ago
@Tracer It's implementation defined — might be different on different machines. Does that method work if you add an explicit layout attribute?
Tracer
TracerOP4w ago
you mean [StructLayout(LayoutKind.Explicit)]?
Anton
Anton4w ago
Yes, that attribute, with anything other than Auto in the constructor: Explicit or Sequential.
Tracer
TracerOP4w ago
Its only valid on class or struct declarations :/
Anton
Anton4w ago
Auto might actually mean the struct default here, I'm not sure. Well, yeah, put it on the data class.
Tracer
TracerOP4w ago
okie dokie
Anton
Anton4w ago
as in, reference types, then value types
Tracer
TracerOP4w ago
Okay, Im probably doing something wrong :/
using System.Runtime.InteropServices;

namespace Program {
    enum Test {
        A = 1,
        B = 2,
        C = 3
    }

    [StructLayout(LayoutKind.Explicit)]
    class Program {
        static void Main() {
            HashSet<Test> set = new HashSet<Test>();

            set.Add(Test.A);
            set.Add(Test.C);
            set.Add(Test.B);

            unsafe {
                UInt64 start = (UInt64) (UInt64*) &set;
                UInt64 size = (UInt64) sizeof(HashSet<Test>);
                UInt64 end = start + size;
                int size2 = Marshal.SizeOf(set);
                Console.WriteLine($"Addr: 0x{start:X} - 0x{end:X} | {size}");
                int a = 5; // breakpoint for debugging
            }
        }
    }
}
Anton
Anton4w ago
What are you even doing? I thought you had a class model. There's no way you can do that, I'm pretty sure.
Tracer
TracerOP4w ago
No, I want to get the size of the HashSet to then copy the bytes from debugger based on the memory region (start, start+size)
Anton
Anton4w ago
hold on so is this for serialization into a file? or just to like get the byte representation of the object?
Tracer
TracerOP4w ago
get the byte representation and then dump it into a file as byte[] i guess
Anton
Anton4w ago
it has pointers internal pointers I'm sure
Tracer
TracerOP4w ago
the HashSet you mean?
Anton
Anton4w ago
Yes. Even if it doesn't, you must not rely on it; it's an implementation detail. Serialize it as an array.
Tracer
TracerOP4w ago
I mean, I dont rely on internal pointers. I rely on the actual space it takes in memory
Anton
Anton4w ago
Then reconstruct it when you read it back; that's the best you can do. The HashSet implementation might have linked lists. How are you going to serialize that? How are you going to serialize array fields if you just get the bytes? They are just pointers to some other memory.
Tracer
TracerOP4w ago
Im fine with that at least for now
Anton
Anton4w ago
dude it's going to end up as an invalid object when you read it back from the file do you realize that?
asdf
asdf4w ago
Just do this
Tracer
TracerOP4w ago
I don't want to read it back from the file
Anton
Anton4w ago
you'll have internal invalid pointers ok
Tracer
TracerOP4w ago
the foreach loop will only dump the "frontend" of the HashSet and not the whole hashset
Anton
Anton4w ago
I think there's no such API because it's an implementation detail there might be some internal methods for this look at the .net source code
asdf
asdf4w ago
What is your use case
Tracer
TracerOP4w ago
I want to dump the whole HashSet from memory to file. Thats all i.e. the entire chunk of memory the HashSet lives in, I want to dump its bytes into a file
Anton
Anton4w ago
you won't see the value though you'll see addresses
Tracer
TracerOP4w ago
thats fine for now its fine
Anton
Anton4w ago
yeah so look at the source code for .net for internal methods there's no public stuff for this in the standard lib and will never be
Tracer
TracerOP4w ago
:sadge: how would I call the internal methods though?
Anton
Anton4w ago
Through reflection, of course. You might also look at the spec; it has the implementation details for the object layouts, and you could then reimplement that. But calling into existing functions is probably less work, even if it's through reflection.
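As a rough illustration of the reflection route, the sketch below lists every instance field of a `HashSet<Test>` without hard-coding any field names, since those are implementation details that differ between runtime versions. The class and method names are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public enum Test { A = 1, B = 2, C = 3 }

// Hypothetical helper, named for this sketch only.
public static class HashSetInternals
{
    // Returns the name, declared type and current value of every instance
    // field, private ones included. Which fields exist depends on the runtime.
    public static List<(string Name, Type Type, object Value)> Describe(HashSet<Test> set)
    {
        return typeof(HashSet<Test>)
            .GetFields(BindingFlags.Instance | BindingFlags.NonPublic | BindingFlags.Public)
            .Select(f => (Name: f.Name, Type: f.FieldType, Value: f.GetValue(set)))
            .ToList();
    }
}
```

Printing the result for a small set shows which fields are plain ints and which are references to separate heap objects (arrays, the comparer), which is the distinction discussed above.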
arion
arion4w ago
What is the goal here? Just making it unreadable? Serialize to JSON, convert the string to bytes, add an offset to each byte, then repeat the steps in reverse to deserialize.
Tracer
TracerOP4w ago
Again, I dont want to serialize it to json as that serves no purpose. Thats not what I want. When you create an object, it lives in memory as raw bytes. I want to dump the whole HashSet object from memory to disk as raw bytes
arion
arion4w ago
In C#, a hashset is a managed object its not a struct without references
Tracer
TracerOP4w ago
it still lives in memory
arion
arion4w ago
thats not the point
Tracer
TracerOP4w ago
actually it is
arion
arion4w ago
if you copy a chunk of managed memory and load it back in guess what happens? u have dead pointers whats happens when u access them?
Tracer
TracerOP4w ago
I have never said anything about loading it back
arion
arion4w ago
AV exception
Tracer
TracerOP4w ago
I know how pointers work
arion
arion4w ago
well whats the point then if u not going to load it back?
Tracer
TracerOP4w ago
Please re-read my convo. The point is not to serialize it back and forth. The point is to read it once and see how it's laid out in memory
arion
arion4w ago
here is how its laid out in memory replace managed objects with pointers
Tracer
TracerOP4w ago
I meant as pure bytes, the size of the structure, etc.
arion
arion4w ago
the size is dynamic since its a generic class
Tracer
TracerOP4w ago
I know
arion
arion4w ago
also array is dynamic
Tracer
TracerOP4w ago
but the dynamicity of the structure only affects the actual data. The actual hashset object is static and it points to dynamic data
arion
arion4w ago
then follow it, each field there lives somewhere on the heap. Arrays are managed, it contains additional info. tl;dr a hashset is not a contiguous piece of memory
Tracer
TracerOP4w ago
its data can (and probably will) point to other places but the hashset itself is contiguous, no? otherwise, how would one unserialize it? 🙂 I understand HashSet has internal pointers going everywhere, Thats fine. But I currently want to get only the hashset object. Thats all
SteveTheMadLad
Yes, as far as I know, then of course it can point to other locations for the actual contents. “How would one deserialize it” is barking up the wrong tree imo, because as far as I know e.g. a json serializer will create one for you and just add stuff to it, it’s not like it will look at the internals (which aren’t in a json to begin with), and anyway getting rid of the chance someone would be able to craft a malicious serialized binary to interfere with object creation was exactly why BinaryFormatter was removed. You still haven’t said WHY you want to do this, unless I’ve missed it.
arion
arion4w ago
The "hashset itself" is a reference type. The _entries field is an array (managed type):
private struct Entry
{
    public int HashCode;
    /// <summary>
    /// 0-based index of next entry in chain: -1 means end of chain
    /// also encodes whether this entry _itself_ is part of the free list by changing sign and subtracting 3,
    /// so -2 means end of free list, -3 means index 0 but on free list, -4 means index 1 but on free list, etc.
    /// </summary>
    public int Next;
    public T Value;
}
If T is an unmanaged type, the layout will be the contiguous struct above; depending on the size of your unmanaged T, it might come first or last in memory. ISerializable exists, but that would be serializing the data, which you said you don't want. This "hashset object" you speak of is probably what my image above shows. They don't want to serialize it since that would change the way it looks in memory, or something along those lines; they want a runtime interpretation of the memory at hand, if I am understanding this correctly.
SteveTheMadLad
I get that, what we still don’t know is what the point of all this is
arion
arion4w ago
they want to know the memory layout of the object ig
Tracer
TracerOP4w ago
I have a dump of memory of other process with HashSet<enum> there. I want to get the boundaries of the hashset (its surrounded by other members but thats out of the scope here). In order to do that I need to read the HashSet object from memory
arion
arion4w ago
short answer to that is: yes
Tracer
TracerOP4w ago
something along those lines, yes
arion
arion4w ago
Try asking in #allow-unsafe-blocks for "process dump to HashSet<unmanaged type> contents", though I doubt you'll like the answer they give. They most likely understand the runtime far better than anyone in #help. Also, for note, this will pretty much look like:
pointer (4 or 8 bytes in length)
pointer (4 or 8)
int (4)
int (4)
int (4)
int (4)
Possibly also shuffled around depending on pointer sizes. If you'd rather look in _entries, it might be better, but the "hashset itself" is pretty much just that.
Tracer
TracerOP4w ago
thats what I need so the size would be roughly 32 bytes...? if 64bits
arion
arion4w ago
_entries is an Entry[]. An entry is:
private struct Entry
{
    public int HashCode;
    /// <summary>
    /// 0-based index of next entry in chain: -1 means end of chain
    /// also encodes whether this entry _itself_ is part of the free list by changing sign and subtracting 3,
    /// so -2 means end of free list, -3 means index 0 but on free list, -4 means index 1 but on free list, etc.
    /// </summary>
    public int Next;
    public T Value;
}
T being your type. Yeah, though there are probably thousands of HashSets in the runtime at any point in time, with nothing in there unique enough to be a pattern.
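If the raw bytes of the Entry[] element data have already been located in a dump, decoding them could look like the sketch below. Everything about the layout is an assumption: fields in the order shown in the Entry struct above, a 4-byte enum, no padding (12 bytes per entry), little-endian, and a span that starts at the first element rather than at the array's object header. The names `EntryDump` and `Parse` are invented for illustration.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

public enum Test { A = 1, B = 2, C = 3 }

// Hypothetical decoder, named for this sketch only.
public static class EntryDump
{
    // Assumed size: HashCode (4) + Next (4) + Value (4), no padding.
    public const int EntrySize = 12;

    // Decodes `count` consecutive entries from `data`, which is assumed to
    // begin at the first array element.
    public static List<(int HashCode, int Next, Test Value)> Parse(ReadOnlySpan<byte> data, int count)
    {
        var result = new List<(int HashCode, int Next, Test Value)>(count);
        for (int i = 0; i < count; i++)
        {
            ReadOnlySpan<byte> entry = data.Slice(i * EntrySize, EntrySize);
            int hash = BinaryPrimitives.ReadInt32LittleEndian(entry);
            int next = BinaryPrimitives.ReadInt32LittleEndian(entry.Slice(4));
            int value = BinaryPrimitives.ReadInt32LittleEndian(entry.Slice(8));
            result.Add((hash, next, (Test)value));
        }
        return result;
    }
}
```

The actual field order and padding in a given runtime should be checked against a known HashSet dumped from the same runtime version before trusting the decoded values.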
Tracer
TracerOP4w ago
the pattern is the memory location, that + the size of the hashset would be sufficient
arion
arion4w ago
nvm
Tracer
TracerOP4w ago
in C++ I would just read the bytes directly and/or read the size with a debugger
struct Example {
    int a;
    std::vector<float> b;
    double c;
    std::string d;
    int e;
};
When I have only the offset to e, I can easily get all members by just sizeof'ing everything
arion
arion4w ago
here's an example
arion
arion4w ago
By the way @Tracer, if you want the underlying data for retrieval purposes instead of what you described above, dotMemory (and maybe similar tools) lets you find instances, e.g.:
Tracer
TracerOP4w ago
thats JB Rider, right?
arion
arion4w ago
While dotMemory isn't free, I'm sure there are alternatives. No, this is dotMemory, another program by JetBrains, different from Rider.
Tracer
TracerOP4w ago
oh, thought its a part of Rider, nvm it then
arion
arion4w ago
It's not a part of Rider. For note, on the proc dump above I did a FULL process dump; results may vary with partial dumps.
Tracer
TracerOP4w ago
whats the diff between bytes and retained bytes?
Tracer
TracerOP4w ago
and the bytes is just the size of the object excluding the values its pointers point to?
arion
arion4w ago
yes its that 64 bytes we talked about previously
Tracer
TracerOP4w ago
so thats the value Im looking for alr
arion
arion4w ago
or just throw the thing into dotMemory, search for all instances of HashSet<YourType> if your intention is data retrieval
Tracer
TracerOP4w ago
do you know if maybe VS has some similar functionality for that?
arion
arion4w ago
they probably do i just dont know of it since i try my best to avoid VS like its the plague (since it kinda is)
Tracer
TracerOP4w ago
why?
arion
arion4w ago
its fat bloated
Tracer
TracerOP4w ago
you can select only what you use ¯\_(ツ)_/¯
arion
arion4w ago
I've sold my soul to IntelliJ IDEs, imo its worth it :Kekw:
Tracer
TracerOP4w ago
its paid, so that will always be a huge difference between that and VS
arion
arion4w ago
nope that's changed
Tracer
TracerOP4w ago
right, clion is free for nc use Apparently it isnt..?
arion
arion4w ago
but personally im still on a paid plan rider is now too
Tracer
TracerOP4w ago
yeah that was a big news with rider though I wont use it clion is still paid with 30d trial. If they change that to free for nc use then VS is gonna have some problems with users
arion
arion4w ago
yea, it doesnt show as nc for me on the toolkit sadly