Serialize HashSet into a binary file
Hello. I want to serialize a hashset into a binary file.
I tried using BinaryFormatter but it's obsolete. I don't want to save it as XML or JSON but as pure bytes.
How could I do it?
you have to choose a binary format, bson for example
What are the available formats?
that's a list too long to know, there are too many
you could even write your own
you could even just zip a json
what about just writing HashSet's data as bytes into a file?
you can take all the private fields of an object and write them in a file (and then do the reverse procedure when reading the file), but it's risky because if you screw up something then data is unreadable, and also if framework gets updated that data is unusable
how would it be unreadable?
it's not human readable, so good luck finding the structure of the data again if there's a bug in the algorithm that writes it
consider that you'd have to write the address/name of the fields, loops because there will be arrays, eventually type signatures if you used generics, and so on (nested types, whatever)
the algorithm should not change, at least not in this specific case (write raw data to binary file).
I come from C++ where serialization is as easy as printing stuff to the console, so I dont know all the quirks of C# here
if you want to use unsafe context, get a pointer, and then write the size of the object, i believe you can if you really wish to do so (to a limited extent)
(keeping in mind this should not be used in a production environment, BinaryFormatter has been removed for a reason)
Well, that wont be an issue as Im not writing any production code atm
But how could I get BinaryFormatter to actually work even though it's obsolete? Just disabling warnings doesn't work, as the debugger errors out when BF is constructed
it might be helpful if you explain the root problem you are trying to solve here
trying to get binary formatter to work isn't a good idea - it's just not worth it
if you google 'binary serialization .net core nuget' you can find multiple packages which work on modern .net
i believe my work uses some messagepack library
BinaryFormatter migration guide: Migrate to MessagePack (binary) - ...
Migrate from BinaryFormatter to MessagePack for binary serialization.
(simplest way would be compiling for .net 7, although i wouldn't call it a 'solution')
this is true, but not really the right approach morally
My root problem is to serialize a HashSet, to view its raw bytes and compare them with other HashSet
Hey, as long as it works...
Btw why was BinaryFormatter even deprecated? Whats so dangerous about it?
bunch of reasons: security for example, people shooting themselves in the foot, compatibility between versions too, i believe
wouldn't it make more sense to compare json for example
don't know which data you have but still..
The HashSet im comparing my HashSet with is only available as raw bytes, so cant use json or xml here
so it's not structured data
If by structured data you mean not raw bytes then yes, its not
For the security aspect, you might find this GH issue enlightening: https://github.com/dotnet/runtime/issues/98245
As for the problem at hand, if the HashSet contains primitive objects, you could think of using BinaryWriter this way:
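A minimal sketch of that BinaryWriter approach. The enum members, the `HashSetDump` helper and the leading element count are all assumptions made for illustration, not something known about the file being compared against:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical enum standing in for the one in the question (underlying type int).
public enum Test { A, B, C }

public static class HashSetDump
{
    // Writes the element count, then every element as a 4-byte int.
    // BinaryWriter always writes integers in little-endian order.
    public static void Write(HashSet<Test> set, Stream output)
    {
        using var writer = new BinaryWriter(output, Encoding.UTF8, leaveOpen: true);
        writer.Write(set.Count);
        foreach (Test item in set)
        {
            writer.Write((int)item);
        }
    }
}
```

Note that HashSet does not promise an enumeration order, so for comparison purposes it may be worth sorting the values before writing them.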
That's very barebones but gets you most of the way there, the result is a binary file that you can then read with BinaryReader (or just simply read the thing into a byte[] and compare byte-by-byte).
Reading your latest messages as I was typing this up, it appears you already have a binary file, but then it depends on how that file was written, and if it's a proper binary format with more than just the hashset entries, you're well and truly screwed if you don't know the specs.
Lastly, I did a cast to int because that's the default underlying type for enums if you leave it unspecified, but it could be any of the integer types.
EDIT to add: do note that you can do a binary serialization even if it's something other than primitive types, it just takes a bit more design when deciding how to lay stuff out.
BinaryWriter is not obsolete, right? :blobsweat:
no
I would certainly hope it isn't, otherwise a few heads would end up on a pike!
Here's a link to the docs in case they help:
https://learn.microsoft.com/en-us/dotnet/standard/io/how-to-read-and-write-to-a-newly-created-data-file
But just to reiterate, if you don't have control over the binary file you already have or don't know how it's laid out, there isn't much you can do. Reverse engineering a binary format takes skill, luck, a lot of time and patience and dedication.
Well, I know how the memory is laid out in the file, I have the type of an object that gets dumped into memory and its not untrusted either (it kinda is but not at the same level as accepting whatever data is available)
Oh okay, then if you know exactly the format you can produce your own binary file laid out identically, and then if you just care about whether they're different and don't need to do anything more sophisticated, a simple byte-by-byte comparison will do it (and even maybe optimize for the case where the respective lengths are different).
Hell, given the two byte arrays you could even pass them to a SHA-family hash function and compare hashes for minimum effort. Would be interesting to benchmark vs writing your own for loop, actually.
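Both comparison ideas could look roughly like this. It's only a sketch: `ByteCompare` and its method names are made up, and `SHA256.HashData` needs .NET 5 or later:

```csharp
using System;
using System.Security.Cryptography;

public static class ByteCompare
{
    // Direct comparison: bail out early on a length mismatch,
    // then compare the contents span to span.
    public static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        return a.AsSpan().SequenceEqual(b.AsSpan());
    }

    // Hash comparison: hash both arrays with SHA-256 and compare the digests.
    public static bool SameHash(byte[] a, byte[] b)
    {
        byte[] hashA = SHA256.HashData(a);
        byte[] hashB = SHA256.HashData(b);
        return hashA.AsSpan().SequenceEqual(hashB.AsSpan());
    }
}
```

The two inputs would typically come from `File.ReadAllBytes` on each file. The hash route still has to read every byte of both arrays, so it is unlikely to beat the direct comparison for a one-off check.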
Does HashSet by itself introduce some member variables that are shown in memory (ram)?
If so, thats also important for me as I want to capture the entire class
Well, it certainly has some internal state (for sure Capacity and a Comparer). But if you know the layout of the file, you also know whether it contains just the items in the set or something else too, no?
Ehh, not really. I mean, sure I know the layout but its packed of system types which have their own size, internals and stuff. That also adds into the file
Okay, maybe Im going XY. Let me rephrase the issue
I have a dump of memory of some app. I have a disassembly and the related class that contains the data I seek.
One of its members is a HashSet of enums (above).
I would assume the internals would show up in the memory apart from the actual values.
For trivial types such as int, float, double its not an issue as their byte representation is just the value of such type.
I need to dump my own HashSet in order to compare it with the dumped one from the other app to see what values it has. I know the possible values but I don't know which values it actually stores at the moment I dump its memory.
I wanted to write it to a file as that would be the easiest way to compare these bytes. If its not possible to dump the whole class but only the public members then maybe there is a way to dump the pointer (thats pointing to the HashSet) value and either display it in console or write to file or whatever
Ah, then it really depends on what the tool doing the dump is throwing in there :/ I suspect this may be a losing battle, but I honestly don’t know much at all about memory dumps.
Lets just assume we analyze the memory itself. In my case it only dumps raw bytes from region X to region Y (i.e. Dump everything from 0x1000 to 0x2000)
So we just look at the object in memory
Yeah but do you know exactly how it's serialized? Like, if it's a set of strings but for some reason they're encoded in UTF-16, they're taking up 2x the length of a string, just as an example.
Another question would be: why do you need to do this? If you can say. I'm asking purely because someone more experienced might be able to suggest a different approach altogether.
Like, I'm not sure whether this applies to you, but if you paste the following prompt into ChatGPT, it does come up with a few suggestions:
if i have a .net memory dump containing a hashset among other things, is there a way I can build a hashset that is perfectly identical?
Whether it's hallucinating or not, is up to you to figure out I'm afraid :KEKW:
Well, in memory it doesn't matter how it's encoded as there are no encodings in RAM, but I understand why you're asking. It's UTF8 with little endianness saved (as nowadays most desktops use LE for some reason)
There is sizeof in C#, right? Could I use that + pointer to HashSet to get its value from memory? (ptr + sizeof(data) would give me end iterator)
From there i could just look at the mem region, dump bytes by hand and be done
C# does have sizeof, yes. sizeof(Test) taking the enum from your example (or any enum whose underlying type is int) will return 4.
Thanks all. I will test it and get back to you
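The sizeof behaviour described above can be checked with a small sketch. The enum members and the `SizeChecks` helper are hypothetical; `Unsafe.SizeOf<T>()` is included only as a generic alternative and was not mentioned in the thread:

```csharp
using System.Runtime.CompilerServices;

// Hypothetical enum standing in for the one in the question (underlying type int).
public enum Test { A, B, C }

public static class SizeChecks
{
    // sizeof on an enum type is a compile-time constant equal to the size
    // of its underlying type, and it does not need an unsafe context.
    public static int EnumSize() => sizeof(Test);

    // Generic equivalent that works for any T, evaluated at run time.
    public static int EnumSizeGeneric() => Unsafe.SizeOf<Test>();
}
```

Both only measure the enum value itself. Neither says anything about how much memory a HashSet holding such values occupies.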
hmm, I was searching for ways to get the size of a variable and I encountered Marshal.SizeOf.
Sadly, it can't calculate the size for managed objects.
So.. how to actually get that? Other than by looking at raw memory and essentially guessing where the object's memory ends
@Tracer It's implementation defined, might be different on different machines. Does that method work if you add an explicit layout attribute?
you mean [StructLayout(LayoutKind.Explicit)]?
yes that attribute, with anything other than auto in the constructor
explicit or sequential
Its only valid on class or struct declarations :/
auto might actually mean the struct default here I'm not sure
well yeah
put it on the data class
okie dokie
as in, reference types, then value types
Okay, Im probably doing something wrong :/
what are you even doing
I thought you had a class model
there's no way you can do that, I'm pretty sure
No, I want to get the size of the HashSet to then copy the bytes from debugger based on the memory region (start, start+size)
hold on
so is this for serialization into a file?
or just to like get the byte representation of the object?
get the byte representation and then dump it into a file
as byte[] i guess
it has pointers
internal pointers
I'm sure
the HashSet you mean?
yes
even if it doesn't, you must not rely on it
it's an implementation detail
serialize it as an array
I mean, I dont rely on internal pointers. I rely on the actual space it takes in memory
then reconstruct it when you read it back
that's the best you can do
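A sketch of the "serialize it as an array, then reconstruct it" idea, showing only the read side. It assumes a made-up file layout (an int32 element count followed by one int32 per element) plus a hypothetical enum and helper name:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical enum standing in for the one in the question.
public enum Test { A, B, C }

public static class HashSetRestore
{
    // Rebuilds a set from: [int32 count][int32 value] * count.
    // The set's internal state (buckets, entries, version) is recreated
    // by Add, it is never read from the file.
    public static HashSet<Test> Read(Stream input)
    {
        using var reader = new BinaryReader(input, Encoding.UTF8, leaveOpen: true);
        int count = reader.ReadInt32();
        var set = new HashSet<Test>();
        for (int i = 0; i < count; i++)
        {
            set.Add((Test)reader.ReadInt32());
        }
        return set;
    }
}
```

The rebuilt set holds the same values, but there is no reason to expect its in-memory bytes to match those of the original object.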
the HashSet implementation might
it might have linked lists
how are you going to serialize that?
how are you going to serialize array fields?
if you just get the bytes?
they are just pointers to some other memory
Im fine with that
at least for now
dude it's going to end up as an invalid object when you read it back from the file
do you realize that?
Just do this
I don't want to read it back from the file
you'll have internal invalid pointers
ok
the foreach loop will only dump the "frontend" of the HashSet and not the whole hashset
I think there's no such API
because it's an implementation detail
there might be some internal methods for this
look at the .net source code
What is your use case
I want to dump the whole HashSet from memory to file. Thats all
i.e. the entire chunk of memory the HashSet lives in, I want to dump its bytes into a file
you won't see the value though
you'll see addresses
thats fine
for now its fine
yeah so look at the source code for .net
for internal methods
there's no public stuff for this in the standard lib
and will never be
:sadge:
how would I call the internal methods though?
through reflection
ofc
you might also look at the spec
it has the implementation details for the object layouts
and then reimplement that
but calling into existing functions is probably less work
even if it's through reflection
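As a starting point for the reflection route, the private instance fields of HashSet<T> can at least be listed. This is only a sketch with a made-up helper name; the fields it finds are implementation details and may differ between .NET versions:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class HashSetInternals
{
    // Returns "FieldType FieldName" for every non-public instance field
    // declared on HashSet<T> in the runtime that is currently executing.
    public static string[] DescribePrivateFields<T>()
    {
        return typeof(HashSet<T>)
            .GetFields(BindingFlags.Instance | BindingFlags.NonPublic)
            .Select(f => f.FieldType.Name + " " + f.Name)
            .ToArray();
    }
}
```

Calling `FieldInfo.GetValue` on a live set would then give the current value of each field, which for the array fields is a reference to another heap object rather than inline data.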
what is the goal here? just simply unreadable? serialize to json, convert string to bytes, add offset to each byte, repeat steps in reverse to deserialize
Again, I dont want to serialize it to json as that serves no purpose. Thats not what I want.
When you create an object, it lives in memory as raw bytes. I want to dump the whole HashSet object from memory to disk as raw bytes
In C#, a hashset is a managed object
its not a struct without references
it still lives in memory
thats not the point
actually it is
if you copy a chunk of managed memory
and load it back in
guess what happens?
u have dead pointers
whats happens when u access them?
I have never said anything about loading it back
AV exception
I know how pointers work
well whats the point then if u not going to load it back?
Please re-read my convo. The point is to not serialize it back and forth.
The point is to read it once and see how its laid in memory
here is how its laid out in memory
replace managed objects with pointers
I meant as pure bytes, the size of the structure, etc.
the size is dynamic since its a generic class
I know
also array is dynamic
but the dynamicity of the structure only affects the actual data. The actual hashset object is static and it points to dynamic data
then follow it, each field there lives somewhere on the heap. Arrays are managed, it contains additional info.
tl;dr a hashset is not a contiguous piece of memory
its data can (and probably will) point to other places but the hashset itself is contiguous, no?
otherwise, how would one unserialize it? 🙂
I understand HashSet has internal pointers going everywhere, Thats fine. But I currently want to get only the hashset object. Thats all
Yes, as far as I know, then of course it can point to other locations for the actual contents.
“How would one deserialize it” is barking up the wrong tree imo, because as far as I know e.g. a json serializer will create one for you and just add stuff to it, it’s not like it will look at the internals (which aren’t in a json to begin with), and anyway getting rid of the chance someone would be able to craft a malicious serialized binary to interfere with object creation was exactly why BinaryFormatter was removed.
You still haven’t said WHY you want to do this, unless I’ve missed it.
the "hashset itself" is a reference type
the _entries field is an array (managed type)
if T is an unmanaged type the layout for it will be a contiguous part above, depending on the size of your unmanaged T type it might be first or last in memory
ISerializable exists but that would be serializing the data which you said you dont want.
This "hashset object" you speak of is probably referring to my image above
they dont want to serialize it since it will change the way it looks in memory or something along those lines
they want a runtime interpretation of the memory at hand
if i am understanding this correctly
I get that, what we still don’t know is what the point of all this is
they want to know the memory layout of the object ig
I have a dump of memory of other process with HashSet<enum> there.
I want to get the boundaries of the hashset (its surrounded by other members but thats out of the scope here).
In order to do that I need to read the HashSet object from memory
short answer to that is: yes
something along those lines, yes
try asking in #allow-unsafe-blocks for "process dump to Hashset<unmanaged type> contents", though I doubt you'll like the answer they give
they understand the runtime way more than anyone in #help most likely
also for note, this will pretty much look like
possibly also shuffled around depending on pointer sizes
if you'd rather look in _entries it might be better. but the "hashset itself" is pretty much just that
thats what I need
so the size would be roughly 32 bytes...?
if 64bits
_entries is an Entry[]
an entry is
T being your type
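For reference, a sketch of what such an entry might look like. `EntryMirror<T>` is a hypothetical stand-in that copies the field order of the private `HashSet<T>.Entry` struct as it appears in recent .NET sources; that order is an assumption about an implementation detail, not a guarantee:

```csharp
using System.Runtime.CompilerServices;

// Hypothetical enum standing in for the one in the question (underlying type int).
public enum Test { A, B, C }

// Assumed mirror of the private HashSet<T>.Entry struct.
public struct EntryMirror<T>
{
    public int HashCode; // cached hash code of the value
    public int Next;     // index of the next entry in the same bucket chain
    public T Value;      // the stored element
}

public static class EntryLayout
{
    // Three 4-byte fields when T is an int-backed enum.
    public static int EntrySize() => Unsafe.SizeOf<EntryMirror<Test>>();
}
```

Under that assumption each element of the _entries array would take 12 bytes for an int-backed enum, on top of the array's own header.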
yea, though there's probably thousands of Hashsets in the runtime at one point in time
nothing in there being something unique enough to be a pattern
the pattern is the memory location, that + the size of the hashset would be sufficient
nvm
in C++ I would just read the bytes directly and/or read the size with a debugger
When I have only the offset to e, I can easily get all members by just sizeof'ing everything
here's an example
by the way @Tracer, if you want the underlying data for retrieval purposes instead of what you described above
dotMemory (and maybe similar things)
allow u to find instances
eg.
thats JB Rider, right?
while dotMemory isnt free, im sure there are alternatives
no, this is dotMemory
another program by JetBrains
different from Rider
oh, thought its a part of Rider, nvm it then
its not a part of rider
for note on the proc dump above
i did a FULL process dump
results may vary based on partial dumps
whats the diff between bytes and retained bytes?
its all the bytes connected to the object
https://www.jetbrains.com/help/dotmemory/Getting_Started_with_dotMemory.html#-qn40au_7
and the bytes is just the size of the object excluding the values its pointers point to?
yes
its that 64 bytes we talked about previously
so thats the value Im looking for
alr
or just throw the thing into dotMemory, search for all instances of HashSet<YourType>
if your intention is data retrieval
do you know if maybe VS has some similar functionality for that?
they probably do
i just dont know of it since i try my best to avoid VS
like its the plague (since it kinda is)
why?
its fat
bloated
you can select only what you use ¯\_(ツ)_/¯
I've sold my soul to IntelliJ IDEs, imo its worth it :Kekw:
its paid, so that will always be a huge difference between that and VS
nope
that's changed
but personally im still on a paid plan
rider is free now too
yeah that was a big news with rider though I wont use it
clion is still paid with 30d trial. If they change that to free for nc use then VS is gonna have some problems with users
yea, it doesnt show as nc for me on the toolkit sadly