C#•9mo ago

new Span vs stackalloc

I'm working on a library that is able to read and write memory from a remote computer and was wondering which of these two is better practice?

            public void WriteFloat(uint address, float value)
            {
                Span<byte> memory = stackalloc byte[sizeof(float)];
                BinaryPrimitives.WriteSingleBigEndian(memory, value);
                SetMemory(address, memory, out _);
            }

            public void WriteFloat(uint address, float value)
            {
                var memory = MemoryMarshal.Cast<float, byte>(new Span<float>(ref value));
                BinaryPrimitives.WriteSingleBigEndian(memory, value);
                SetMemory(address, memory, out _);
            }

I feel as though these would produce extremely similar, if not exactly the same, memory layouts, since both create a Span.

28 Replies

Kouhai•9mo ago

Both are fundamentally different. In the first one you're allocating 4 bytes on the stack (because sizeof(float) == 4) and the span basically points to that memory. In the second you're first getting a float* to the parameter value and then reinterpreting it as a byte*. You should go with the first option. It makes it much clearer what you're doing.
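Editor's sketch of the two options side by side, assuming .NET 7 or later (for the `Span<T>(ref T)` constructor) and top-level statements. The helper names `ViaStackalloc` and `ViaReinterpret` are made up for illustration; they return the bytes instead of calling SetMemory so the result can be inspected.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Option 1: a fresh 4-byte stack buffer; the parameter itself is left untouched.
static byte[] ViaStackalloc(float value)
{
    Span<byte> memory = stackalloc byte[sizeof(float)];
    BinaryPrimitives.WriteSingleBigEndian(memory, value);
    return memory.ToArray();
}

// Option 2: view the by-value parameter's own storage as bytes and overwrite it in place.
// Only the callee's copy of `value` is clobbered, because floats are passed by value.
static byte[] ViaReinterpret(float value)
{
    Span<byte> memory = MemoryMarshal.Cast<float, byte>(new Span<float>(ref value));
    BinaryPrimitives.WriteSingleBigEndian(memory, value);
    return memory.ToArray();
}

byte[] a = ViaStackalloc(1.0f);
byte[] b = ViaReinterpret(1.0f);
Console.WriteLine(Convert.ToHexString(a));                            // 3F800000
Console.WriteLine(Convert.ToHexString(a) == Convert.ToHexString(b));  // True
```

Both produce the same four big-endian bytes; the difference is only where those bytes live while they are being written.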

substituteOP•9mo ago

Sure, but the memory is free to be clobbered, which is why the first feels wasteful: it's an extra stack allocation for memory that is already on the stack. In both of these, WriteSingleBigEndian produces the same memory if the host system is big endian (writing value to itself).

Kouhai•9mo ago

Allocating memory on the stack is basically just changing a value of a register, and 4 bytes is pretty much free

substituteOP•9mo ago

Sure, but a reinterpret cast is free at least in languages like C++

Kouhai•9mo ago

In C# it'll pretty much be free after JIT'ing as well. Still, I don't get why you'd reinterpret-cast a passed-in parameter (even in C++) instead of just allocating 4 bytes of memory on the stack. This won't impact your performance at all.

substituteOP•9mo ago

A passed in parameter is already in a register (in x86_64)

Kouhai•9mo ago

Have you benchmarked it in a hot path and found that it actually affects your performance? 😅 Also, a parameter might not be in a register, depending on its size and the other passed-in parameters. In this case, yes, it'll be in a register.

substituteOP•9mo ago

My main concern is stack size. This call is fine, but other calls dealing with larger segments of memory, like float[256], could overflow the stack on the stackalloc.
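Editor's sketch of one common way to keep stackalloc bounded: use the stack only below a fixed size and fall back to the heap above it. The helper name `ToBigEndianBytes` and the 256-byte threshold are assumptions for illustration, not part of the library discussed here.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: serialize floats as big-endian bytes, touching the stack
// only when the payload is small.
static byte[] ToBigEndianBytes(ReadOnlySpan<float> values)
{
    const int StackLimitBytes = 256; // assumed threshold; tune for your call depth

    int byteCount = checked(values.Length * sizeof(float));
    Span<byte> buffer = byteCount <= StackLimitBytes
        ? stackalloc byte[StackLimitBytes]  // fixed-size, small stack allocation
        : new byte[byteCount];              // heap fallback for large payloads
    buffer = buffer[..byteCount];

    for (int i = 0; i < values.Length; i++)
    {
        BinaryPrimitives.WriteSingleBigEndian(
            buffer.Slice(i * sizeof(float), sizeof(float)), values[i]);
    }

    // A real caller would pass `buffer` on to SetMemory; ToArray is only for the demo.
    return buffer.ToArray();
}

Console.WriteLine(ToBigEndianBytes(new float[] { 1.0f }).Length); // 4
Console.WriteLine(ToBigEndianBytes(new float[100]).Length);       // 400
```

Because the stackalloc size is a constant, the worst-case stack usage of the method is known up front regardless of how many floats the caller passes.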

Kouhai•9mo ago

I'm kinda confused, 256 floats would be passed in to the method?

substituteOP•9mo ago

Not in the float-only method

            public void WriteFloats(uint address, Span<float> values)
            {
                foreach (ref var value in values)
                {
                    BinaryPrimitives.WriteSingleBigEndian(MemoryMarshal.Cast<float, byte>(new Span<float>(ref value)), value);
                }
                SetMemory(address, MemoryMarshal.Cast<float, byte>(values), out _);
            }

But other methods, like one taking a block of floats, could take in N, and N could be less than, equal to, or greater than 256. A single stackalloc'd float could be used if the data is moved between iterations, but that also seems like it would be slower than in-place ops.

Kouhai•9mo ago

That case makes more sense to reinterpret-cast. Still, I would use some sort of memory pool and rent from it instead of writing directly to the passed-in values Span.

substituteOP•9mo ago

the underlying api I send the memory to takes byte[] 😢 so I'm trying to avoid any extra overhead before the eventual heap allocation and memcpy

Kouhai•9mo ago

I mean, you can use ArrayPool to rent byte[] :Thonkers:
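Editor's sketch of the ArrayPool suggestion. `SendBigEndian` and its `send` callback are hypothetical stand-ins for whatever eventually takes a byte[] plus a length. Note that `Rent` may return an array longer than requested, so this only works if the receiving call honors the explicit length rather than the array's length.

```csharp
using System;
using System.Buffers;
using System.Buffers.Binary;

// Hypothetical helper: convert floats to big-endian bytes in a rented byte[]
// and hand the array plus the valid byte count to `send`.
static int SendBigEndian(ReadOnlySpan<float> values, Func<byte[], int, int> send)
{
    int byteCount = checked(values.Length * sizeof(float));
    byte[] rented = ArrayPool<byte>.Shared.Rent(byteCount); // may be longer than byteCount
    try
    {
        for (int i = 0; i < values.Length; i++)
        {
            BinaryPrimitives.WriteSingleBigEndian(
                rented.AsSpan(i * sizeof(float), sizeof(float)), values[i]);
        }
        return send(rented, byteCount);
    }
    finally
    {
        // Always give the buffer back, even if `send` throws.
        ArrayPool<byte>.Shared.Return(rented);
    }
}

int sent = SendBigEndian(new float[] { 1.0f, 2.0f }, (buffer, length) => length);
Console.WriteLine(sent); // 8
```

The caller's span is never modified, and steady-state calls avoid a fresh heap allocation per write.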

substituteOP•9mo ago

That's fair. The span in this case isn't guaranteed to be clobberable, but that's only a concern if someone is passing in data that they expect to use afterwards for some reason. Size isn't constant, but I do know that in the usual case the max size is 512 MiB and the absolute max is 1024 MiB for the remote system. No one should be writing all of the memory of the remote system; those are just the amounts it has 😅

Kouhai•9mo ago

Right 😅 I personally think benchmarking different options would be the best way to know if these optimizations are worth it or not. Also, I would suggest asking people in #allow-unsafe-blocks; they are much more knowledgeable about low-level stuff.

substituteOP•9mo ago

Also, my only other hang-up with the ArrayPool is that I want the end user to deal with how they use memory themselves (since you may also have more than one remote system that you are connected to for various things), but I'm sure there's some middle ground.

"if these optimizations are worth it or not"

It's a project to learn more about the newer low-level C# stuff (and because the existing implementations of this are garbage); they are worth it to me 😂, but I get what you mean. I've been in C++ land for a few years, so there's a lot of new C# stuff.

Kouhai•9mo ago

Yeah, C# did change a lot and allows for much more low-level coding now. Honestly, reinterpret casting will work 100%. My only concern is that once someone passes in a Span without realizing it will be changed (not necessarily you, but maybe someone else using the library, for example), it'll cause too many bugs 😅
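Editor's sketch of the bug being described: converting in place rewrites the caller's data. `ToBigEndianInPlace` is a made-up name that mirrors the loop from WriteFloats above; it assumes .NET 7 or later for `new Span<float>(ref value)`.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Mirrors the in-place conversion loop: each float's own storage is overwritten
// with its big-endian byte representation.
static void ToBigEndianInPlace(Span<float> values)
{
    foreach (ref float value in values)
    {
        BinaryPrimitives.WriteSingleBigEndian(
            MemoryMarshal.Cast<float, byte>(new Span<float>(ref value)), value);
    }
}

float[] data = { 1.0f, 2.0f };
ToBigEndianInPlace(data);

// On a little-endian host the caller's array no longer holds 1.0f and 2.0f;
// on a big-endian host the write is a no-op and the values are unchanged.
Console.WriteLine(data[0] == 1.0f);
```

A caller who reuses `data` after the write would see byte-swapped garbage on little-endian machines only, which is exactly the kind of platform-dependent bug that is hard to report.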

substituteOP•9mo ago

Yeah, the optimal solution might end up being the middle ground with a single stack float used as temporary, clobberable storage. But I'm not liking that solution, just because it'll turn into

x -> allocate y -> pass y as span to SetMemory -> allocate z from y via .ToArray()

which is very wasteful. I can probably write an overload for byte[] directly to avoid that cost. That might be the best option tbh. That, or unclobbering it after writing, I guess.

Kouhai•9mo ago

An overload for WriteFloats?

substituteOP•9mo ago

It's not like we lose the actual data; we just swap from little to big endian if the system is little endian.

SetMemory

Kouhai•9mo ago

Oh right

substituteOP•9mo ago

          public void SetMemory(uint address, Span<byte> memory, out uint wrote)
          {
              _com.DebugTarget.SetMemory(address, unchecked((uint)memory.Length), memory.ToArray(), out wrote);
          }

Yeah, I think I'm going to do the middle ground and just write an overload for SetMemory. I don't trust users to read documentation about it clobbering the input if I publish this, so I'll just create a copy that I clobber. Thanks!

Kouhai•9mo ago

Never trust users 😅 People do not even read official language docs

substituteOP•9mo ago

Yeah, and the last thing I need is someone confused because it works on a big-endian machine but not on their little-endian personal computer; they won't even know what is wrong other than "don't work".

@Kouhai Compiled both on SharpLab: the reinterpret cast is fewer instructions in the JIT assembly in debug, but the stackalloc is fewer instructions in release. Interesting results. It is additionally interesting as they're doing the same thing (writing to some stack memory). Time to dive into the rabbit hole. (One just reuses the arg while the other uses new stack.)

Kouhai•9mo ago

You might also wanna try examining the assembly on Godbolt; it has better code generation compared to SharpLab

substituteOP•9mo ago

Huh, I didn't know gb supported C#. I use them for C++ stuff.

stackalloc:

Program:<<Main>$>g__WriteFloat2|0_2(uint,float) (FullOpts):
G_M4409_IG01:  ;; offset=0x0000
       sub      rsp, 24
       vzeroupper 
       xor      eax, eax
       mov      qword ptr [rsp+0x08], rax
       mov      qword ptr [rsp+0x10], 0xCBAE90
G_M4409_IG02:  ;; offset=0x0017
       lea      rax, [rsp+0x08]
       vmovd    ecx, xmm0
       bswap    ecx
       mov      dword ptr [rax], ecx
       cmp      qword ptr [rsp+0x10], 0xCBAE90
       je       SHORT G_M4409_IG03
       call     CORINFO_HELP_FAIL_FAST
G_M4409_IG03:  ;; offset=0x0034
       nop      
G_M4409_IG04:  ;; offset=0x0035
       add      rsp, 24
       ret

reusing the parameter's existing stack slot (reinterpret cast):

Program:<<Main>$>g__WriteFloat|0_1(uint,float) (FullOpts):
G_M5640_IG01:  ;; offset=0x0000
       push     rbp
       sub      rsp, 16
       vzeroupper 
       lea      rbp, [rsp+0x10]
       vmovss   dword ptr [rbp-0x04], xmm0
G_M5640_IG02:  ;; offset=0x0012
       lea      rdi, bword ptr [rbp-0x04]
       mov      eax, 4
       vmovss   xmm0, dword ptr [rbp-0x04]
       vmovd    ecx, xmm0
       bswap    ecx
       cmp      eax, 4
       jb       SHORT G_M5640_IG05
       mov      dword ptr [rdi], ecx
G_M5640_IG03:  ;; offset=0x002D
       add      rsp, 16
       pop      rbp
       ret      
G_M5640_IG04:  ;; offset=0x0033
       call     CORINFO_HELP_OVERFLOW
G_M5640_IG05:  ;; offset=0x0038
       mov      edi, 40
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
       int3

From Godbolt on .NET 8, interesting.

Kouhai•9mo ago

stackalloc's code gen seems to be better 😅

substituteOP•9mo ago

Probably for many values. I ended up adding an overload for taking a byte[] directly.

            public void WriteFloats(uint address, Span<float> values)
            {
                if (BitConverter.IsLittleEndian)
                {
                    var memory = new byte[values.Length * sizeof(float)];
                    BinaryPrimitives.ReverseEndianness(
                        MemoryMarshal.Cast<float, int>(values),
                        MemoryMarshal.Cast<byte, int>(memory));
                    SetMemory(address, memory, out _);
                    return;
                }
                SetMemory(address, MemoryMarshal.Cast<float, byte>(values), out _);
            }

There is a method to reverse the endianness of N values from a span, so I create a copy if I need to reverse the endianness and send that in as byte[] (to avoid another copy); otherwise I just pipe it directly in. I still need to test it, but it should work. Further cleanup can still happen once I've tested it. I may be able to avoid a copy on already big-endian systems if I am able to add extension methods for COM imports (it's a COM class).
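Editor's sketch of a quick check for the copy-then-swap approach, assuming .NET 8 or later (where the span overloads of `BinaryPrimitives.ReverseEndianness` are available). `ToBigEndianCopy` is a made-up name; it isolates the conversion step so it can be tested without the COM target.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.InteropServices;

// Hypothetical helper: return a big-endian byte[] copy of the input floats.
// The input span is only read, never modified.
static byte[] ToBigEndianCopy(ReadOnlySpan<float> values)
{
    var memory = new byte[values.Length * sizeof(float)];
    if (BitConverter.IsLittleEndian)
    {
        // Swap each 32-bit element while copying from source to destination.
        BinaryPrimitives.ReverseEndianness(
            MemoryMarshal.Cast<float, int>(values),
            MemoryMarshal.Cast<byte, int>(memory.AsSpan()));
    }
    else
    {
        // Already big endian: a plain byte copy is enough.
        MemoryMarshal.AsBytes(values).CopyTo(memory);
    }
    return memory;
}

float[] input = { 1.0f, 2.0f };
byte[] output = ToBigEndianCopy(input);
Console.WriteLine(Convert.ToHexString(output)); // 3F80000040000000
Console.WriteLine(input[0] == 1.0f);            // True: the caller's data is intact
```

Taking `ReadOnlySpan<float>` instead of `Span<float>` also documents in the signature that the input will not be clobbered.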
