C#•5mo ago

Micro-optimizing a Z80 emulators' pipeline. Unsafe code

so, i'm writing an emulator, and i'm trying to squeeze as much perf as i can out of hot paths. this seemingly simple fetch operation consumes about a third of the CPU time:

private byte Fetch()
{
    return _memory.Read(Registers.PC++);
}

private byte Fetch()
{
    return _memory.Read(Registers.PC++);
}

my memory class looks like this:

private GCHandle _memHandle;
private byte* pMem;
private byte[] _memory;

public MainMemory(int size) 
{
    // pin array and get GC ptr. omitted for brevity.
    pMem = (byte*)_memHandle.AddrOfPinnedObject();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public byte Read(ushort address) => pMem[address];

private GCHandle _memHandle;
private byte* pMem;
private byte[] _memory;

public MainMemory(int size) 
{
    // pin array and get GC ptr. omitted for brevity.
    pMem = (byte*)_memHandle.AddrOfPinnedObject();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public byte Read(ushort address) => pMem[address];

i know, it's a bit messy, and it's not really safe either, but boy does it give some perf gains. it's worth it. also, the array wont change size so it should be alright. my registers are in a similar situation. the register set is actually an array of bytes that are accessed using constant indexers into said array, like this:

private GCHandle _regSetHandle;
private byte* pRegSet;
public byte[] RegisterSet;
public ProcessorRegisters()
{
    RegisterSet = new byte[26];
    _regSetHandle = GCHandle.Alloc(RegisterSet, GCHandleType.Pinned);
    pRegSet = (byte*)_regSetHandle.AddrOfPinnedObject();
}
// example
byte regA = pRegSet[Registers.A]; // A is an indexer into the array; i tried to follow the Z80's convention, so A = 7

private GCHandle _regSetHandle;
private byte* pRegSet;
public byte[] RegisterSet;
public ProcessorRegisters()
{
    RegisterSet = new byte[26];
    _regSetHandle = GCHandle.Alloc(RegisterSet, GCHandleType.Pinned);
    pRegSet = (byte*)_regSetHandle.AddrOfPinnedObject();
}
// example
byte regA = pRegSet[Registers.A]; // A is an indexer into the array; i tried to follow the Z80's convention, so A = 7

but, this is a Z80, meaning it also has 16-bit register pairs. this is important, because you can either access it as its high and low parts, or its entire pair, meaning that the exposed pairs depend on this same array, so i implemented them using properties

public ushort PC
{
    get => (ushort)((pRegSet[PCi] << 8) | pRegSet[PCiL]);
    set
    {
        pRegSet[PCi] = (byte)(value >> 8);
        pRegSet[PCiL] = (byte)value;
    }
}

public ushort PC
{
    get => (ushort)((pRegSet[PCi] << 8) | pRegSet[PCiL]);
    set
    {
        pRegSet[PCi] = (byte)(value >> 8);
        pRegSet[PCiL] = (byte)value;
    }
}

with all of this in mind, how can i make that fetch instruction faster and use less CPU time?

151 Replies

Buddy•5mo ago

This might be something for #allow-unsafe-blocks

PdawgOP•5mo ago

should i forward a link to this thread, or repost there?

Buddy•5mo ago

Probably forward it maybe? just say that you were redirected there

PdawgOP•5mo ago

okay

Aaron•5mo ago

if that's your Fetch? there really isn't much you can do

PdawgOP•5mo ago

there's no crazy pointer magic left for me to do?

Aaron•5mo ago

they're not magic, sadly

PdawgOP•5mo ago

lol what about interfaces? is there anything i can do to optimize properties in interfaces? the interrupt handler is another hot path that also is unavoidable

Aaron•5mo ago

interfaces?

PdawgOP•5mo ago

PdawgOP•5mo ago

the databus is represented by an interface it has to be because this emulator is (going to be) a library

Aaron•5mo ago

Fetch/Read wouldn't happen to be interface methods, would they

PdawgOP•5mo ago

no no lol ^ im talking about properties here

Aaron•5mo ago

:j_phew:

PdawgOP•5mo ago

/// <summary>
/// Maskable Interrupt raised
/// </summary>
bool MI { get; set; }

/// <summary>
/// Non-maskable Interrupt raised.
/// </summary>
bool NMI { get; set; }

/// <summary>
/// Maskable Interrupt raised
/// </summary>
bool MI { get; set; }

/// <summary>
/// Non-maskable Interrupt raised.
/// </summary>
bool NMI { get; set; }

^ this

Buddy•5mo ago

You can use $paste to send full code snippets

MODiX•5mo ago

If your code is too long, you can post to https://paste.mod.gg/, save, and copy the link into chat for others to see your shared code!

Aaron•5mo ago

I mean interfaces aren't fast, but there's not much you can do about that

PdawgOP•5mo ago

simply checking both every clock cycle takes a serious amount of CPU time it has to be done tho cause, ya know, thats how cpus work

PdawgOP•5mo ago

15% 😭 hmm - i guess ill pack em into a byte or something so that its only reading once per instruction marginal improvement but it gave me another ~600k instructions per second so ill take it

PdawgOP•5mo ago

this is fast enough righttt?

PdawgOP•5mo ago

this is in release mode 100 million instructions a second, it's good enough. way better than anything else i've tried in C#. the fastest one behind Z80Sharp was 30 million/s

Klarth•5mo ago

I'd say that anything around an order of magnitude loss is expected for CPU emulation. Pushing much further than you are will probably require codegen to native.

PdawgOP•5mo ago

i actually haven't tried AoT yet, i wonder how fast it'd go

PdawgOP•5mo ago

as expected, its a bit slower. i could work towards AoT perf more but it's whatever

Klarth•5mo ago

I don't mean AOT. That's going to run slower at steady state than a very warm JIT. I mean to codegen the z80 into x64 assembly and then run x64 assembly. Sometimes called dynamic recompilation.

PdawgOP•5mo ago

ah that’s what I was about to ask I am terrible at modern x86(_64) assembly so I’ll steer clear lol. z80 and ARMv6 ASM is where it’s at 😭

Klarth•5mo ago

You can't do this across the entire software being emulated, but you can for stretches if you have good detection. Software also tends to codegen in RAM, so you need to ensure that it's the same when you rerun cached x64 output.

PdawgOP•5mo ago

earlier today I was doing some bug fixing and finally got BASIC to actually execute though! RLCA and inc ix/iy were messed up (idk how I even messed that up, it’s so simple. wasn’t paying attention I guess.)

cap5lut•5mo ago

more guessing here, but if u store the register array internally as ushort[] instead of byte[], it will be aligned to 2 byte, then for the 16 bit register u can do just one aligned read/write, instead of reading/writing/bit shifting the bytes and if u use an inline array for the registers and pin ProcessorRegisters, u can probably also safe one indirection, which might improve performance

Klarth•5mo ago

I assume it's that way because they didn't want to worry about endianness on big endian systems.

cap5lut•5mo ago

nothing a bit of BinaryPrimitives usage cant fix ;p

PdawgOP•5mo ago

I like where you’re going, but most registers are used in their 8-bit forms, so it makes sense to leave them as such. The reason I specifically optimized the pairs was because the PC register is constantly having to be read from/written to. I may actually move it out to its own ushort tho, as i don’t think I really need to address it as individual 8-bit chunks

cap5lut•5mo ago

well, u can still keep the byte pointer to read the 8 bit registers individually, it would only benefit that ushort, but has no effects on the others. having them in the same continuous memory will also help with caching, i would assume, so having them all on the same cache line might be important

PdawgOP•5mo ago

I pinned the byte array. is it inline by default? or does indexing into it like pRegSet[7] //A indexer is 7 still require some lookup I’d assume its inline - that would make the most sense

cap5lut•5mo ago

pinning an array just means the gc wont move it when compacting, thats irrelevant for the cpu cache. but if u have for example

struct Registers
{
  byte[] _registers;
  ushort _otherRegister;
}

struct Registers
{
  byte[] _registers;
  ushort _otherRegister;
}

u ur array is somewhere else in the memory than ur ushort register

PdawgOP•5mo ago

yeah I know what pinning does, but I’m asking if the elements in an array are already inline. normally, the standard allocator in C or smth gives you a contiguous block of memory. is that not the same in C#?

Klarth•5mo ago

See https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-12#inline-arrays A type will otherwise hold a pointer to an array.

PdawgOP•5mo ago

ooh that looks cool but the only thing I don’t like is having to define each element individually

cap5lut•5mo ago

with something like this u can keep the registers close to each other:

public struct Registers
{
    private Registers8Bit _normalRegisters;
    private ushort _specifalRegister;
    
    [InlineArray(24)]
    private struct Registers8Bit
    {
        private byte _element0;
    }
}

public struct Registers
{
    private Registers8Bit _normalRegisters;
    private ushort _specifalRegister;
    
    [InlineArray(24)]
    private struct Registers8Bit
    {
        private byte _element0;
    }
}

PdawgOP•5mo ago

ah it just generates the whole set for you?

cap5lut•5mo ago

afaik as long as the array size is as max as big as the page size, they will be physically continuous

PdawgOP•5mo ago

I’ll try this tmrw, thanks

cap5lut•5mo ago

yes, and it will take care of aligning correctly

PdawgOP•5mo ago

awesome i wonder if exposing the system RAM in a similar way could have perf gains

Klarth•5mo ago

An exterior array on the heap*

PdawgOP•5mo ago

@cap5lut i tried an inline array and broke 100MIPS! however, i think we can go faster. this is my current impl:

[InlineArray(24)]
public struct RegisterArray
{
    private byte _element0;
}
public RegisterArray RegisterSet;

public ProcessorRegisters()
{
    RegisterSet = new RegisterArray();        
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private ref byte GetRegisterRef(byte index)
{
    return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index);
}

[InlineArray(24)]
public struct RegisterArray
{
    private byte _element0;
}
public RegisterArray RegisterSet;

public ProcessorRegisters()
{
    RegisterSet = new RegisterArray();        
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private ref byte GetRegisterRef(byte index)
{
    return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index);
}

which is used like this:

public ushort BC
{
    get => (ushort)((GetRegisterRef(B) << 8) | GetRegisterRef(C));
    set
    {
        GetRegisterRef(B) = (byte)(value >> 8);
        GetRegisterRef(C) = (byte)value;
    }
}

public ushort BC
{
    get => (ushort)((GetRegisterRef(B) << 8) | GetRegisterRef(C));
    set
    {
        GetRegisterRef(B) = (byte)(value >> 8);
        GetRegisterRef(C) = (byte)value;
    }
}

it doesnt seem like you can pin an inline array since its a value type, so you cant cache the pointer to it. unfortunately, the GetRegisterRef code is using a fair bit of CPU time now

PdawgOP•5mo ago

the generated code has a lot of overhead

ero•5mo ago

Pretty sure ref RegisterSet[index] should just work?

PdawgOP•5mo ago

would that introduce bound checks?

PdawgOP•5mo ago

the answer is yes

ero•5mo ago

Not sure if you really want to eliminate those...

PdawgOP•5mo ago

i do this array is a fixed size and is only accessible internally they tank the perf

Klarth•5mo ago

I'd try to skip the shift + or in favor of a single 2-byte read. I'm assuming these registers are contiguous in the array. Then all you need to do is use BinaryPrimitives to ensure the correct endian.

ero•5mo ago

What even are you using to benchmark? Surely you're not replying on a stopwatch or something

PdawgOP•5mo ago

im using visual studios profiler oh you mean for the instr/s it's exactly what it says it is how many instructions executed in the last second theres a cycle counter they are indeed, but i need unsafe access to avoid bound checking which doesnt seem to be possible w/o using return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index); atleast in the context of an inline array going back to a regular array gives me the ability to cache a pointer, but it's still a bit slower than the current impl

Klarth•5mo ago

Maybe the pointer route, but I'm not really sure.

ero•5mo ago

I mean you could try a fixed size array?

public fixed byte RegisterSet[24];

public fixed byte RegisterSet[24];

PdawgOP•5mo ago

how would you pin that on the heap tho? you cant use a fixed buffer in an unfixed expression

ero•5mo ago

I'm not really sure how you're looking to use it

PdawgOP•5mo ago

hmm, i can get an unsafe ref using fixed though meh, its slower i just need an array that i can access directly without using bound checks. but i think Klarth's idea with 2-byte reads sounds like a good idea i need to try that

jen•5mo ago

nativememory and a pointer

PdawgOP•5mo ago

hmmm good idea

ero•5mo ago

i just don't really believe that a fixed buffer is slower, it should be faster in theory...

public ushort BC
{
  get
  {
    fixed (byte* pReg = RegisterSet)
      return *(ushort*)(pReg + B);
  }
  set
  {
    fixed (byte* pReg = RegisterSet)
      *(ushort*)(pReg + B) = value;
  }
}

public ushort BC
{
  get
  {
    fixed (byte* pReg = RegisterSet)
      return *(ushort*)(pReg + B);
  }
  set
  {
    fixed (byte* pReg = RegisterSet)
      *(ushort*)(pReg + B) = value;
  }
}

right?

jen•5mo ago

a pointer is just as fast for anything as long as its in cache

ero•5mo ago

right i mean if you're gonna go with nativememory i don't think much is beating it once the block is allocated

PdawgOP•5mo ago

i swear nativememory will be the death of me

ero•5mo ago

Remember to free what you alloc xP

PdawgOP•5mo ago

lol yeah wait nvm im dumb i just wrote this wrong ive been doing this for too long

MODiX•5mo ago

ero

REPL Result: Failure

unsafe readonly struct S
{
  private readonly byte* _registerSet;

  public S()
  {
    _registerSet = (byte*)NativeMemory.AllocZeroed(26);
  }

  public ushort BC
  {
    get => *(ushort*)(_registerSet + B);
    set => *(ushort*)(_registerSet + B) = value;
  }

  public void Dispose()
  {
    NativeMemory.Free(_registerSet);
  }
}

unsafe readonly struct S
{
  private readonly byte* _registerSet;

  public S()
  {
    _registerSet = (byte*)NativeMemory.AllocZeroed(26);
  }

  public ushort BC
  {
    get => *(ushort*)(_registerSet + B);
    set => *(ushort*)(_registerSet + B) = value;
  }

  public void Dispose()
  {
    NativeMemory.Free(_registerSet);
  }
}

Exception: CompilationErrorException

- The name 'B' does not exist in the current context
- The name 'B' does not exist in the current context

- The name 'B' does not exist in the current context
- The name 'B' does not exist in the current context

Compile: 383.812ms | Execution: 0.000ms | React with ❌ to remove this embed.

ero•5mo ago

Is what I'm thinking Oh whoops Also an inline array in a class is crazy

PdawgOP•5mo ago

lol dw its not actually like that in the emu this is a random test project i tried that - and it works...sorta? but soon after execution starts it just dies and idk why i have a feeling theres some memory relocation going on that i don't know about and its breaking everything

PdawgOP•5mo ago

????

Aaron•5mo ago

uh yeah don't do that fixed only fixes something until the block ends

PdawgOP•5mo ago

yeah figured that out sooo, it works, but the endianness is flipped. is there an efficient way to flip it upon reading?

PdawgOP•5mo ago

ret is the correct value which is what i was doing before

Aaron•5mo ago

either do what you do with ret or call BinaryPrimtives.ReverseEndianness

PdawgOP•5mo ago

why is the compiler doing this

PdawgOP•5mo ago

this mysterious function omfg the warning

PdawgOP•5mo ago

WHY 😭

PdawgOP•5mo ago

the current standings. I’ll have to try optimizing the R16 getter method with tanner’s suggestion, but this is looking pretty good! single threaded on a ryzen 7 5800x

cap5lut•5mo ago

nice 😄 do u have an online repo? im quite interested in peeking at it 😄

PdawgOP•5mo ago

it’s private rn, I’ll open it when the emulator core is more or less complete

cap5lut•5mo ago

kk this was btw my thought on how to model the register stuff:

public struct RegisterSet
{
    private ushort _af;
    private ushort _bc;
    private ushort _de;
    private ushort _hl;
    private ushort _sp;
    private ushort _pc;
    
    [UnscopedRef] public ref ushort AF => ref _af;
    [UnscopedRef] public ref byte A => ref Unsafe.Add(ref F, 1);
    [UnscopedRef] public ref byte F => ref Unsafe.As<ushort, byte>(ref AF);
    [UnscopedRef] public FlagRegister Flags => new FlagRegister(ref F);
    
    [UnscopedRef] public ref ushort BC => ref _bc;
    [UnscopedRef] public ref byte B => ref Unsafe.Add(ref C, 1);
    [UnscopedRef] public ref byte C => ref Unsafe.As<ushort, byte>(ref BC);
    
    [UnscopedRef] public ref ushort DE => ref _de;
    [UnscopedRef] public ref byte D => ref Unsafe.Add(ref E, 1);
    [UnscopedRef] public ref byte E => ref Unsafe.As<ushort, byte>(ref DE);
    
    [UnscopedRef] public ref ushort HL => ref _hl;
    [UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
    [UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);
    
    [UnscopedRef] public ref ushort SP => ref _sp;
    [UnscopedRef] public ref byte SPlo => ref Unsafe.Add(ref SPhi, 1);
    [UnscopedRef] public ref byte SPhi => ref Unsafe.As<ushort, byte>(ref SP);
    
    [UnscopedRef] public ref ushort PC => ref _pc;
}

public struct RegisterSet
{
    private ushort _af;
    private ushort _bc;
    private ushort _de;
    private ushort _hl;
    private ushort _sp;
    private ushort _pc;
    
    [UnscopedRef] public ref ushort AF => ref _af;
    [UnscopedRef] public ref byte A => ref Unsafe.Add(ref F, 1);
    [UnscopedRef] public ref byte F => ref Unsafe.As<ushort, byte>(ref AF);
    [UnscopedRef] public FlagRegister Flags => new FlagRegister(ref F);
    
    [UnscopedRef] public ref ushort BC => ref _bc;
    [UnscopedRef] public ref byte B => ref Unsafe.Add(ref C, 1);
    [UnscopedRef] public ref byte C => ref Unsafe.As<ushort, byte>(ref BC);
    
    [UnscopedRef] public ref ushort DE => ref _de;
    [UnscopedRef] public ref byte D => ref Unsafe.Add(ref E, 1);
    [UnscopedRef] public ref byte E => ref Unsafe.As<ushort, byte>(ref DE);
    
    [UnscopedRef] public ref ushort HL => ref _hl;
    [UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
    [UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);
    
    [UnscopedRef] public ref ushort SP => ref _sp;
    [UnscopedRef] public ref byte SPlo => ref Unsafe.Add(ref SPhi, 1);
    [UnscopedRef] public ref byte SPhi => ref Unsafe.As<ushort, byte>(ref SP);
    
    [UnscopedRef] public ref ushort PC => ref _pc;
}

with FlagRegister being a little helper to set the flags more easily:

public readonly ref struct FlagRegister
{
    private readonly ref byte _register;
    public FlagRegister(ref byte register) => _register = ref register;
    
    public bool Test(RegisterFlag flag) => (byte)(_register & (byte)flag) == (byte)flag;
    public void Set(RegisterFlag flag, bool isSet)
    {
        if (isSet)
        {
            Set(flag);
        }
        else
        {
            Reset(flag);
        }
    }
    public void Set(RegisterFlag flag) => _register |= (byte)flag;
    public void Reset() => _register = 0;
    public void Reset(RegisterFlag flag) => _register &= (byte)~flag;
}

public readonly ref struct FlagRegister
{
    private readonly ref byte _register;
    public FlagRegister(ref byte register) => _register = ref register;
    
    public bool Test(RegisterFlag flag) => (byte)(_register & (byte)flag) == (byte)flag;
    public void Set(RegisterFlag flag, bool isSet)
    {
        if (isSet)
        {
            Set(flag);
        }
        else
        {
            Reset(flag);
        }
    }
    public void Set(RegisterFlag flag) => _register |= (byte)flag;
    public void Reset() => _register = 0;
    public void Reset(RegisterFlag flag) => _register &= (byte)~flag;
}

PdawgOP•5mo ago

I have flag register stuff down already idk if this would be faster tbh - inlinearray has some serious speed gains. it’s even faster than NativeMemory and a pointer I’m eager to try tanners suggestion, always love to see that cpu time go down I think it’ll help a lot because a lot of different instructions depend on GetR16FromHighIndexer

cap5lut•5mo ago

hmmm i forgot about the index lookup doesnt seem too shabby 😄

MODiX•5mo ago

cap5lut

sharplab.io (click here)

static void Test(RegisterSet registers) {
    ref var reg = ref registers.GetR16FromIndex(3);
    reg = 255;
    reg++;
    registers.Flags.Reset();
    registers.Flags.Set(RegisterFlag.Zero);
    registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
    private ushort _af;
// 62 more lines. Follow the link to view.

static void Test(RegisterSet registers) {
    ref var reg = ref registers.GetR16FromIndex(3);
    reg = 255;
    reg++;
    registers.Flags.Reset();
    registers.Flags.Set(RegisterFlag.Zero);
    registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
    private ushort _af;
// 62 more lines. Follow the link to view.

React with ❌ to remove this embed.

PdawgOP•5mo ago

not terrible

PdawgOP•5mo ago

i think it's time this journey comes to an end. it maintains a stable ~120 MIPS on my R7 5700X. i still have to actually fix faulty instruction logic and do a lot of cleanup before it's ready for release, so i think it makes sense to be done with the optimization. when i open source it, feel free to test out optimizations yourselves and make a PR if it doesn't destroy the code too much. thank you all for the suggestions! before optimizing, the CPU time was a mess (see the second image), and it was barely able to hit 5MIPS in debug mode. now, in debug mode, it runs at around 15MIPS. release mode was a similar story, running at just 70MIPS (maybe less before i started, i never thought to check) even with minimal optimization. these past few days have been a great learning experience. thanks again!

PdawgOP•5mo ago

here's a full zexdoc run with profiling enabled in debug mode. there's really not much that can be done about the inlinearray stuff unfortunately, i've tried using constants as much as i can. GetR16FromHighIndexer makes heavy use of that function so that's why it's up there. but overall, i'm satisfied. this took just under 7 minutes. in release mode, it completes the test in about 50 seconds - on real hardware, it takes literal hours!

PdawgOP•5mo ago

cap5lut•5mo ago

whats the current code of GetR16FromHighIndexer?

PdawgOP•5mo ago

same as it has been. I’ve tried using raw pointers, the constant incremented indexers, and the Unsafe.As stuff. somehow, even with this overhead, the JIT is able to optimize the regular inlinearray code better than anything else I would try your idea with the ref stuff, but the registers are also addressed individually a lot, not just in pairs

cap5lut•5mo ago

i had ref byte properties for the 8 bit registers as well

PdawgOP•5mo ago

currently r16 looks like this

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ushort GetR16FromHighIndexer([ConstantExpected] byte indexer) => (ushort)((RegisterSet[indexer] << 8) | RegisterSet[indexer + 1]);

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ushort GetR16FromHighIndexer([ConstantExpected] byte indexer) => (ushort)((RegisterSet[indexer] << 8) | RegisterSet[indexer + 1]);

but the pairs are composed of their 8 bit subsets nvm yours accounts for that

cap5lut•5mo ago

yeah, all 3 properties access the same ushort

    [UnscopedRef] public ref ushort HL => ref _hl;
    [UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
    [UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);

    [UnscopedRef] public ref ushort HL => ref _hl;
    [UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
    [UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);

PdawgOP•5mo ago

yup now, the other issue is, instructions are decoded in a way that makes it easy to use “generic” functions so like LD_R_R can cover multiple instructions that’s why I use an array in the first place LD_R_R takes in byte dest, byte source and just accesses the array with those inputs

cap5lut•5mo ago

but the actual instruction is something like LD C, A, so u could do something like

public void LD_R_R(ref byte dest, ref byte source) => dest = source;

public void LD_R_R(ref byte dest, ref byte source) => dest = source;

and call case <opcode>: LD_R_R(ref _registers.C, ref _registers.A); break;

PdawgOP•5mo ago

ah that might actually be faster yeah a whole hell of a lot of refactoring tho

cap5lut•5mo ago

the 8 bit registers might be in wrong order there btw 😂

PdawgOP•5mo ago

they are, yes, D comes before E, but it’s fine lol for some reason the high register in the pair is a lower indexer except for the AF pair, F comes before A

cap5lut•5mo ago

well, makes sense for little endian

PdawgOP•5mo ago

how much faster do you think that impl would be, if at all? ig I should just write a test and use benchmark.net

cap5lut•5mo ago

i have no clue if its faster at all 😂

PdawgOP•5mo ago

lol I’ll look tmrw

cap5lut•5mo ago

what i can say is, due to mostly handling refs, the JIT can probably optimize better, because it doesnt really have to look into what happens with the value u read and where it ends up

PdawgOP•5mo ago

the only thing that I’m concerned about is the use of Unsafe.As and stuff every time it’s accessed and Unsafe.Add

cap5lut•5mo ago

if u look at this modified sharplab, u can see in the asm, that Unsafe.Add will be resolved to a constant offset. Unsafe.As doesnt even emit any code, thats purely "meta info" for roslyn and the JIT

MODiX•5mo ago

cap5lut

sharplab.io (click here)

static void Test(RegisterSet registers) {
    ref var reg = ref registers.A;
    reg = 255;
    reg++;
    registers.Flags.Reset();
    registers.Flags.Set(RegisterFlag.Zero);
    registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
    private ushort _af;
// 62 more lines. Follow the link to view.

static void Test(RegisterSet registers) {
    ref var reg = ref registers.A;
    reg = 255;
    reg++;
    registers.Flags.Reset();
    registers.Flags.Set(RegisterFlag.Zero);
    registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
    private ushort _af;
// 62 more lines. Follow the link to view.

React with ❌ to remove this embed.

PdawgOP•5mo ago

ah cool I’m on mobile rn lol I would’ve looked otherwise

cap5lut•5mo ago

    L0000: mov byte ptr [esp+5], 0xff
    L0005: movzx eax, byte ptr [esp+5]
    L000a: inc eax
    L000b: mov [esp+5], al

    L0000: mov byte ptr [esp+5], 0xff
    L0005: movzx eax, byte ptr [esp+5]
    L000a: inc eax
    L000b: mov [esp+5], al

thats basically the asm for

    ref var reg = ref registers.A;
    reg = 255;
    reg++;

    ref var reg = ref registers.A;
    reg = 255;
    reg++;

byte ptr [esp+5] is ref registers.A

PdawgOP•5mo ago

nice how does esp get the base ptr tho? or is that on the stack I don’t know my x86 registers well

cap5lut•5mo ago

since i made it a method that takes RegisterSet as a parameter, i guess due to managed calling convention the struct is spilled onto the stack and esp gets a pointer to that, before the actual call happens

PdawgOP•5mo ago

yeah I think ref structs are on the stack makes sense, ESP is apparently the main stack pointer i hate x86_64

MODiX•5mo ago

cap5lut

sharplab.io (click here)

public class Cpu {
    private RegisterSet _registers = new();
    public void LD_R_R(ref byte dst, ref byte src) => dst ...
    public void LD_C_A() => LD_R_R(ref _registers.C, ref _...
}
public struct RegisterSet {
    private ushort _af;
    private ushort _bc;
    private ushort _de;
    private ushort _hl;
// 59 more lines. Follow the link to view.

public class Cpu {
    private RegisterSet _registers = new();
    public void LD_R_R(ref byte dst, ref byte src) => dst ...
    public void LD_C_A() => LD_R_R(ref _registers.C, ref _...
}
public struct RegisterSet {
    private ushort _af;
    private ushort _bc;
    private ushort _de;
    private ushort _hl;
// 59 more lines. Follow the link to view.

React with ❌ to remove this embed.

cap5lut•5mo ago

lots of optimizations/inlining done there

Cpu.LD_R_R(Byte ByRef, Byte ByRef)
    L0000: mov eax, [esp+4]
    L0004: movzx eax, byte ptr [eax]
    L0007: mov [edx], al
    L0009: ret 4

Cpu.LD_C_A()
    L0000: movzx eax, byte ptr [ecx+5]
    L0004: mov [ecx+6], al
    L0007: ret

Cpu.LD_R_R(Byte ByRef, Byte ByRef)
    L0000: mov eax, [esp+4]
    L0004: movzx eax, byte ptr [eax]
    L0007: mov [edx], al
    L0009: ret 4

Cpu.LD_C_A()
    L0000: movzx eax, byte ptr [ecx+5]
    L0004: mov [ecx+6], al
    L0007: ret

and now esp is actually the Cpu this

PdawgOP•5mo ago

damn. what about if you call ld_r_r from a switch? Same outcome? in the real emu it’s like case whatever: LD_R_R(ref _registers.C, ref _registers.A);

cap5lut•5mo ago

well, since u will have mostly a switch over almost the whole byte range, i doubt there will be much inlining

PdawgOP•5mo ago

alr, I’ll benchmark it all tmrw in a test project. I gotta go to bed lol, it’s 01:30 thanks for doing all of this research!

cap5lut•5mo ago

if u have methods like LD_C_A that call LD_R_R internally, it might be better if there is no inlining, because then its basically all about keeping this in a register and the switch will become a hug jump table but iirc then u will really have to do all 256 cases

PdawgOP•5mo ago

taking a look at the sharplab code rn - that r16 block is huge 😭, but you can get better code gen if you remove the exception...and its a relative jump anyways so most of this code is not actually called at runtime

PdawgOP•5mo ago

going through and implementing the new stuff

PdawgOP•5mo ago

💀 ive been doing this for 3 hours and it still doesnt work the instruction logic all looks the same and the reg set appears to be working

cap5lut•5mo ago

yeah on its own it looks quite big, but the actual inlined code is much smaller, if u use constants:

public class Cpu
{
    private RegisterSet _registers = new();

    public void LD_R_R(ref byte dst, ref byte src) => dst = src;
    public void LD_R_R(ref ushort dst, ref ushort src) => dst = src;

    public void LD_C_A() => LD_R_R(ref _registers.C, ref _registers.A);
    
    public void Test()
    {
        LD_R_R(ref _registers.GetR16FromIndex(1), ref _registers.GetR16FromIndex(2));
    }
}

public class Cpu
{
    private RegisterSet _registers = new();

    public void LD_R_R(ref byte dst, ref byte src) => dst = src;
    public void LD_R_R(ref ushort dst, ref ushort src) => dst = src;

    public void LD_C_A() => LD_R_R(ref _registers.C, ref _registers.A);
    
    public void Test()
    {
        LD_R_R(ref _registers.GetR16FromIndex(1), ref _registers.GetR16FromIndex(2));
    }
}

Cpu.Test()
    L0000: add ecx, 4
    L0003: mov eax, ecx
    L0005: cmp [eax], al
    L0007: add eax, 2
    L000a: movzx edx, word ptr [ecx+4]
    L000e: mov [eax], dx
    L0011: ret

Cpu.Test()
    L0000: add ecx, 4
    L0003: mov eax, ecx
    L0005: cmp [eax], al
    L0007: add eax, 2
    L000a: movzx edx, word ptr [ecx+4]
    L000e: mov [eax], dx
    L0011: ret

PdawgOP•5mo ago

yeah. I mean I don’t even need to use that method at all anymore with the refs

cap5lut•5mo ago

indeed xD

PdawgOP•5mo ago

Since the instructions that handle 16bit regs use ref ushort but yeah idk why but the values kept getting shifted around, like what was supposed to be in C appeared in B and C got a different value, and D was behaving weirdly as well I don’t have a screenshot, I’ll get one tmrw

PdawgOP•5mo ago

here's an example left is wrong, right is correct

PdawgOP•5mo ago

this is the entire reg set

cap5lut•5mo ago

the only difference is the program counter right?

PdawgOP•5mo ago

yes notice how the low byte is in the high byte and the low byte is just nothing the actual instruction logic is fine same as it has been

PdawgOP•5mo ago

cap5lut•5mo ago

hmm looks weird for sure, especially as u dont touch PCs as byte registers uhhmmm the hi/lo of SP and following are in wrong order the "low" is the latter byte of both in little endian

PdawgOP•5mo ago

ah whoops I’ll fix it. but I still don’t think that’s the issue as I don’t recall them being used anywhere in documented instructions no clue why PC is behaving like that tho @cap5lut I got it working! (And as I say that, the processor just halted. Lol. I guess I still have some work to do) but for the time that it does run, it hits 80 MIPS on my laptop

PdawgOP•5mo ago

the new registers are barely faster than the old ones i really think this is the absolute limit the jit cant optimize it any more

cap5lut•5mo ago

wait, so its actually slower than the old version?

PdawgOP•5mo ago

no “on my laptop” it got like 70-75MIPS on the old version on my laptop this is my desktop

cap5lut•5mo ago

aaah dont scare me like that 😂

PdawgOP•5mo ago

lol the refactoring wasn’t too bad tbh I just have to reimplement some undocumented instructions

cap5lut•5mo ago

thats always "fun" 😂

PdawgOP•5mo ago

meh there’s only a handful, just some in the index table I already know how I’m gonna do it I’m just not on my pc I think the next step will be to optimize the actual logic, it’s pretty much the only thing left to change

cap5lut•5mo ago

or write a recompiler ;p

PdawgOP•5mo ago

I might for a simpler processor ~~dynarec for the MC14500B~~ finally got home, registers work perfectly now thanks @cap5lut!

PdawgOP•5mo ago

bugfixing in progress!

Gaming

Programming

Micro-optimizing a Z80 emulators' pipeline. **Unsafe code**

Did you find this page helpful?

Micro-optimizing a Z80 emulators' pipeline. Unsafe code