C
C#3mo ago
Pdawg

Micro-optimizing a Z80 emulator's pipeline. **Unsafe code**

so, i'm writing an emulator, and i'm trying to squeeze as much perf as i can out of hot paths. this seemingly simple fetch operation consumes about a third of the CPU time:
private byte Fetch()
{
return _memory.Read(Registers.PC++);
}
my memory class looks like this:
private GCHandle _memHandle;
private byte* pMem;
private byte[] _memory;

public MainMemory(int size)
{
// pin array and get GC ptr. omitted for brevity.
pMem = (byte*)_memHandle.AddrOfPinnedObject();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public byte Read(ushort address) => pMem[address];
i know, it's a bit messy, and it's not really safe either, but boy does it give some perf gains. it's worth it. also, the array wont change size so it should be alright. my registers are in a similar situation. the register set is actually an array of bytes that are accessed using constant indexers into said array, like this:
private GCHandle _regSetHandle;
private byte* pRegSet;
public byte[] RegisterSet;
public ProcessorRegisters()
{
RegisterSet = new byte[26];
_regSetHandle = GCHandle.Alloc(RegisterSet, GCHandleType.Pinned);
pRegSet = (byte*)_regSetHandle.AddrOfPinnedObject();
}
// example
byte regA = pRegSet[Registers.A]; // A is an indexer into the array; i tried to follow the Z80's convention, so A = 7
but, this is a Z80, meaning it also has 16-bit register pairs. this is important, because you can either access it as its high and low parts, or its entire pair, meaning that the exposed pairs depend on this same array, so i implemented them using properties
public ushort PC
{
get => (ushort)((pRegSet[PCi] << 8) | pRegSet[PCiL]);
set
{
pRegSet[PCi] = (byte)(value >> 8);
pRegSet[PCiL] = (byte)value;
}
}
with all of this in mind, how can i make that fetch instruction faster and use less CPU time?
151 Replies
Buddy
Buddy3mo ago
This might be something for #allow-unsafe-blocks
Pdawg
PdawgOP3mo ago
should i forward a link to this thread, or repost there?
Buddy
Buddy3mo ago
Probably forward it maybe? just say that you were redirected there
Pdawg
PdawgOP3mo ago
okay
Aaron
Aaron3mo ago
if that's your Fetch? there really isn't much you can do
Pdawg
PdawgOP3mo ago
there's no crazy pointer magic left for me to do?
Aaron
Aaron3mo ago
they're not magic, sadly
Pdawg
PdawgOP3mo ago
lol what about interfaces? is there anything i can do to optimize properties in interfaces? the interrupt handler is another hot path that also is unavoidable
Aaron
Aaron3mo ago
interfaces?
Pdawg
PdawgOP3mo ago
the databus is represented by an interface. it has to be, because this emulator is (going to be) a library
Aaron
Aaron3mo ago
Fetch/Read wouldn't happen to be interface methods, would they
Pdawg
PdawgOP3mo ago
no no lol ^ im talking about properties here
Aaron
Aaron3mo ago
:j_phew:
Pdawg
PdawgOP3mo ago
/// <summary>
/// Maskable Interrupt raised
/// </summary>
bool MI { get; set; }

/// <summary>
/// Non-maskable Interrupt raised.
/// </summary>
bool NMI { get; set; }
^ this
Buddy
Buddy3mo ago
You can use $paste to send full code snippets
MODiX
MODiX3mo ago
If your code is too long, you can post to https://paste.mod.gg/, save, and copy the link into chat for others to see your shared code!
Aaron
Aaron3mo ago
I mean interfaces aren't fast, but there's not much you can do about that
Pdawg
PdawgOP3mo ago
simply checking both every clock cycle takes a serious amount of CPU time. it has to be done tho, cause, ya know, that's how CPUs work
Pdawg
PdawgOP3mo ago
15% 😭 hmm - i guess i'll pack em into a byte or something so that it's only reading once per instruction. marginal improvement, but it gave me another ~600k instructions per second, so i'll take it
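A rough sketch of the "pack em into a byte" idea, for anyone following along. Everything here is hypothetical: the `InterruptLines` enum, the single-property `IDataBus`, and the bit assignments are assumptions for illustration, not the emulator's real interface. The point is that the hot loop makes one interface call and one compare per instruction, and only decodes the individual bits on the rare path.

```csharp
using System;

[Flags]
public enum InterruptLines : byte
{
    None = 0,
    Maskable = 1 << 0,     // hypothetical bit assignment
    NonMaskable = 1 << 1,  // hypothetical bit assignment
}

// Hypothetical bus interface: one packed property instead of separate MI/NMI bools.
public interface IDataBus
{
    InterruptLines PendingInterrupts { get; set; }
}

public sealed class TestBus : IDataBus
{
    public InterruptLines PendingInterrupts { get; set; }
}

public sealed class Cpu
{
    private readonly IDataBus _bus;
    public int NmiCount { get; private set; }
    public int MiCount { get; private set; }

    public Cpu(IDataBus bus) => _bus = bus;

    public void Step()
    {
        // Single interface call on the hot path.
        InterruptLines pending = _bus.PendingInterrupts;
        if (pending != InterruptLines.None)
            HandleInterrupts(pending);
        // ... fetch/decode/execute would go here ...
    }

    private void HandleInterrupts(InterruptLines pending)
    {
        // NMI has priority over the maskable interrupt on the Z80.
        // Clearing the line here is a simplification for the sketch.
        if ((pending & InterruptLines.NonMaskable) != 0)
        {
            NmiCount++;
            _bus.PendingInterrupts = pending & ~InterruptLines.NonMaskable;
        }
        else if ((pending & InterruptLines.Maskable) != 0)
        {
            MiCount++;
            _bus.PendingInterrupts = pending & ~InterruptLines.Maskable;
        }
    }
}
```

Whether this beats two bool properties depends on how the JIT handles the interface call, so it is worth profiling rather than assuming.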
Pdawg
PdawgOP3mo ago
this is fast enough righttt?
Pdawg
PdawgOP3mo ago
this is in release mode. 100 million instructions a second, it's good enough. way better than anything else i've tried in C#. the fastest one behind Z80Sharp was 30 million/s
Klarth
Klarth3mo ago
I'd say that anything around an order of magnitude loss is expected for CPU emulation. Pushing much further than you are will probably require codegen to native.
Pdawg
PdawgOP3mo ago
i actually haven't tried AoT yet, i wonder how fast it'd go
Pdawg
PdawgOP3mo ago
as expected, its a bit slower. i could work towards AoT perf more but it's whatever
Klarth
Klarth3mo ago
I don't mean AOT. That's going to run slower at steady state than a very warm JIT. I mean to codegen the z80 into x64 assembly and then run x64 assembly. Sometimes called dynamic recompilation.
Pdawg
PdawgOP3mo ago
ah, that’s what I was about to ask. I am terrible at modern x86(_64) assembly so I’ll steer clear lol. z80 and ARMv6 ASM is where it’s at 😭
Klarth
Klarth3mo ago
You can't do this across the entire software being emulated, but you can for stretches if you have good detection. Software also tends to codegen in RAM, so you need to ensure that it's the same when you rerun cached x64 output.
Pdawg
PdawgOP3mo ago
earlier today I was doing some bug fixing and finally got BASIC to actually execute though! RLCA and inc ix/iy were messed up (idk how I even messed that up, it’s so simple. wasn’t paying attention I guess.)
cap5lut
cap5lut3mo ago
more guessing here, but if u store the register array internally as ushort[] instead of byte[], it will be aligned to 2 bytes. then for the 16 bit registers u can do just one aligned read/write, instead of reading/writing/bit shifting the bytes. and if u use an inline array for the registers and pin ProcessorRegisters, u can probably also save one indirection, which might improve performance
Klarth
Klarth3mo ago
I assume it's that way because they didn't want to worry about endianness on big endian systems.
cap5lut
cap5lut3mo ago
nothing a bit of BinaryPrimitives usage cant fix ;p
Pdawg
PdawgOP3mo ago
I like where you’re going, but most registers are used in their 8-bit forms, so it makes sense to leave them as such. The reason I specifically optimized the pairs was because the PC register is constantly having to be read from/written to. I may actually move it out to its own ushort tho, as i don’t think I really need to address it as individual 8-bit chunks
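A minimal sketch of the "move PC out to its own ushort" idea, under the assumption that the rest of the register file stays as it is. The class shapes and the `Load`, `PCHigh` and `PCLow` names are made up for illustration. It uses a plain bounds-checked array so it runs without unsafe code; the pointer-based `Read` from the original post could be swapped in.

```csharp
public sealed class ProcessorRegisters
{
    // PC lives in its own 16-bit field, so PC++ is a single increment
    // rather than two byte reads, a shift/or, and two byte writes.
    public ushort PC;

    // 8-bit views, derived on demand for the rare cases that need them.
    public byte PCHigh => (byte)(PC >> 8);
    public byte PCLow => (byte)PC;
}

public sealed class Cpu
{
    private readonly byte[] _memory = new byte[0x10000]; // full 64 KiB address space
    public readonly ProcessorRegisters Registers = new();

    // Hypothetical helper to place bytes in memory for testing.
    public void Load(ushort address, params byte[] bytes) =>
        bytes.CopyTo(_memory, address);

    // ushort arithmetic wraps from 0xFFFF to 0x0000 in the default unchecked context.
    public byte Fetch() => _memory[Registers.PC++];
}
```

For example, after `cpu.Load(0x0000, 0x3E, 0x42)`, two calls to `cpu.Fetch()` return the two bytes in order and leave `PC` at 2.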
cap5lut
cap5lut3mo ago
well, u can still keep the byte pointer to read the 8 bit registers individually, it would only benefit that ushort, but has no effects on the others. having them in the same continuous memory will also help with caching, i would assume, so having them all on the same cache line might be important
Pdawg
PdawgOP3mo ago
I pinned the byte array. is it inline by default? or does indexing into it like pRegSet[7] (the A indexer is 7) still require some lookup? I’d assume it’s inline - that would make the most sense
cap5lut
cap5lut3mo ago
pinning an array just means the gc wont move it when compacting, thats irrelevant for the cpu cache. but if u have for example
struct Registers
{
byte[] _registers;
ushort _otherRegister;
}
ur array is somewhere else in memory than ur ushort register
Pdawg
PdawgOP3mo ago
yeah I know what pinning does, but I’m asking if the elements in an array are already inline. normally, the standard allocator in C or smth gives you a contiguous block of memory. is that not the same in C#?
Klarth
Klarth3mo ago
See https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-12#inline-arrays A type will otherwise hold a pointer to an array.
Pdawg
PdawgOP3mo ago
ooh that looks cool but the only thing I don’t like is having to define each element individually
cap5lut
cap5lut3mo ago
with something like this u can keep the registers close to each other:
public struct Registers
{
private Registers8Bit _normalRegisters;
private ushort _specialRegister;

[InlineArray(24)]
private struct Registers8Bit
{
private byte _element0;
}
}
Pdawg
PdawgOP3mo ago
ah it just generates the whole set for you?
cap5lut
cap5lut3mo ago
afaik as long as the array size is at most as big as the page size, they will be physically contiguous
Pdawg
PdawgOP3mo ago
I’ll try this tmrw, thanks
cap5lut
cap5lut3mo ago
yes, and it will take care of aligning correctly
Pdawg
PdawgOP3mo ago
awesome i wonder if exposing the system RAM in a similar way could have perf gains
Klarth
Klarth3mo ago
An exterior array on the heap*
Pdawg
PdawgOP3mo ago
@cap5lut i tried an inline array and broke 100MIPS! however, i think we can go faster. this is my current impl:
[InlineArray(24)]
public struct RegisterArray
{
private byte _element0;
}
public RegisterArray RegisterSet;

public ProcessorRegisters()
{
RegisterSet = new RegisterArray();
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private ref byte GetRegisterRef(byte index)
{
return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index);
}
which is used like this:
public ushort BC
{
get => (ushort)((GetRegisterRef(B) << 8) | GetRegisterRef(C));
set
{
GetRegisterRef(B) = (byte)(value >> 8);
GetRegisterRef(C) = (byte)value;
}
}
it doesnt seem like you can pin an inline array since its a value type, so you cant cache the pointer to it. unfortunately, the GetRegisterRef code is using a fair bit of CPU time now
Pdawg
PdawgOP3mo ago
the generated code has a lot of overhead
ero
ero3mo ago
Pretty sure ref RegisterSet[index] should just work?
Pdawg
PdawgOP3mo ago
would that introduce bound checks?
Pdawg
PdawgOP3mo ago
the answer is yes
ero
ero3mo ago
Not sure if you really want to eliminate those...
Pdawg
PdawgOP3mo ago
i do. this array is a fixed size and is only accessible internally. they tank the perf
Klarth
Klarth3mo ago
I'd try to skip the shift + or in favor of a single 2-byte read. I'm assuming these registers are contiguous in the array. Then all you need to do is use BinaryPrimitives to ensure the correct endian.
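A safe sketch of the single 2-byte read, assuming the layout described in the thread (high register at the lower index, so B sits right before C). The indices `B = 0, C = 1` and the `RegisterFile` class are hypothetical. Because the pair is stored high byte first, it is effectively big-endian in memory, and `BinaryPrimitives` can read it correctly on any host.

```csharp
using System;
using System.Buffers.Binary;

public sealed class RegisterFile
{
    // Hypothetical indices; the thread only states that A = 7 and that the
    // high register of a pair has the lower index.
    public const int B = 0, C = 1;

    private readonly byte[] _regs = new byte[26];

    public byte this[int index]
    {
        get => _regs[index];
        set => _regs[index] = value;
    }

    // High byte first in memory means the pair is stored big-endian,
    // so read and write it as such regardless of host endianness.
    public ushort BC
    {
        get => BinaryPrimitives.ReadUInt16BigEndian(_regs.AsSpan(B, 2));
        set => BinaryPrimitives.WriteUInt16BigEndian(_regs.AsSpan(B, 2), value);
    }
}
```

This version still pays for span bounds checks, which is exactly what the thread is trying to avoid, so treat it as a correctness baseline to benchmark the unsafe variants against.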
ero
ero3mo ago
What even are you using to benchmark? Surely you're not relying on a stopwatch or something
Pdawg
PdawgOP3mo ago
im using visual studio's profiler. oh, you mean for the instr/s? it's exactly what it says it is: how many instructions executed in the last second. there's a cycle counter. they are indeed, but i need unsafe access to avoid bounds checking, which doesn't seem to be possible w/o using return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index); at least in the context of an inline array. going back to a regular array gives me the ability to cache a pointer, but it's still a bit slower than the current impl
Klarth
Klarth3mo ago
Maybe the pointer route, but I'm not really sure.
ero
ero3mo ago
I mean you could try a fixed size array?
public fixed byte RegisterSet[24];
Pdawg
PdawgOP3mo ago
how would you pin that on the heap tho? you cant use a fixed buffer in an unfixed expression
ero
ero3mo ago
I'm not really sure how you're looking to use it
Pdawg
PdawgOP3mo ago
hmm, i can get an unsafe ref using fixed though. meh, it's slower. i just need an array that i can access directly without bounds checks. but i think Klarth's idea with 2-byte reads sounds good, i need to try that
alex
alex3mo ago
nativememory and a pointer
Pdawg
PdawgOP3mo ago
hmmm good idea
ero
ero3mo ago
i just don't really believe that a fixed buffer is slower, it should be faster in theory...
public ushort BC
{
get
{
fixed (byte* pReg = RegisterSet)
return *(ushort*)(pReg + B);
}
set
{
fixed (byte* pReg = RegisterSet)
*(ushort*)(pReg + B) = value;
}
}
right?
alex
alex3mo ago
a pointer is just as fast for anything as long as its in cache
ero
ero3mo ago
right i mean if you're gonna go with nativememory i don't think much is beating it once the block is allocated
Pdawg
PdawgOP3mo ago
i swear nativememory will be the death of me
ero
ero3mo ago
Remember to free what you alloc xP
Pdawg
PdawgOP3mo ago
lol yeah. wait nvm im dumb, i just wrote this wrong. ive been doing this for too long
MODiX
MODiX3mo ago
ero
REPL Result: Failure
unsafe readonly struct S
{
private readonly byte* _registerSet;

public S()
{
_registerSet = (byte*)NativeMemory.AllocZeroed(26);
}

public ushort BC
{
get => *(ushort*)(_registerSet + B);
set => *(ushort*)(_registerSet + B) = value;
}

public void Dispose()
{
NativeMemory.Free(_registerSet);
}
}
Exception: CompilationErrorException
- The name 'B' does not exist in the current context
- The name 'B' does not exist in the current context
- The name 'B' does not exist in the current context
- The name 'B' does not exist in the current context
Compile: 383.812ms | Execution: 0.000ms
ero
ero3mo ago
Is what I'm thinking. Oh whoops. Also an inline array in a class is crazy
Pdawg
PdawgOP3mo ago
lol dw its not actually like that in the emu, this is a random test project. i tried that - and it works...sorta? but soon after execution starts it just dies and idk why. i have a feeling theres some memory relocation going on that i don't know about and its breaking everything
Pdawg
PdawgOP3mo ago
????
Aaron
Aaron3mo ago
uh yeah, don't do that. fixed only fixes something until the block ends
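To spell out the point about `fixed`: a pointer taken inside a `fixed` block is only guaranteed valid until that block ends, so caching it in a field is unsafe unless the memory is pinned some other way. One option, sketched below as an alternative to the `GCHandle` approach from the original post, is `GC.AllocateArray` with `pinned: true` (.NET 5+). The `Write` method and the size guard are additions for the sketch, and the project needs `AllowUnsafeBlocks` enabled.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public sealed unsafe class MainMemory
{
    private readonly byte[] _memory; // keeps the pinned array reachable
    private readonly byte* _pMem;

    public MainMemory(int size)
    {
        // Sketch-only guard: every ushort address must be in range,
        // because the pointer accesses below are not bounds-checked.
        if (size < 0x10000) throw new ArgumentOutOfRangeException(nameof(size));

        // Allocated pinned for its whole lifetime, so the GC never moves it
        // and the cached pointer stays valid while _memory is reachable.
        _memory = GC.AllocateArray<byte>(size, pinned: true);
        _pMem = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetArrayDataReference(_memory));
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public byte Read(ushort address) => _pMem[address];

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Write(ushort address, byte value) => _pMem[address] = value;
}
```

Unlike `NativeMemory`, nothing has to be freed by hand here; the array is collected normally once the `MainMemory` instance is unreachable.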
Pdawg
PdawgOP3mo ago
yeah, figured that out. sooo, it works, but the endianness is flipped. is there an efficient way to flip it upon reading?
Pdawg
PdawgOP3mo ago
ret is the correct value which is what i was doing before
Aaron
Aaron3mo ago
either do what you do with ret or call BinaryPrimitives.ReverseEndianness
Pdawg
PdawgOP3mo ago
why is the compiler doing this
Pdawg
PdawgOP3mo ago
this mysterious function omfg the warning
Pdawg
PdawgOP3mo ago
WHY 😭
Pdawg
PdawgOP3mo ago
the current standings. I’ll have to try optimizing the R16 getter method with tanner’s suggestion, but this is looking pretty good! single threaded on a ryzen 7 5800x
cap5lut
cap5lut3mo ago
nice 😄 do u have an online repo? im quite interested in peeking at it 😄
Pdawg
PdawgOP3mo ago
it’s private rn, I’ll open it when the emulator core is more or less complete
cap5lut
cap5lut3mo ago
kk this was btw my thought on how to model the register stuff:
public struct RegisterSet
{
private ushort _af;
private ushort _bc;
private ushort _de;
private ushort _hl;
private ushort _sp;
private ushort _pc;

[UnscopedRef] public ref ushort AF => ref _af;
[UnscopedRef] public ref byte A => ref Unsafe.Add(ref F, 1);
[UnscopedRef] public ref byte F => ref Unsafe.As<ushort, byte>(ref AF);
[UnscopedRef] public FlagRegister Flags => new FlagRegister(ref F);

[UnscopedRef] public ref ushort BC => ref _bc;
[UnscopedRef] public ref byte B => ref Unsafe.Add(ref C, 1);
[UnscopedRef] public ref byte C => ref Unsafe.As<ushort, byte>(ref BC);

[UnscopedRef] public ref ushort DE => ref _de;
[UnscopedRef] public ref byte D => ref Unsafe.Add(ref E, 1);
[UnscopedRef] public ref byte E => ref Unsafe.As<ushort, byte>(ref DE);

[UnscopedRef] public ref ushort HL => ref _hl;
[UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
[UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);

[UnscopedRef] public ref ushort SP => ref _sp;
[UnscopedRef] public ref byte SPlo => ref Unsafe.Add(ref SPhi, 1);
[UnscopedRef] public ref byte SPhi => ref Unsafe.As<ushort, byte>(ref SP);

[UnscopedRef] public ref ushort PC => ref _pc;
}
with FlagRegister being a little helper to set the flags more easily:
public readonly ref struct FlagRegister
{
private readonly ref byte _register;
public FlagRegister(ref byte register) => _register = ref register;

public bool Test(RegisterFlag flag) => (byte)(_register & (byte)flag) == (byte)flag;
public void Set(RegisterFlag flag, bool isSet)
{
if (isSet)
{
Set(flag);
}
else
{
Reset(flag);
}
}
public void Set(RegisterFlag flag) => _register |= (byte)flag;
public void Reset() => _register = 0;
public void Reset(RegisterFlag flag) => _register &= (byte)~flag;
}
Pdawg
PdawgOP3mo ago
I have flag register stuff down already. idk if this would be faster tbh - inlinearray has some serious speed gains. it’s even faster than NativeMemory and a pointer. I’m eager to try tanner's suggestion, always love to see that cpu time go down. I think it’ll help a lot because a lot of different instructions depend on GetR16FromHighIndexer
cap5lut
cap5lut3mo ago
hmmm i forgot about the index lookup. doesnt seem too shabby 😄
MODiX
MODiX3mo ago
cap5lut
sharplab.io (click here)
static void Test(RegisterSet registers) {
ref var reg = ref registers.GetR16FromIndex(3);
reg = 255;
reg++;
registers.Flags.Reset();
registers.Flags.Set(RegisterFlag.Zero);
registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
private ushort _af;
// 62 more lines. Follow the link to view.
Pdawg
PdawgOP3mo ago
not terrible
Pdawg
PdawgOP3mo ago
i think it's time this journey comes to an end. it maintains a stable ~120 MIPS on my R7 5700X. i still have to actually fix faulty instruction logic and do a lot of cleanup before it's ready for release, so i think it makes sense to be done with the optimization. when i open source it, feel free to test out optimizations yourselves and make a PR if it doesn't destroy the code too much. thank you all for the suggestions! before optimizing, the CPU time was a mess (see the second image), and it was barely able to hit 5MIPS in debug mode. now, in debug mode, it runs at around 15MIPS. release mode was a similar story, running at just 70MIPS (maybe less before i started, i never thought to check) even with minimal optimization. these past few days have been a great learning experience. thanks again!
Pdawg
PdawgOP3mo ago
here's a full zexdoc run with profiling enabled in debug mode. there's really not much that can be done about the inlinearray stuff unfortunately, i've tried using constants as much as i can. GetR16FromHighIndexer makes heavy use of that function so that's why it's up there. but overall, i'm satisfied. this took just under 7 minutes. in release mode, it completes the test in about 50 seconds - on real hardware, it takes literal hours!
cap5lut
cap5lut3mo ago
whats the current code of GetR16FromHighIndexer?
Pdawg
PdawgOP3mo ago
same as it has been. I’ve tried using raw pointers, the constant incremented indexers, and the Unsafe.As stuff. somehow, even with this overhead, the JIT is able to optimize the regular inlinearray code better than anything else. I would try your idea with the ref stuff, but the registers are also addressed individually a lot, not just in pairs
cap5lut
cap5lut3mo ago
i had ref byte properties for the 8 bit registers as well
Pdawg
PdawgOP3mo ago
currently r16 looks like this
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public ushort GetR16FromHighIndexer([ConstantExpected] byte indexer) => (ushort)((RegisterSet[indexer] << 8) | RegisterSet[indexer + 1]);
but the pairs are composed of their 8 bit subsets. nvm, yours accounts for that
cap5lut
cap5lut3mo ago
yeah, all 3 properties access the same ushort
[UnscopedRef] public ref ushort HL => ref _hl;
[UnscopedRef] public ref byte H => ref Unsafe.Add(ref L, 1);
[UnscopedRef] public ref byte L => ref Unsafe.As<ushort, byte>(ref HL);
Pdawg
PdawgOP3mo ago
yup. now, the other issue is, instructions are decoded in a way that makes it easy to use “generic” functions, so like LD_R_R can cover multiple instructions. that’s why I use an array in the first place. LD_R_R takes in byte dest, byte source and just accesses the array with those inputs
cap5lut
cap5lut3mo ago
but the actual instruction is something like LD C, A, so u could do something like
public void LD_R_R(ref byte dest, ref byte source) => dest = source;
and call case <opcode>: LD_R_R(ref _registers.C, ref _registers.A); break;
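A cut-down sketch of what that dispatch could look like end to end, under a few assumptions: a little-endian host (so the low register of a pair is at offset 0 of the ushort), only the BC and AF pairs, and only two opcodes. `0x41` (LD B, C) and `0x4F` (LD C, A) follow the Z80's `01 ddd sss` encoding; the `Cpu` and `Execute` shapes are made up for illustration. `[UnscopedRef]` needs C# 11 / .NET 7 or later.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;
using System.Runtime.CompilerServices;

public struct RegisterSet
{
    private ushort _af;
    private ushort _bc;

    // Assumes a little-endian host: the low byte (F, C) sits at offset 0.
    [UnscopedRef] public ref byte F => ref Unsafe.As<ushort, byte>(ref _af);
    [UnscopedRef] public ref byte A => ref Unsafe.Add(ref F, 1);

    [UnscopedRef] public ref ushort BC => ref _bc;
    [UnscopedRef] public ref byte C => ref Unsafe.As<ushort, byte>(ref _bc);
    [UnscopedRef] public ref byte B => ref Unsafe.Add(ref C, 1);
}

public sealed class Cpu
{
    // A field rather than a property, so refs point into this object
    // instead of into a temporary copy of the struct.
    public RegisterSet Registers;

    private static void LD_R_R(ref byte dest, ref byte source) => dest = source;

    public void Execute(byte opcode)
    {
        switch (opcode)
        {
            case 0x41: LD_R_R(ref Registers.B, ref Registers.C); break; // LD B, C
            case 0x4F: LD_R_R(ref Registers.C, ref Registers.A); break; // LD C, A
            default: throw new NotImplementedException($"opcode 0x{opcode:X2}");
        }
    }
}
```

The generic `LD_R_R` still covers every register-to-register load; the operand selection just moves from array indices to the `case` labels.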
Pdawg
PdawgOP3mo ago
ah that might actually be faster, yeah. a whole hell of a lot of refactoring tho
cap5lut
cap5lut3mo ago
the 8 bit registers might be in wrong order there btw 😂
Pdawg
PdawgOP3mo ago
they are, yes. D comes before E, but it’s fine lol. for some reason the high register in the pair is a lower indexer, except for the AF pair, where F comes before A
cap5lut
cap5lut3mo ago
well, makes sense for little endian
Pdawg
PdawgOP3mo ago
how much faster do you think that impl would be, if at all? ig I should just write a test and use benchmark.net
cap5lut
cap5lut3mo ago
i have no clue if its faster at all 😂
Pdawg
PdawgOP3mo ago
lol I’ll look tmrw
cap5lut
cap5lut3mo ago
what i can say is, due to mostly handling refs, the JIT can probably optimize better, because it doesnt really have to look into what happens with the value u read and where it ends up
Pdawg
PdawgOP3mo ago
the only thing that I’m concerned about is the use of Unsafe.As and stuff every time it’s accessed and Unsafe.Add
cap5lut
cap5lut3mo ago
if u look at this modified sharplab, u can see in the asm, that Unsafe.Add will be resolved to a constant offset. Unsafe.As doesnt even emit any code, thats purely "meta info" for roslyn and the JIT
MODiX
MODiX3mo ago
cap5lut
sharplab.io (click here)
static void Test(RegisterSet registers) {
ref var reg = ref registers.A;
reg = 255;
reg++;
registers.Flags.Reset();
registers.Flags.Set(RegisterFlag.Zero);
registers.Flags.Set(RegisterFlag.Carry);
}
public struct RegisterSet {
private ushort _af;
// 62 more lines. Follow the link to view.
Pdawg
PdawgOP3mo ago
ah cool I’m on mobile rn lol I would’ve looked otherwise
cap5lut
cap5lut3mo ago
L0000: mov byte ptr [esp+5], 0xff
L0005: movzx eax, byte ptr [esp+5]
L000a: inc eax
L000b: mov [esp+5], al
thats basically the asm for
ref var reg = ref registers.A;
reg = 255;
reg++;
byte ptr [esp+5] is ref registers.A
Pdawg
PdawgOP3mo ago
nice. how does esp get the base ptr tho? or is that on the stack? I don’t know my x86 registers well
cap5lut
cap5lut3mo ago
since i made it a method that takes RegisterSet as a parameter, i guess due to managed calling convention the struct is spilled onto the stack and esp gets a pointer to that, before the actual call happens
Pdawg
PdawgOP3mo ago
yeah I think ref structs are on the stack. makes sense, ESP is apparently the main stack pointer. i hate x86_64
MODiX
MODiX3mo ago
cap5lut
sharplab.io (click here)
public class Cpu {
private RegisterSet _registers = new();
public void LD_R_R(ref byte dst, ref byte src) => dst ...
public void LD_C_A() => LD_R_R(ref _registers.C, ref _...
}
public struct RegisterSet {
private ushort _af;
private ushort _bc;
private ushort _de;
private ushort _hl;
// 59 more lines. Follow the link to view.
cap5lut
cap5lut3mo ago
lots of optimizations/inlining done there
Cpu.LD_R_R(Byte ByRef, Byte ByRef)
L0000: mov eax, [esp+4]
L0004: movzx eax, byte ptr [eax]
L0007: mov [edx], al
L0009: ret 4

Cpu.LD_C_A()
L0000: movzx eax, byte ptr [ecx+5]
L0004: mov [ecx+6], al
L0007: ret
and now ecx is actually the Cpu this
Pdawg
PdawgOP3mo ago
damn. what about if you call ld_r_r from a switch? Same outcome? in the real emu it’s like case whatever: LD_R_R(ref _registers.C, ref _registers.A);
cap5lut
cap5lut3mo ago
well, since u will have mostly a switch over almost the whole byte range, i doubt there will be much inlining
Pdawg
PdawgOP3mo ago
alr, I’ll benchmark it all tmrw in a test project. I gotta go to bed lol, it’s 01:30 thanks for doing all of this research!
cap5lut
cap5lut3mo ago
if u have methods like LD_C_A that call LD_R_R internally, it might be better if there is no inlining, because then its basically all about keeping this in a register, and the switch will become a huge jump table. but iirc then u will really have to do all 256 cases
Pdawg
PdawgOP3mo ago
taking a look at the sharplab code rn - that r16 block is huge 😭, but you can get better code gen if you remove the exception...and its a relative jump anyways so most of this code is not actually called at runtime
Pdawg
PdawgOP3mo ago
going through and implementing the new stuff
Pdawg
PdawgOP3mo ago
💀 ive been doing this for 3 hours and it still doesnt work. the instruction logic all looks the same and the reg set appears to be working
cap5lut
cap5lut3mo ago
yeah on its own it looks quite big, but the actual inlined code is much smaller, if u use constants:
public class Cpu
{
private RegisterSet _registers = new();

public void LD_R_R(ref byte dst, ref byte src) => dst = src;
public void LD_R_R(ref ushort dst, ref ushort src) => dst = src;

public void LD_C_A() => LD_R_R(ref _registers.C, ref _registers.A);

public void Test()
{
LD_R_R(ref _registers.GetR16FromIndex(1), ref _registers.GetR16FromIndex(2));
}
}
Cpu.Test()
L0000: add ecx, 4
L0003: mov eax, ecx
L0005: cmp [eax], al
L0007: add eax, 2
L000a: movzx edx, word ptr [ecx+4]
L000e: mov [eax], dx
L0011: ret
Pdawg
PdawgOP3mo ago
yeah. I mean I don’t even need to use that method at all anymore with the refs
cap5lut
cap5lut3mo ago
indeed xD
Pdawg
PdawgOP3mo ago
Since the instructions that handle 16bit regs use ref ushort. but yeah, idk why, but the values kept getting shifted around: what was supposed to be in C appeared in B, C got a different value, and D was behaving weirdly as well. I don’t have a screenshot, I’ll get one tmrw
Pdawg
PdawgOP3mo ago
here's an example. left is wrong, right is correct
Pdawg
PdawgOP3mo ago
this is the entire reg set
cap5lut
cap5lut3mo ago
the only difference is the program counter right?
Pdawg
PdawgOP3mo ago
yes. notice how the low byte is in the high byte and the low byte is just nothing. the actual instruction logic is fine, same as it has been
cap5lut
cap5lut3mo ago
hmm looks weird for sure, especially as u dont touch PC as byte registers. uhhmmm, the hi/lo of SP and following are in the wrong order. the "low" is the latter byte of both in little endian
Pdawg
PdawgOP3mo ago
ah whoops, I’ll fix it. but I still don’t think that’s the issue, as I don’t recall them being used anywhere in documented instructions. no clue why PC is behaving like that tho. @cap5lut I got it working! (And as I say that, the processor just halted. Lol. I guess I still have some work to do.) but for the time that it does run, it hits 80 MIPS on my laptop
Pdawg
PdawgOP3mo ago
the new registers are barely faster than the old ones. i really think this is the absolute limit, the jit cant optimize it any more
cap5lut
cap5lut3mo ago
wait, so its actually slower than the old version?
Pdawg
PdawgOP3mo ago
no, “on my laptop”. it got like 70-75MIPS on the old version on my laptop. this is my desktop
cap5lut
cap5lut3mo ago
aaah dont scare me like that 😂
Pdawg
PdawgOP3mo ago
lol. the refactoring wasn’t too bad tbh, I just have to reimplement some undocumented instructions
cap5lut
cap5lut3mo ago
thats always "fun" 😂
Pdawg
PdawgOP3mo ago
meh, there’s only a handful, just some in the index table. I already know how I’m gonna do it, I’m just not on my pc. I think the next step will be to optimize the actual logic, it’s pretty much the only thing left to change
cap5lut
cap5lut3mo ago
or write a recompiler ;p
Pdawg
PdawgOP3mo ago
I might for a simpler processor. dynarec for the MC14500B. finally got home, registers work perfectly now. thanks @cap5lut!
Pdawg
PdawgOP3mo ago
bugfixing in progress!