Micro-optimizing a Z80 emulators' pipeline. **Unsafe code**
so, i'm writing an emulator, and i'm trying to squeeze as much perf as i can out of hot paths. this seemingly simple fetch operation consumes about a third of the CPU time:
my memory class looks like this:
i know, it's a bit messy, and it's not really safe either, but boy does it give some perf gains. it's worth it. also, the array wont change size so it should be alright.
my registers are in a similar situation. the register set is actually an array of bytes that are accessed using constant indexers into said array, like this:
but, this is a Z80, meaning it also has 16-bit register pairs. this is important, because you can either access it as its high and low parts, or its entire pair, meaning that the exposed pairs depend on this same array, so i implemented them using properties
with all of this in mind, how can i make that fetch instruction faster and use less CPU time?
151 Replies
This might be something for #allow-unsafe-blocks
should i forward a link to this thread, or repost there?
Probably forward it maybe?
just say that you were redirected there
okay
if that's your Fetch? there really isn't much you can do
there's no crazy pointer magic left for me to do?
they're not magic, sadly
lol
what about interfaces? is there anything i can do to optimize properties in interfaces?
the interrupt handler is another hot path that also is unavoidable
interfaces?
the databus is represented by an interface
it has to be because this emulator is (going to be) a library
Fetch/Read wouldn't happen to be interface methods, would they
no no lol
^
im talking about properties here
:j_phew:
^ this
You can use $paste to send full code snippets
If your code is too long, you can post to https://paste.mod.gg/, save, and copy the link into chat for others to see your shared code!
I mean interfaces aren't fast, but there's not much you can do about that
simply checking both every clock cycle takes a serious amount of CPU time
it has to be done tho cause, ya know, thats how cpus work
15% 😭
hmm - i guess ill pack em into a byte or something so that its only reading once per instruction
marginal improvement but it gave me another ~600k instructions per second
so ill take it
this is fast enough righttt?
this is in release mode
100 million instructions a second, it's good enough. way better than anything else i've tried in C#. the fastest one behind Z80Sharp was 30 million/s
I'd say that anything around an order of magnitude loss is expected for CPU emulation. Pushing much further than you are will probably require codegen to native.
i actually haven't tried AoT yet, i wonder how fast it'd go
as expected, its a bit slower. i could work towards AoT perf more but it's whatever
I don't mean AOT. That's going to run slower at steady state than a very warm JIT.
I mean to codegen the z80 into x64 assembly and then run x64 assembly. Sometimes called dynamic recompilation.
ah that’s what I was about to ask
I am terrible at modern x86(_64) assembly so I’ll steer clear lol. z80 and ARMv6 ASM is where it’s at 😭
You can't do this across the entire software being emulated, but you can for stretches if you have good detection. Software also tends to codegen in RAM, so you need to ensure that it's the same when you rerun cached x64 output.
earlier today I was doing some bug fixing and finally got BASIC to actually execute though! RLCA and inc ix/iy were messed up (idk how I even messed that up, it’s so simple. wasn’t paying attention I guess.)
more guessing here, but if u store the register array internally as
ushort[]
instead of byte[]
,
it will be aligned to 2 byte, then for the 16 bit register u can do just one aligned read/write, instead of reading/writing/bit shifting the bytes
and if u use an inline array for the registers and pin ProcessorRegisters
, u can probably also safe one indirection, which might improve performanceI assume it's that way because they didn't want to worry about endianness on big endian systems.
nothing a bit of
BinaryPrimitives
usage cant fix ;pI like where you’re going, but most registers are used in their 8-bit forms, so it makes sense to leave them as such. The reason I specifically optimized the pairs was because the PC register is constantly having to be read from/written to. I may actually move it out to its own ushort tho, as i don’t think I really need to address it as individual 8-bit chunks
well, u can still keep the byte pointer to read the 8 bit registers individually, it would only benefit that ushort, but has no effects on the others.
having them in the same continuous memory will also help with caching, i would assume, so having them all on the same cache line might be important
I pinned the byte array. is it inline by default? or does indexing into it like
pRegSet[7] //A indexer is 7
still require some lookup
I’d assume its inline - that would make the most sensepinning an array just means the gc wont move it when compacting, thats irrelevant for the cpu cache.
but if u have for example
u ur array is somewhere else in the memory than ur
ushort
registeryeah
I know what pinning does, but I’m asking if the elements in an array are already inline. normally, the standard allocator in C or smth gives you a contiguous block of memory. is that not the same in C#?
See https://learn.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-12#inline-arrays
A type will otherwise hold a pointer to an array.
ooh that looks cool
but the only thing I don’t like is having to define each element individually
with something like this u can keep the registers close to each other:
ah it just generates the whole set for you?
afaik as long as the array size is as max as big as the page size, they will be physically continuous
I’ll try this tmrw, thanks
yes, and it will take care of aligning correctly
awesome
i wonder if exposing the system RAM in a similar way could have perf gains
An exterior array on the heap*
@cap5lut i tried an inline array and broke 100MIPS! however, i think we can go faster. this is my current impl:
which is used like this:
it doesnt seem like you can pin an inline array since its a value type, so you cant cache the pointer to it. unfortunately, the GetRegisterRef code is using a fair bit of CPU time now
the generated code has a lot of overhead
Pretty sure
ref RegisterSet[index]
should just work?would that introduce bound checks?
the answer is yes
Not sure if you really want to eliminate those...
i do
this array is a fixed size
and is only accessible internally
they tank the perf
I'd try to skip the shift + or in favor of a single 2-byte read. I'm assuming these registers are contiguous in the array. Then all you need to do is use
BinaryPrimitives
to ensure the correct endian.What even are you using to benchmark? Surely you're not replying on a stopwatch or something
im using visual studios profiler
oh you mean for the instr/s
it's exactly what it says it is
how many instructions executed in the last second
theres a cycle counter
they are indeed, but i need unsafe access to avoid bound checking
which doesnt seem to be possible w/o using
return ref Unsafe.Add(ref Unsafe.As<RegisterArray, byte>(ref RegisterSet), index);
atleast in the context of an inline array
going back to a regular array gives me the ability to cache a pointer, but it's still a bit slower than the current implMaybe the pointer route, but I'm not really sure.
I mean you could try a fixed size array?
how would you pin that on the heap tho?
you cant use a fixed buffer in an unfixed expression
I'm not really sure how you're looking to use it
hmm, i can get an unsafe ref using fixed though
meh, its slower
i just need an array that i can access directly without using bound checks. but i think Klarth's idea with 2-byte reads sounds like a good idea
i need to try that
nativememory and a pointer
hmmm good idea
i just don't really believe that a fixed buffer is slower, it should be faster in theory...
right?
a pointer is just as fast for anything as long as its in cache
right i mean if you're gonna go with nativememory i don't think much is beating it once the block is allocated
i swear nativememory will be the death of me
Remember to free what you alloc xP
lol yeah
wait nvm im dumb i just wrote this wrong
ive been doing this for too long
ero
REPL Result: Failure
Exception: CompilationErrorException
Compile: 383.812ms | Execution: 0.000ms | React with ❌ to remove this embed.
Is what I'm thinking
Oh whoops
Also an inline array in a class is crazy
lol dw its not actually like that in the emu
this is a random test project
i tried that - and it works...sorta? but soon after execution starts it just dies and idk why
i have a feeling theres some memory relocation going on that i don't know about
and its breaking everything
????
uh
yeah don't do that
fixed only fixes something until the block ends
yeah figured that out
sooo, it works, but the endianness is flipped. is there an efficient way to flip it upon reading?
ret is the correct value which is what i was doing before
either do what you do with ret or call BinaryPrimtives.ReverseEndianness
why is the compiler doing this
this mysterious function
omfg the warning
WHY 😭
the current standings. I’ll have to try optimizing the R16 getter method with tanner’s suggestion, but this is looking pretty good! single threaded on a ryzen 7 5800x
nice 😄 do u have an online repo? im quite interested in peeking at it 😄
it’s private rn, I’ll open it when the emulator core is more or less complete
kk
this was btw my thought on how to model the register stuff:
with
FlagRegister
being a little helper to set the flags more easily:
I have flag register stuff down already
idk if this would be faster tbh - inlinearray has some serious speed gains. it’s even faster than NativeMemory and a pointer
I’m eager to try tanners suggestion, always love to see that cpu time go down
I think it’ll help a lot because a lot of different instructions depend on GetR16FromHighIndexer
hmmm i forgot about the index lookup
doesnt seem too shabby 😄
not terrible
i think it's time this journey comes to an end. it maintains a stable ~120 MIPS on my R7 5700X.
i still have to actually fix faulty instruction logic and do a lot of cleanup before it's ready for release, so i think it makes sense to be done with the optimization. when i open source it, feel free to test out optimizations yourselves and make a PR if it doesn't destroy the code too much. thank you all for the suggestions! before optimizing, the CPU time was a mess (see the second image), and it was barely able to hit 5MIPS in debug mode. now, in debug mode, it runs at around 15MIPS. release mode was a similar story, running at just 70MIPS (maybe less before i started, i never thought to check) even with minimal optimization. these past few days have been a great learning experience. thanks again!
here's a full zexdoc run with profiling enabled in debug mode. there's really not much that can be done about the inlinearray stuff unfortunately, i've tried using constants as much as i can.
GetR16FromHighIndexer
makes heavy use of that function so that's why it's up there. but overall, i'm satisfied. this took just under 7 minutes. in release mode, it completes the test in about 50 seconds - on real hardware, it takes literal hours!whats the current code of
GetR16FromHighIndexer
?same as it has been. I’ve tried using raw pointers, the constant incremented indexers, and the Unsafe.As stuff. somehow, even with this overhead, the JIT is able to optimize the regular inlinearray code better than anything else
I would try your idea with the ref stuff, but the registers are also addressed individually a lot, not just in pairs
i had
ref byte
properties for the 8 bit registers as wellcurrently r16 looks like this
but the pairs are composed of their 8 bit subsets
nvm yours accounts for that
yeah, all 3 properties access the same
ushort
yup
now, the other issue is, instructions are decoded in a way that makes it easy to use “generic” functions
so like LD_R_R can cover multiple instructions
that’s why I use an array in the first place
LD_R_R takes in
byte dest, byte source
and just accesses the array with those inputsbut the actual instruction is something like
LD C, A
, so u could do something like
and call case <opcode>: LD_R_R(ref _registers.C, ref _registers.A); break;
ah
that might actually be faster yeah
a whole hell of a lot of refactoring tho
the 8 bit registers might be in wrong order there btw 😂
they are, yes, D comes before E, but it’s fine lol
for some reason the high register in the pair is a lower indexer
except for the
AF
pair, F comes before Awell, makes sense for little endian
how much faster do you think that impl would be, if at all?
ig I should just write a test and use benchmark.net
i have no clue if its faster at all 😂
lol I’ll look tmrw
what i can say is, due to mostly handling
ref
s, the JIT can probably optimize better, because it doesnt really have to look into what happens with the value u read and where it ends upthe only thing that I’m concerned about is the use of
Unsafe.As
and stuff every time it’s accessed
and Unsafe.Addif u look at this modified sharplab, u can see in the asm, that
Unsafe.Add
will be resolved to a constant offset.
Unsafe.As
doesnt even emit any code, thats purely "meta info" for roslyn and the JITah cool
I’m on mobile rn lol I would’ve looked otherwise
thats basically the asm for
byte ptr [esp+5]
is ref registers.A
nice
how does esp get the base ptr tho?
or is that on the stack
I don’t know my x86 registers well
since i made it a method that takes
RegisterSet
as a parameter, i guess due to managed calling convention the struct is spilled onto the stack and esp
gets a pointer to that, before the actual call
happensyeah I think ref structs are on the stack
makes sense, ESP is apparently the main stack pointer
i hate x86_64
lots of optimizations/inlining done there
and now
esp
is actually the Cpu this
damn. what about if you call ld_r_r from a switch? Same outcome?
in the real emu it’s like
case whatever: LD_R_R(ref _registers.C, ref _registers.A);
well, since u will have mostly a switch over almost the whole byte range, i doubt there will be much inlining
alr, I’ll benchmark it all tmrw in a test project. I gotta go to bed lol, it’s 01:30
thanks for doing all of this research!
if u have methods like
LD_C_A
that call LD_R_R
internally,
it might be better if there is no inlining, because then its basically all about keeping this
in a register and the switch will become a hug jump table
but iirc then u will really have to do all 256 case
staking a look at the sharplab code rn - that r16 block is huge 😭, but you can get better code gen if you remove the exception...and its a relative jump anyways so most of this code is not actually called at runtime
going through and implementing the new stuff
💀
ive been doing this for 3 hours and it still doesnt work
the instruction logic all looks the same and the reg set appears to be working
yeah on its own it looks quite big, but the actual inlined code is much smaller, if u use constants:
yeah. I mean I don’t even need to use that method at all anymore with the refs
indeed xD
Since the instructions that handle 16bit regs use
ref ushort
but yeah idk why but the values kept getting shifted around, like what was supposed to be in C appeared in B and C got a different value, and D was behaving weirdly as well
I don’t have a screenshot, I’ll get one tmrwhere's an example
left is wrong, right is correct
this is the entire reg set
the only difference is the program counter right?
yes
notice how the low byte is in the high byte
and the low byte is just nothing
the actual instruction logic is fine
same as it has been
hmm looks weird for sure, especially as u dont touch PCs as byte registers
uhhmmm
the
hi
/lo
of SP
and following are in wrong order
the "low" is the latter byte of both in little endianah whoops I’ll fix it. but I still don’t think that’s the issue as I don’t recall them being used anywhere in documented instructions
no clue why PC is behaving like that tho
@cap5lut I got it working!
(And as I say that, the processor just halted. Lol. I guess I still have some work to do)
but for the time that it does run, it hits 80 MIPS on my laptop
the new registers are barely faster than the old ones
i really think this is the absolute limit
the jit cant optimize it any more
wait, so its actually slower than the old version?
no
“on my laptop”
it got like 70-75MIPS on the old version on my laptop
this is my desktop
aaah
dont scare me like that 😂
lol
the refactoring wasn’t too bad tbh
I just have to reimplement some undocumented instructions
thats always "fun" 😂
meh there’s only a handful, just some in the index table
I already know how I’m gonna do it I’m just not on my pc
I think the next step will be to optimize the actual logic, it’s pretty much the only thing left to change
or write a recompiler ;p
I might for a simpler processor
dynarec for the MC14500B
finally got home, registers work perfectly now
thanks @cap5lut!
bugfixing in progress!