List of SIMD bug

The following simple code doesn't work. Is this because of memory? I find the documentation lacking: it only says SIMD sizes are restricted to powers of 2, which I comply with...
alias S: Int = 256
alias S2: Int = S*S
alias SquareMatrix = SIMD[size=S2]

def main():
    var sqms = List[SquareMatrix[DType.float32]]()
    for ind in range(2):  # using 1 instead of 2 gives no issues
        sqm = SquareMatrix[DType.float32]()
        sqms.append(sqm)
    print('success!')  # we never get here :(
sora (2mo ago)
Looks like a bug. Could you please file an issue on GitHub?
franchesoni (OP, 2mo ago)
sure
sora (2mo ago)
Thanks. To be clear, I don't think this code should work; it's just that maybe the compiler could fail more gracefully.
Darkmatter (2mo ago)
LLVM ERROR: SmallVector unable to grow. Requested capacity (18446744073709518848) is larger than maximum value for size type (4294967295)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Running pass 'Function Pass Manager' on module ''.
1. Running pass 'X86 DAG->DAG Instruction Selection' on function '@main'
Segmentation fault (core dumped)
It looks like it's trying to constant fold the loop.
franchesoni (OP, 2mo ago)
Thanks for the responsiveness! It isn't clear to me from the docs why the memory is not enough for a SIMD (while for a pointer it is enough).
Darkmatter (2mo ago)
This is blowing up at compile time, not runtime. My guess is that a bunch of copies of the matrices are being made as part of optimizations, and you're overflowing some limit somewhere.
franchesoni (OP, 2mo ago)
so it wasn't my fault? surprising...
Darkmatter (2mo ago)
This also has the issue:
def main():
    print(SIMD[DType.float32, 65536]())
Even with the optimizer turned off.
sora (2mo ago)
I don't think people should use SIMD like this, though.
Darkmatter (2mo ago)
It doesn't matter, the compiler is blowing up in a bad way here. The requested size is 0xffffffffffff8000
sora (2mo ago)
It does. Compiler passes make certain assumptions about the inputs they process. Or, to put it another way, an upper limit should be set.
franchesoni (OP, 2mo ago)
This makes sense, but if the docs don't state what the intended purpose of SIMD is, you'll get more people like me.
Darkmatter (2mo ago)
If I run that through two's complement, it comes to 32768 (0x8000), which is a reasonable arena size that SIMD should be able to at least kind-of handle. Some of this is on Modular for advertising SIMD as equivalent to np.array.
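As an aside on that arithmetic: 18446744073709518848 == 2**64 - 32768, i.e. the reported capacity is the unsigned reinterpretation of a signed -32768. A minimal Mojo sketch to check this, assuming unsigned SIMD arithmetic wraps modulo 2**64:

def main():
    # UInt64 subtraction wraps, so 0 - 32768 reproduces the
    # "requested capacity" from the crash report above.
    print(UInt64(0) - 32768)  # 18446744073709518848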
sora (2mo ago)
If it generates bad code, then no. But SIMD is not an array; it's supposed to be a type you can roughly pass by register, which Chris said somewhere here.
Darkmatter (2mo ago)
I mean it shouldn't crash, not that it should generate optimal code for this. Probably an "oversized SIMD" lint is in order.
sora (2mo ago)
I mean it shouldn't even try to generate code for this. The passes that give you the most performance are usually also the passes with very, very bad asymptotic behaviour, so some artificial bound needs to be put on the problem size. My guess is the compiler engineers just got lazy and rolled with those assumptions, kind of like how they handled UB. Another example: our InlineArray type will always unroll the `v in a` check, so compile time will blow up and possibly end in OOM or something similar.
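To illustrate that InlineArray point, a minimal sketch (it assumes the `in` check works as described above; the import path may vary by version):

from collections import InlineArray

def main():
    # `3 in a` expands to an element-by-element comparison; with a large
    # size parameter that comparison is fully unrolled at compile time,
    # which is where the compile-time (and memory) blowup comes from.
    var a = InlineArray[Int, 4](1, 2, 3, 4)
    print(3 in a)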
Darkmatter (2mo ago)
Really what needs to happen is for @register_passable to let you specify some formula for when a type isn't register-passable any more, such as @register_passable(< simdbitwidth() * 4).
sora (2mo ago)
We still did it because we don't know how to control the unroll factor in the stdlib.
Darkmatter (2mo ago)
There is a point past which you blow your icache and unrolling is no longer wanted.
sora (2mo ago)
I'm not even talking about the quality of the generated code; just the fact that we are trying to generate code in this scenario is already bad. I think the SIMD problem is first and foremost this case: we won't even reach the point of talking about generated code, because the IR is already too big to process.
Darkmatter (2mo ago)
We effectively want an "@unroll_count()". If you have enough microarchitectural information in sys.info, you can get the rest from there.
sora (2mo ago)
To help with the InlineArray problem, yes. But for SIMD, I don't know what's needed. Any pass can be the place where the bloat happens.
Darkmatter (2mo ago)
What I'm considering is that, given branch predictors are generally pretty good, there aren't many reasons to unroll more than 16x the number of vector ALUs in a CPU.
sora (2mo ago)
What I know is that we used to have @unroll(n), where n is an unroll factor.
Darkmatter (2mo ago)
For a BIG CPU, I don't see more than 4 vector ALUs happening. I think that mapped to unroll_count inside of LLVM. You can always do it manually, but it's not great, and it makes the code messy.
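For reference, the manual version looks roughly like this (a sketch of hand-unrolling by a fixed factor of 4, which is what the old @unroll(4) expressed declaratively):

def main():
    var data = List[Int]()
    for i in range(16):
        data.append(i)

    # Hand-unrolled by 4: the factor is hard-coded and the length
    # must be a multiple of it, which is exactly why it gets messy.
    var total = 0
    for i in range(0, len(data), 4):
        total += data[i]
        total += data[i + 1]
        total += data[i + 2]
        total += data[i + 3]
    print(total)  # 120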
sora (2mo ago)
That's very fair. And I was confused in the beginning as well. Would you please also mention that in the issue? Thanks a lot!
Darkmatter (2mo ago)
However, the fact that large SIMD instances run out of memory even without loops is an issue.
sora (2mo ago)
Which is exactly why we moved away from it, IIUC. Such a count is a hint to LLVM, but I think they want to do it entirely, or almost entirely, deterministically in Mojo itself. I think you can find a record of this in the changelog.
Darkmatter (2mo ago)
That is perfectly good. But missing the feature is not great.
sora (2mo ago)
We will have that back some day in the future, I believe.
Darkmatter (2mo ago)
I think Mojo needs something like FlexSIMD which doesn't have the "power of 2" restriction and uses kernels that self-tune based on compile-time info to process in blocks. That can get treated like numpy arrays instead of mapping directly to hardware.
sora (2mo ago)
I don't think that's supported by LLVM, though? SIMD lowers to <n x type>, which is incapable of representing that.
franchesoni (OP, 2mo ago)
An equivalent to np.ndarray would be very nice; NDBuffer is not too intuitive.
Darkmatter (2mo ago)
FlexSIMD would be drain loops. It's effectively going to be an InlineArray[N, Scalar[dtype]], aligned to whatever is needed by the platform; then you write drain loops for everything. This means no support for some of the more exotic operations like Galois Field affine transformations, but the basics should work fine.
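A sketch of the drain-loop shape being described (the function name and the element-wise loads are illustrative, not a real FlexSIMD API):

from sys.info import simdwidthof

fn sum_f32(data: List[Float32]) -> Float32:
    alias width = simdwidthof[DType.float32]()
    # Main loop: accumulate full hardware-width chunks.
    var acc = SIMD[DType.float32, width](0)
    var i = 0
    while i + width <= len(data):
        var chunk = SIMD[DType.float32, width]()
        for j in range(width):
            chunk[j] = data[i + j]
        acc += chunk
        i += width
    # Drain loop: mop up the remaining tail elements one at a time.
    var total = acc.reduce_add()
    while i < len(data):
        total += data[i]
        i += 1
    return total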
sora (2mo ago)
I was talking about the LLVM type. Just checked, and my claim was wrong though: LLVM seems to support <vscale x n x i32> types, where vscale is not burned into the IR.
Darkmatter (2mo ago)
IIRC vscale is part of the architecture. That's the vector-width component of SVE and RISC-V V.
sora (2mo ago)
[image attachment]
sora (2mo ago)
Looks like it's "just a number".
Darkmatter (2mo ago)
Is arm vscale_range(1, 16)?
sora (2mo ago)
[image attachment]
Darkmatter (2mo ago)
That looks a lot like information from the vector size register that ARM and RISC-V have.
sora (2mo ago)
Yeah, since it has to lower to something concrete after ISel. But having this type representable gives me hope that we can support it in Mojo without much fuss.
Darkmatter (2mo ago)
That is good, and something I've been after, but that still won't handle the "I have a massive array I want to do operations with" case. If Mojo continues to advertise SIMD = np.array, people will expect to load large columnar datasets into a SIMD.
sora (2mo ago)
Yeah, we should improve the docs.
Firas (5w ago)
I am not sure about my answer, but this is what I think: one SIMD variable is like a hardware vector that you can use to perform a single instruction on multiple data, and the size of one SIMD is upper-bounded. You can check the value for your architecture using:

from sys.info import simdwidthof
alias simd_width = simdwidthof[DType.int32]()

Regarding your code: yes, we can do something like SIMD[DType.float32, size] with size >> simd_width, but I think that's something Mojo handles without us knowing ("abstraction"). Regarding your Matrix: you can look at the matrix multiplication example on their website and see how they utilize SIMD with the vectorize built-in optimization; also look at parallelize for using multiple processors.
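A small sketch of the vectorize pattern mentioned above, loosely following the shape of the matmul example from the Modular site (the exact vectorize parameter order and pointer load/store APIs have shifted across Mojo versions, so treat the details as assumptions):

from algorithm import vectorize
from memory import UnsafePointer
from sys.info import simdwidthof

def main():
    alias width = simdwidthof[DType.float32]()
    var n = 1000
    var buf = UnsafePointer[Float32].alloc(n)

    @parameter
    fn fill[w: Int](i: Int):
        # vectorize calls this with w == width for full chunks and a
        # smaller w for the tail, so n need not be a multiple of width.
        buf.store[width=w](i, SIMD[DType.float32, w](1.0))

    vectorize[fill, width](n)

    var total: Float32 = 0.0
    for i in range(n):
        total += buf[i]
    print(total)  # 1000.0
    buf.free()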