Modular•7mo ago

List of SIMD bug

The following simple code doesn't work. Is this because of memory? I find the documentation lacking: it only says SIMD sizes are restricted to powers of 2, which I comply with...

alias S: Int = 256
alias S2: Int = S*S
alias SquareMatrix = SIMD[size=S2]

def main():
    var sqms = List[SquareMatrix[DType.float32]]()
    for ind in range(2):  # using 1 instead of 2 gives no issues
        sqm = SquareMatrix[DType.float32]()
        sqms.append(sqm)
    print('success!')  # we never get here :(

46 Replies

sora•7mo ago

Looks like a bug. Could you please file an issue on Github?

franchesoniOP•7mo ago

sure

sora•7mo ago

Thanks! To be clear, I don’t think this code should work; it’s just that maybe the compiler could fail more gracefully.

Darkmatter•7mo ago

LLVM ERROR: SmallVector unable to grow. Requested capacity (18446744073709518848) is larger than maximum value for size type (4294967295)
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Running pass 'Function Pass Manager' on module ''.
1.      Running pass 'X86 DAG->DAG Instruction Selection' on function '@main'
Segmentation fault (core dumped)

It looks like it's trying to constant fold the loop.

franchesoniOP•7mo ago

Thanks for the responsiveness! it isn't clear to me from the docs why the memory is not enough for SIMD (while for a pointer it is enough)

Darkmatter•7mo ago

This is blowing up at compile time, not runtime. My guess is that a bunch of copies of the matrices are being made as part of optimizations, and you're overflowing some limit somewhere.

franchesoniOP•7mo ago

so it wasn't my fault? surprising...

Darkmatter•7mo ago

This also has the issue:

def main():
    print(SIMD[DType.float32, 65536]())

Even with the optimizer turned off.

sora•7mo ago

I don’t think people should use simd like this though

Darkmatter•7mo ago

It doesn't matter, the compiler is blowing up in a bad way here. The requested size is 0xffffffffffff8000

sora•7mo ago

It does. Compiler passes make certain assumptions about the thing they are trying to do. Or, to put it another way, an upper limit should be set.

franchesoniOP•7mo ago

this makes sense, but if the docs don't set what the intended purpose of SIMD is, you'll get more people like me

Darkmatter•7mo ago

If I run that through two's complement, it comes to 32768 (0x8000), which is a reasonable arena size. SIMD should be able to at least kind of handle that. Some of this is on Modular for advertising SIMD as equivalent to np.array.
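The two's-complement arithmetic mentioned above can be checked with a short sketch. It is written in Python rather than Mojo, and the only inputs are the two numbers copied from the LLVM error message; the helper name `as_signed_64` is made up for illustration. The result is consistent with a negative value being reinterpreted as an unsigned 64-bit size, although the thread does not confirm that as the root cause.

```python
# Values copied from the LLVM error message quoted in this thread.
requested = 18446744073709518848
size_type_max = 4294967295


def as_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a two's-complement signed one."""
    return value - (1 << 64) if value >= (1 << 63) else value


print(hex(requested))               # 0xffffffffffff8000
print(as_signed_64(requested))      # -32768
print(size_type_max == 2**32 - 1)   # True: the size type is 32 bits wide
```

The magnitude, 32768 (0x8000), matches the figure quoted in the message above.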

sora•7mo ago

If it generates bad code, then no. But SIMD is not an array; it's supposed to be a type you can roughly pass by register, which Chris said somewhere here.
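To put "pass by register" in perspective, here is a rough back-of-the-envelope sketch in Python. The 512-bit register width is an assumption (roughly an AVX-512-class machine), not something stated in the thread; the element count comes from the original post (S = 256, S2 = S*S).

```python
S = 256
elements = S * S                      # lanes in SquareMatrix: 65536
bytes_per_float32 = 4
total_bytes = elements * bytes_per_float32

# Assumption: a 512-bit hardware vector register.
ASSUMED_REGISTER_BITS = 512
lanes_per_register = ASSUMED_REGISTER_BITS // 32     # float32 lanes per register
registers_needed = elements // lanes_per_register

print(total_bytes)        # 262144 bytes, i.e. 256 KiB per value
print(registers_needed)   # 4096 registers' worth of data
```

Under that assumption a single SquareMatrix value is thousands of times larger than one hardware vector, which is why it cannot behave like a register-sized type.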

Darkmatter•7mo ago

I mean it shouldn't crash, not that it should generate optimal code for this. Probably an "oversized SIMD" lint is in order.

sora•7mo ago

I mean it shouldn't even try to generate code for this. The passes that give you the most performance are usually also the passes that have very, very bad asymptotic behaviour, so some artificial bound needs to be put on the problem size. My guess is that the compiler engineers just got lazy and rolled with those assumptions, kind of like how they handled UB. Another example: our InlineArray type will always unroll the v in a check; the compile time will blow up and possibly end in OOM or something similar.

Darkmatter•7mo ago

Really what needs to happen is for @register_passable to let you specify some formula for when it isn't any more, such as @register_passable(< simdbitwidth() * 4).

sora•7mo ago

We still did it because we don't know how to control unroll factor in the stdlib

Darkmatter•7mo ago

There is a point past which you blow your icache and unrolling is no longer wanted.

sora•7mo ago

I'm not even talking about the quality of the generated code; just the fact that we are trying to generate code in this scenario is already bad. I think the SIMD problem is first and foremost this case. We won't even reach the point of talking about generated code, because the IR is already too big to process.

Darkmatter•7mo ago

We effectively want an "@unroll_count()". If you have enough microarchitectural information in sys.info, you can get the rest from there.

sora•7mo ago

To help with the InlineArray problem, yes. But for SIMD, I don't know what's needed. Any pass can be the place where the bloat happens.

Darkmatter•7mo ago

What I'm considering is that, given branch predictors are generally pretty good, there aren't many reasons to unroll more than 16x the number of vector ALUs in a CPU.

sora•7mo ago

What I know is that we used to have @unroll(n), where n is an unroll factor.

Darkmatter•7mo ago

For a BIG CPU, I don't see more than 4 vector ALUs happening. I think that mapped to unroll_count inside of LLVM. You can always do it manually, but it's not great, and it makes the code messy.
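As an illustration of what "doing it manually" looks like, here is a minimal sketch of a reduction unrolled by a fixed factor of 4, written in Python only to show the shape of the code. The function name and the factor are hypothetical; nothing here is a Mojo or LLVM API.

```python
def sum_unrolled_by_4(values):
    """Sum a sequence with the loop body manually unrolled four times."""
    acc0 = acc1 = acc2 = acc3 = 0.0
    n = len(values)
    main_end = n - (n % 4)            # largest multiple of 4 that fits

    # Unrolled main loop: four independent accumulators per iteration.
    for i in range(0, main_end, 4):
        acc0 += values[i]
        acc1 += values[i + 1]
        acc2 += values[i + 2]
        acc3 += values[i + 3]

    total = acc0 + acc1 + acc2 + acc3

    # Leftover elements that did not fill a whole group of four.
    for i in range(main_end, n):
        total += values[i]
    return total


print(sum_unrolled_by_4(list(range(10))))   # 45.0
```

Even at a factor of 4 the body is repetitive and needs a separate remainder loop, which is the messiness the message above refers to.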

sora•7mo ago

That's very fair. And I was confused in the beginning as well. Would you please also mention that in the issue? Thanks a lot!

Darkmatter•7mo ago

However, the fact that large SIMD instances run out of memory even without loops is an issue.

sora•7mo ago

Which is exactly why we moved away from it, IIUC. Such a count is a hint to LLVM, but I think they want to do it entirely or almost entirely deterministically in Mojo itself. I think you can find a record of this in the changelog.

Darkmatter•7mo ago

That is perfectly good. But, missing the feature is not great.

sora•7mo ago

We will have that back some day in the future I believe

Darkmatter•7mo ago

I think Mojo needs something like FlexSIMD which doesn't have the "power of 2" restriction and uses kernels that self-tune based on compile-time info to process in blocks. That can get treated like numpy arrays instead of mapping directly to hardware.

sora•7mo ago

I don't think that's supported by LLVM though? SIMD lowers to <n x type> which is incapable of representing that

franchesoniOP•7mo ago

an equivalent to np ndarray would be very nice, ndbuffer is not too intuitive

Darkmatter•7mo ago

FlexSIMD would be drain loops. It's effectively going to be an InlineArray[N, Scalar[dtype]], aligned to whatever the platform needs; then you write drain loops for everything. This means no support for some of the more exotic operations like Galois Field Affine Transformations, but the basics should work fine.
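The drain-loop idea described above can be sketched as follows: process the data in fixed-width blocks, then "drain" whatever is left one element at a time, so the total length does not have to be a power of 2. This is a Python illustration only; `blocked_add` and its `width` parameter are hypothetical names, and FlexSIMD itself is a proposal from this thread, not an existing type.

```python
def blocked_add(a, b, width=8):
    """Element-wise add of two equal-length lists, processed in blocks of `width`."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    n = len(a)
    out = [0.0] * n
    full = n - (n % width)            # end of the last complete block

    # Main loop: each iteration stands in for one vector-wide operation.
    for start in range(0, full, width):
        stop = start + width
        out[start:stop] = [x + y for x, y in zip(a[start:stop], b[start:stop])]

    # Drain loop: handle the tail that does not fill a whole block.
    for i in range(full, n):
        out[i] = a[i] + b[i]
    return out


# Length 10 is not a power of 2: one block of 8, then 2 drained elements.
print(blocked_add(list(range(10)), [1] * 10))
```

In a real implementation the block width would presumably come from compile-time information about the target rather than a default argument.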

sora•7mo ago

I was talking about the LLVM type. I just checked and my claim was wrong, though: LLVM seems to support <vscale x n x i32> types, where vscale is not burned into the IR.

Darkmatter•7mo ago

IIRC vscale is part of the architecture. That's the vector width component of SVE and RISC-V V (RVV).
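A small sketch of how a scalable type's size depends on vscale, in Python for the arithmetic only. In LLVM IR a type such as <vscale x 4 x i32> holds vscale * 4 elements, with vscale fixed by the hardware at run time. The mapping to Arm SVE (vector lengths from 128 to 2048 bits in 128-bit steps) is taken from the architecture, not from this thread, so treat it as an assumption here; the function name is hypothetical.

```python
def scalable_vector_bits(vscale, min_elements=4, element_bits=32):
    """Total width in bits of <vscale x min_elements x iN> for a given vscale."""
    return vscale * min_elements * element_bits


# Assumption: on SVE, <vscale x 4 x i32> fills one vector register,
# so vscale runs from 1 (128-bit) to 16 (2048-bit).
for vscale in (1, 2, 4, 16):
    print(vscale, scalable_vector_bits(vscale))
```

The same IR therefore describes a 128-bit vector on one machine and a 2048-bit vector on another.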

franchesoniOP•7mo ago

https://discord.com/channels/1087530497313357884/1296578473493659688 I'll drop this new question here

sora•7mo ago

Looks like it's "just a number"

Darkmatter•7mo ago

Is arm vscale_range(1, 16)?

sora•7mo ago

Darkmatter•7mo ago

That looks a lot like information from the vector size register that ARM and RISCV have.

sora•7mo ago

Yeah, since it has to lower to something concrete after ISel. But having this type representable gives me hope that we can support it in Mojo without much fuss.

Darkmatter•7mo ago

That is good, and something I've been after, but that still won't handle the "I have a massive array I want to do operations with" case. If Mojo continues to advertise SIMD = np.array, people will expect to load large columnar datasets into a SIMD.

sora•7mo ago

Yea, we should improve the doc

Firas•6mo ago

I am not sure about my answer, but this is what I think: one SIMD variable is like a hardware vector that you can use to apply a single instruction to multiple data, and the maximum size of one SIMD value is bounded. You can check the value for your architecture using:

from sys.info import simdwidthof
alias simd_width = simdwidthof[DType.int32]()

Regarding your code: yes, we can do something like SIMD[type=DType.float32, size=size] with size >> simd_width, but I think that's something Mojo handles without us knowing ("abstraction"). Regarding your matrix: you can look at the matrix_mul_mojo_example on their website and see how they utilize SIMD through the built-in vectorize optimization. Also look at parallelize for spreading work across multiple processors...
