In-place ReLU operation on a matrix struct in memory
What would be the optimal portable way to code the ReLU operation in Mojo on a matrix struct sitting in memory (e.g. the struct you use in your matmul example), noting that TPUs, GPUs, and CPUs may or may not have dedicated ReLU arithmetic blocks on-chip?
I'll answer my own question and see whether anybody can simply confirm it is indeed the way to go. Checking the CUDA code behind Relu(x) in several popular ML frameworks always seems to lead to fmaxf(x, 0), which of course is simply what ReLU is mathematically. So there is probably no magic dialect implementation that will be faster than this math op on dedicated hardware such as a TPU, and I simply need to do a vectorized SIMD max(x, 0) in Mojo too.
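For concreteness, here is a minimal sketch of what I mean, assuming the Matrix struct from the matmul example (row-major float32 data with `load[nelts]`/`store[nelts]` SIMD helpers on it) and the `vectorize` helper from `algorithm`. The exact names and the `vectorize` signature have shifted across Mojo releases, so treat this as illustrative rather than definitive:

```mojo
from algorithm import vectorize
from sys.info import simdwidthof

# SIMD width for float32 on the target machine.
alias nelts = simdwidthof[DType.float32]()

# In-place ReLU over a Matrix (assumed: the matmul-example struct with
# rows/cols fields and SIMD load/store helpers).
fn relu_inplace(A: Matrix):
    for i in range(A.rows):
        @parameter
        fn relu_row[width: Int](j: Int):
            # Elementwise max(x, 0) on one SIMD chunk of the row.
            A.store[width](i, j, max(A.load[width](i, j), 0))
        # vectorize handles the tail when cols isn't a multiple of nelts.
        vectorize[relu_row, nelts](A.cols)
```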