Unnecessary NaN checks: performance issue or missing compile options?
I'm not sure whether this is a performance issue or a feature request, so I figured I'd ask here first.
The issue is a performance regression caused by unnecessary NaN checks in (e.g.) max and min operations.
+298 and +306: load data0 and data1.
+314: compute the maximum of zmm0 and zmm2 and store the result in zmm1.
+320: set mask register k1 for the lanes where zmm0 (data0) contains NaN values.
+327: for the lanes where zmm0 was NaN, overwrite the result (zmm1) with the value of data1 (zmm2).
+333: write the result back to memory.

If data0 could contain NaN values, the above assembly would be correct. But when data0 is known not to contain such values, the code has a performance regression, because a NaN check is performed for every float min/max operation. This is something I would like to control in HPC AI workloads.

Q: Is this a regression bug or something else (for which I need to make a feature request)?
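To make the walkthrough of the assembly concrete, here is a rough scalar model of a single lane, written in Python because it is easy to run. The function names are made up for illustration, and the operand order passed to the raw max is an assumption, since only the register roles are described above. The model relies on the documented behaviour of the x86 packed max, which returns its second source operand whenever either input is NaN.

```python
import math


def x86_max(src1: float, src2: float) -> float:
    """One lane of a packed x86 max: yields src2 whenever the comparison
    is false, which includes every case where an input is NaN."""
    return src1 if src1 > src2 else src2


def lowered_maxnum(d0: float, d1: float) -> float:
    """Hypothetical scalar model of the +314 / +320 / +327 sequence."""
    result = x86_max(d1, d0)      # +314: raw max (operand order assumed)
    d0_is_nan = math.isnan(d0)    # +320: mask bit is set when d0 is NaN
    if d0_is_nan:                 # +327: masked move overwrites with d1
        result = d1
    return result
```

When d0 is never NaN, the mask is always clear and the steps at +320 and +327 do no useful work, which is the extra cost described above.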
5 Replies
Did some more digging:
is equivalent to (i.e., generates exactly the same assembly as):
The semantics of llvm.maxnum dictate the observed NaN behaviour:
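As a small illustration (a sketch in Python, not Mojo or LLVM code): llvm.maxnum returns the non-NaN operand when exactly one operand is NaN. The two helper functions below are hypothetical names. They compare that rule with a plain compare-and-select, which is what the check-free code would amount to.

```python
import math


def maxnum(a: float, b: float) -> float:
    """Reference model: if exactly one operand is NaN, return the other."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return a if a > b else b


def max_assuming_no_nan(a: float, b: float) -> float:
    """Plain compare-and-select with no NaN handling at all."""
    return a if a > b else b


# On NaN-free inputs the two functions agree, so the extra check is redundant.
samples = [-2.5, -0.0, 0.0, 1.0, 3.5, float("inf")]
agree = all(
    maxnum(a, b) == max_assuming_no_nan(a, b) for a in samples for b in samples
)
```

The two only diverge once a NaN shows up, so a guarantee that the inputs are NaN-free would in principle let the compiler drop the check.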
Q: How can I tell Mojo that the parameters of max are greater than zero (and thus not NaN), or how do I translate the following LLVM into intrinsics:
llvm.umax?
That's for integers.
You're right, my mistake.
Did some more research and tried to solve the issue in MLIR by adding a 'fast' flag.
This works, but only when the data (d0 and d1) has constant values.
Does anyone know how to get the above code to work?
The first two binds somehow yield null values: