Enhancement | Resolution: Unresolved | P4 | 24
For a while, I have been bothered by the fact that float reductions cannot be vectorized. This is because they require a strict order of reduction, which prevents parallelization: the elements must be added/multiplied sequentially, otherwise the rounding errors can differ.
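To see why the order matters, note that float addition is not associative, so reassociating a sum can change the rounded result. The snippet below is only an illustration of that non-associativity, not code from this issue:

```java
public class FloatReorderDemo {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1e-3f;

        // Strict left-to-right order, as Java semantics require today.
        float strict = (a + b) + c;     // 0.0f + 1e-3f  ->  ~0.001

        // Reordered, as a vectorized reduction might effectively compute it:
        // the small term is absorbed when added to the large value first.
        float reordered = a + (b + c);  // 1e8f + (-1e8f) ->  0.0

        System.out.println(strict + " vs " + reordered);
    }
}
```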
With [~jrose] and [~darcy] we have been discussing how to best allow faster reductions for floats/doubles. I just want to allow faster reductions, for example for a fast sum or dot-product. My approach comes from an HPC/ML background where the exact precision of floats is not super important. Joe's background here leaned more toward reproducibility: by default a sum should always return the exact same value, and not different values depending on whether the compiler decided to optimize or not. Hence we agreed on this plan for now:
The 3 levels of work:
- Internal class "RelaxedMath", with static methods. Optimizations on the VM level that exploit their relaxed semantics. The semantics apply across all "similar" ops, so that we can reorder sums/reductions. Maybe we also experiment with a version that allows combining add and mul into an fma. (A rough sketch of such a class follows after this list.)
- Public API extensions to Collector and maybe Array. These are easier to write a clean spec for (sum with arbitrary reordering of inputs).
- Application in project Babylon: allow expression transformation of regular float-ops to relaxed float-ops for speedup, at the price of reproducibility of rounding errors.
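To make the first level more concrete, here is a rough sketch of what such an internal class could look like. Only the class name "RelaxedMath" and the shape of methods like add(float, float) come from this issue; the package, the @IntrinsicCandidate annotation, and the mul overloads are assumptions about how it might be wired up, and the method bodies are just the strict fallback used when the intrinsic is not applied.

```java
// Hypothetical sketch; package and annotation choices are assumptions.
package jdk.internal.math;

import jdk.internal.vm.annotation.IntrinsicCandidate;

public final class RelaxedMath {
    private RelaxedMath() {}

    // Semantically a + b, but the JIT may reassociate chains of relaxed adds
    // (e.g. into a vectorized reduction), so rounding can differ from a
    // strict left-to-right sum.
    @IntrinsicCandidate
    public static float add(float a, float b) {
        return a + b;   // strict fallback when not intrinsified
    }

    @IntrinsicCandidate
    public static double add(double a, double b) {
        return a + b;
    }

    // Possible companion op: a relaxed multiply, so that an add/mul pair
    // could later be fused into an fma as discussed above.
    @IntrinsicCandidate
    public static float mul(float a, float b) {
        return a * b;
    }

    @IntrinsicCandidate
    public static double mul(double a, double b) {
        return a * b;
    }
}
```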
My goal now is the first step:
- Introduce the class "RelaxedMath"
- Define methods like "RelaxedMath.add(float, float)"
- Intrinsify these methods: i.e. capture them as special IR nodes (e.g. RelaxedAddF).
- Optimize based on relaxed semantics (first just for SuperWord/AutoVectorization non-strict reductions; see the usage sketch after this list)
- Lower any scalar relaxed ops to strict ops -> use the same backend operations.
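As a usage sketch of what this first step could enable: a dot product written against the hypothetical RelaxedMath methods, which the auto-vectorizer would then be free to reduce out of order. Until the intrinsic kicks in, it simply computes the ordinary strict result.

```java
// Hypothetical usage of the sketched API; names are not final.
static float dotRelaxed(float[] x, float[] y) {
    float acc = 0.0f;
    for (int i = 0; i < x.length; i++) {
        // With relaxed semantics, SuperWord may turn this into a vectorized
        // reduction with partial sums accumulated in a different order.
        acc = RelaxedMath.add(acc, RelaxedMath.mul(x[i], y[i]));
    }
    return acc;
}
```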