-
Enhancement
-
Resolution: Unresolved
-
P4
-
22
Here is a list of RFEs and bugs related to SuperWord / auto-vectorization:
------------------------------------ FIXED BUGS ---------------------------------------------------------------------
JDK-8332905: C2 SuperWord: bad AD file, with RotateRightV and first operand not a pack
JDK-8330819: C2 SuperWord: bad dominance after pre-loop limit adjustment with base that has CastLL after pre-loop
JDK-8316679: C2 SuperWord: wrong result, load should not be moved before store if not comparable
JDK-8316594: C2 SuperWord: wrong result with hand unrolled loops
JDK-8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs
(JDK-8311586, JDK-8309662, JDK-8303827)
JDK-8314612: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
JDK-8313720: C2 SuperWord: wrong result with -XX:+UseVectorCmov -XX:+UseCMoveUnconditionally
JDK-8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
JDK-8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs
JDK-8310130: C2: assert(false) failed: scalar_input is neither phi nor a matchin reduction
JDK-8309268: C2: "assert(in_bb(n)) failed: must be" after JDK-8306302
JDK-8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph
JDK-8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies
JDK-8340010: Fix vectorization tests with compact headers
JDK-8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures
------------------------------------ TODO BUGS ---------------------------------------------------------------------
JDK-8323582: C2 SuperWord AlignVector: misaligned vector memory access with Unsafe.allocateMemory
(Working on it. This will give us the infrastructure for aliasing analysis.)
------------------------------------ PROBABLY NEVER TO BE FIXED ------------------------------------------
Lilliput collateral damage:
JDK-8344424: C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector
------------------------------------ COMPLETED IMPROVEMENTS -----------------------------------------------
JDK-8317572: C2 SuperWord: refactor/improve VectorizeDebugOption and TraceSuperWord
JDK-8309267: C2 SuperWord: some tests fail on KNL machines - fail to vectorize
JDK-8302652: [SuperWord] Reduction should happen after loop, when possible
JDK-8308606: C2 SuperWord: remove alignment checks when not required
JDK-8308917: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
JDK-8260943: C2 SuperWord: Remove dead vectorization optimization added by 8076284
JDK-8318703: C2 SuperWord: take reduction nodes into account in early unrolling analysis
JDK-8325155: C2 SuperWord: remove alignment boundaries
JDK-8325541: C2 SuperWord: refactor filter / split
JDK-8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence)
JDK-8332163: C2 SuperWord: refactor PacksetGraph and SuperWord::output into VTransformGraph
Cleanup:
JDK-8309204: Obsolete DoReserveCopyInSuperWord
JDK-8323577: C2 SuperWord: remove AlignVector restrictions on IR tests added in JDK-8305055
JDK-8325159: C2 SuperWord: measure time for CITime
JDK-8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion
Testing / Benchmarking:
JDK-8329273: C2 SuperWord: some basic MemorySegment IR tests
JDK-8333647: C2 SuperWord: some additional PopulateIndex tests
JDK-8310308: IR Framework: check for type and size of vector nodes
JDK-8340272: C2 SuperWord: JMH benchmark for Reduction vectorization
JDK-8344118: C2 SuperWord: add VectorThroughputForIterationCount benchmark
JDK-8342387: C2 SuperWord: refactor and improve compiler/loopopts/superword/TestDependencyOffsets.java
------------------------------------ TODO VARIOUS IMPROVEMENTS ----------------------------------------------------------
JDK-8309908: C2 SuperWord: IGVN commute swap_edges can prevent vectorization
JDK-8308841: C2 SuperWord: implement vectorization of integer CMove
JDK-8303113: [SuperWord] investigate if enabling _do_vector_loop by default creates speedup
JDK-8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
JDK-8299808: ArrayFill should be preferred over unrolling
JDK-8332878: C2 SuperWord: improve PopulateIndex detection for L/F/D
JDK-8342095: Add autovectorizer support for subword vector casts
JDK-8307084: C2: Vector atomic post loop is not executed for some small trip counts
(Found by ARM, I hope they take this one up soon!)
JDK-8344085: C2 SuperWord: improve vectorization for small loop iteration count
JDK-8328678: C2: hand unrolled loops don't vectorize/unroll as well as loops unrolled by the compiler
Reductions
JDK-8343597: C2 SuperWord: RelaxedMath for faster float reductions
JDK-8345044: Sum of array elements not vectorized
(should be addressed by cost-model, see other comments below)
JDK-8345107: C2 SuperWord: implement polynomial reductions (for hashing)
More ideas: generalize to prefix-sum, scans, and even segmented scans. Probably this requires a cost-model. And maybe some prior transformations on the scalar graph?
JDK-8345245: C2 SuperWord: further improve latency after PhaseIdealLoop::move_unordered_reduction_out_of_loop
JDK-8345549: C2 SuperWord: prefix-sum
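To illustrate the polynomial reduction in JDK-8345107: the loop has the same shape as the one computed by Arrays.hashCode(int[]). The following is a minimal sketch, not C2 code; the class and method names are hypothetical. The second method shows, in scalar form, the kind of reassociation that breaks the serial dependency on h so that four elements are consumed per step. A vectorized version would have to exploit something along these lines, and whether that pays off is a cost-model question.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of the JDK.
class PolynomialReductionSketch {
    // Scalar polynomial reduction: every iteration depends on the previous h.
    static int hashScalar(int[] a) {
        int h = 1;
        for (int i = 0; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    // Same result, reassociated so that 4 elements are combined per step.
    // int arithmetic wraps around, so the reassociation is exact.
    static int hashBy4(int[] a) {
        final int p1 = 31, p2 = p1 * 31, p3 = p2 * 31, p4 = p3 * 31;
        int h = 1;
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            h = p4 * h + p3 * a[i] + p2 * a[i + 1] + p1 * a[i + 2] + a[i + 3];
        }
        // Tail loop for the remaining 0..3 elements.
        for (; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {3, -7, 11, 42, 5, 0, -1, 9, 13};
        System.out.println(hashScalar(a) == hashBy4(a));
        System.out.println(hashScalar(a) == Arrays.hashCode(a));
    }
}
```

The four multiplications in the unrolled body are independent of each other, which is the property a vector implementation needs.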
Tests:
JDK-8310891: C2 SuperWord tests: move platform requirements to IR rules
JDK-8310523: Add IR tests for nodes that have too few IR tests yet
JDK-8327671: C2 SuperWord: move all tests to test/hotspot/jtreg/compiler/autovectorization
IR Framework:
JDK-8320224: IR Framework: add MaxVectorSize to JTREG_WHITELIST_FLAGS
JDK-8309183: [IR Framework] Add UseKNLSetting to whitelist
JDK-8310533: [IR Framework] Add possibility to automatically verify that a test method always returns the same result
More Testing infrastructure:
JDK-8346106: Verify.checkEQ: testing utility for recursive value verification
JDK-8346107: Generators: testing utility for random value generation
JDK-8344942: Template-Based Testing Framework
------------------------------------ TODO MemorySegment ------------------------------------------------------------------
JDK-8330991: C2 SuperWord: refactor VPointer
JDK-8331576: C2 SuperWord: Unsafe access with long address that is a CastX2P does not vectorize
I'm first working on a more general MemPointer, which can also be used outside of loopopts.
My first target is the MergeStores optimization: JDK-8335392: C2 MergeStores: enhanced pointer parsing
JDK-8327209: C2 MemorySegment: missing RCE and vectorization
JDK-8324751: C2 SuperWord: Aliasing Analysis
JDK-8329077: C2: MemorySegment double accesses don't vectorize
JDK-8330274: C2 SuperWord: VPointer invar: same sum with different addition order should be equal
JDK-8331659: C2 SuperWord: investigate failed vectorization in compiler/loopopts/superword/TestMemorySegment.java
JDK-8343536: C2 SuperWord / MergeStores: investigate missing optimizations in MemorySegment examples
------------------------------------ TODO COST MODELING ------------------------------------------------------------------
JDK-8340093: C2 SuperWord: implement cost model
Systematically estimate the cost of the scalar vs vector loop.
This would be a better profitability heuristic than what we have now.
It would make it easier to estimate if reductions are profitable.
And it would allow us to estimate if vectorization is profitable with shuffles / insert / extract nodes,
which are additional operations: is their extra work outweighed by the vectorization gains?
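The comparison described above can be sketched as a toy calculation. This is only an illustration of the idea, not the design proposed in JDK-8340093: the class, the method names, and the abstract cost units are all assumptions made for the example.

```java
// Hypothetical toy model, not C2's cost model. Costs are abstract units.
class CostModelSketch {
    // Cost of one vector iteration, spread over the scalar iterations it replaces.
    // extraOpsCost stands for shuffles / inserts / extracts added by vectorization.
    static double vectorCostPerElement(double vectorOpsCost, double extraOpsCost, int lanes) {
        return (vectorOpsCost + extraOpsCost) / lanes;
    }

    // Vectorize only if the per-element vector cost beats the scalar cost.
    static boolean isProfitable(double scalarCostPerIteration,
                                double vectorOpsCost,
                                double extraOpsCost,
                                int lanes) {
        return vectorCostPerElement(vectorOpsCost, extraOpsCost, lanes) < scalarCostPerIteration;
    }

    public static void main(String[] args) {
        // 8 lanes, modest shuffle overhead: the vector loop wins in this model.
        System.out.println(isProfitable(3.0, 4.0, 2.0, 8));
        // 2 lanes, heavy extra operations: the scalar loop wins in this model.
        System.out.println(isProfitable(1.0, 4.0, 6.0, 2));
    }
}
```

With few lanes, the extra operations are amortized over very little work, which is why cases such as 2-element reductions need an explicit estimate rather than a fixed rule.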
Below are some issues related to cost-modeling:
JDK-8307516: C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
(goal: replace heuristics with cost-model)
JDK-8336000: Long::bitCount does not auto-vectorize on AArch64
(actually reports an issue with 2-element reductions: they are marked as not profitable in SuperWord::implemented, which must be re-evaluated)
https://www.elastic.co/search-labs/blog/articles/Vector%20Similarity%20Computations%20-%20ludicrous%20speed
Can we do this with auto-vectorization?
The embarrassing thing here is: even a simple dot-product did not vectorize (example with bytes)
JDK-8305717: SuperWord: Vectorization in opposite direction traversal cases
JDK-8305707: SuperWord should vectorize reverse-order reduction loops
(requires shuffles, and maybe reverse-order reductions in the backend?)
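For reference, the loop shapes mentioned in the last few items look roughly like this in Java. This is a minimal sketch with hypothetical names; it only shows the shapes and makes no claim about which of them a given JDK build vectorizes.

```java
// Hypothetical illustration class showing loop shapes only.
class CostModelLoopShapes {
    // Dot-product over bytes: elements are widened to int before the multiply-add.
    static int dotProduct(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Reverse-order reduction: the array is traversed from the end to the start.
    static int sumReverse(int[] a) {
        int sum = 0;
        for (int i = a.length - 1; i >= 0; i--) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(dotProduct(new byte[] {1, 2, 3}, new byte[] {4, 5, 6}));
        System.out.println(sumReverse(new int[] {1, 2, 3, 4}));
    }
}
```

Both are reductions, so their profitability depends on how the reduction and any widening or reordering operations are costed.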
-----------------------------------------------------------------------------------------------------
relates to: JDK-8325497 Investigate C2 issues identified by the "JVM Performance Comparison for JDK 21" (Open)