
Type: Enhancement
Resolution: Unresolved
Priority: P4
Fix Version/s: tbd
Affects Version/s: 22
Component/s: hotspot
Labels:

Subcomponent: compiler

Here is a list of RFEs and bugs related to SuperWord / AutoVectorization.

You can also refer to my visual representation here:
https://eme64.github.io/blog/2025/01/01/AutoVectorization-Status.html

------------------------------------ FIXED BUGS ---------------------------------------------------------------------

~~JDK-8332905~~: C2 SuperWord: bad AD file, with RotateRightV and first operand not a pack
~~JDK-8330819~~: C2 SuperWord: bad dominance after pre-loop limit adjustment with base that has CastLL after pre-loop
~~JDK-8316679~~: C2 SuperWord: wrong result, load should not be moved before store if not comparable
~~JDK-8316594~~: C2 SuperWord: wrong result with hand unrolled loops
~~JDK-8310190~~: C2 SuperWord: AlignVector is broken, generates misaligned packs
(~~JDK-8311586~~, ~~JDK-8309662~~, ~~JDK-8303827~~)
~~JDK-8314612~~: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
~~JDK-8313720~~: C2 SuperWord: wrong result with -XX:+UseVectorCmov -XX:+UseCMoveUnconditionally
~~JDK-8306302~~: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
~~JDK-8298935~~: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs
~~JDK-8310130~~: C2: assert(false) failed: scalar_input is neither phi nor a matchin reduction
~~JDK-8309268~~: C2: "assert(in_bb(n)) failed: must be" after ~~JDK-8306302~~
~~JDK-8304720~~: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph
~~JDK-8304042~~: C2 SuperWord: schedule must remove packs with cyclic dependencies
~~JDK-8340010~~: Fix vectorization tests with compact headers
~~JDK-8334431~~: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures

------------------------------------ TODO BUGS ---------------------------------------------------------------------

~~JDK-8323582~~: C2 SuperWord AlignVector: misaligned vector memory access with Unsafe.allocateMemory
(Working on it. This will give us the infrastructure for Aliasing-Analysis.)

------------------------------------ PROBABLY NEVER TO BE FIXED ------------------------------------------

Lilliput collateral damage:
JDK-8344424: C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector

------------------------------------ COMPLETED IMPROVEMENTS -----------------------------------------------
~~JDK-8317572~~: C2 SuperWord: refactor/improve VectorizeDebugOption and TraceSuperWord
~~JDK-8309267~~: C2 SuperWord: some tests fail on KNL machines - fail to vectorize
~~JDK-8302652~~: [SuperWord] Reduction should happen after loop, when possible
~~JDK-8308606~~: C2 SuperWord: remove alignment checks when not required
~~JDK-8308917~~: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
~~JDK-8260943~~: C2 SuperWord: Remove dead vectorization optimization added by 8076284
~~JDK-8318703~~: C2 SuperWord: take reduction nodes into account in early unrolling analysis

~~JDK-8325155~~: C2 SuperWord: remove alignment boundaries
~~JDK-8325541~~: C2 SuperWord: refactor filter / split
~~JDK-8326139~~: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence)
~~JDK-8332163~~: C2 SuperWord: refactor PacksetGraph and SuperWord::output into VTransformGraph

Cleanup:
~~JDK-8309204~~: Obsolete DoReserveCopyInSuperWord
~~JDK-8323577~~: C2 SuperWord: remove AlignVector restrictions on IR tests added in ~~JDK-8305055~~
~~JDK-8325159~~: C2 SuperWord: measure time for CITime
~~JDK-8335628~~: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion

Testing / Benchmarking:
~~JDK-8329273~~: C2 SuperWord: some basic MemorySegment IR tests
~~JDK-8333647~~: C2 SuperWord: some additional PopulateIndex tests
~~JDK-8310308~~: IR Framework: check for type and size of vector nodes
~~JDK-8340272~~: C2 SuperWord: JMH benchmark for Reduction vectorization
~~JDK-8344118~~: C2 SuperWord: add VectorThroughputForIterationCount benchmark
~~JDK-8342387~~: C2 SuperWord: refactor and improve compiler/loopopts/superword/TestDependencyOffsets.java
JDK-8347545: C2 SuperWord: AutoVectorization benchmark to motivate future work

------------------------------------ TODO VARIOUS IMPROVEMENTS ----------------------------------------------------------

JDK-8309908: C2 SuperWord: IGVN commute swap_edges can prevent vectorization
JDK-8308841: C2 SuperWord: implement vectorization of integer CMove
JDK-8303113: [SuperWord] investigate if enabling _do_vector_loop by default creates speedup
~~JDK-8307513~~: C2: intrinsify Math.max(long,long) and Math.min(long,long)
JDK-8299808: ArrayFill should be preferred over unrolling
JDK-8332878: C2 SuperWord: improve PopulateIndex detection for L/F/D

JDK-8342095: Add autovectorizer support for subword vector casts

JDK-8307084: C2: Vector atomic post loop is not executed for some small trip counts
(Found by ARM, I hope they take this one up soon!)

JDK-8344085: C2 SuperWord: improve vectorization for small loop iteration count

JDK-8328678: C2: hand unrolled loops don't vectorize/unroll as well as loops unrolled by the compiler
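For illustration, here is a minimal sketch of what "hand unrolled" means in this context. The class and method names are made up for this example and are not taken from the issue. Both methods compute the same result, assuming all three arrays have the same length. The issue is about the second shape being optimized less well than the first.

```java
public class HandUnrolledShapes {
    // Rolled loop: the compiler picks the unroll factor itself.
    static void addRolled(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // The same computation, unrolled by 4 in the source code,
    // followed by a scalar loop for the remaining 0..3 elements.
    static void addHandUnrolled(int[] a, int[] b, int[] c) {
        int i = 0;
        for (; i + 3 < a.length; i += 4) {
            c[i]     = a[i]     + b[i];
            c[i + 1] = a[i + 1] + b[i + 1];
            c[i + 2] = a[i + 2] + b[i + 2];
            c[i + 3] = a[i + 3] + b[i + 3];
        }
        for (; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
    }
}
```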

Reductions
JDK-8343597: C2 SuperWord: RelaxedMath for faster float reductions
JDK-8345044: Sum of array elements not vectorized
(should be addressed by cost-model, see other comments below)
JDK-8345107: C2 SuperWord: implement polynomial reductions (for hashing)
More ideas: generalize to prefix-sum, scans, and even segmented scans. Probably this requires a cost-model. And maybe some prior transformations on the scalar graph?
JDK-8345245: C2 SuperWord: further improve latency after PhaseIdealLoop::move_unordered_reduction_out_of_loop
JDK-8345549: C2 SuperWord: prefix-sum
JDK-8255030: Vectorize equality comparison of some inline types: Even if the issue is about inline types, it can be applicable to other types as well (e.g. record Quadrilateral(int xA, int yA, int xB, int yB, int xC, int yC, int xD, int yD)). Inline types make objects flatter, expand the applicability of this (e.g. record Quadrilateral(Point! A, Point! B, Point! C, Point! D))
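As a rough illustration of the loop shapes named in the reduction items above (array sum, polynomial reduction for hashing, prefix sum), here is a sketch in plain Java. The class and method names are hypothetical and the loops are not copied from the linked issues. The polynomial case uses the same recurrence as Arrays.hashCode(int[]).

```java
public class ReductionShapes {
    // Simple add reduction over an array.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Polynomial reduction: h = 31 * h + a[i], as used for hashing.
    // Each iteration depends on the previous value of h.
    static int polynomialHash(int[] a) {
        int h = 1;
        for (int i = 0; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    // Inclusive prefix sum: out[i] = a[0] + ... + a[i].
    // Unlike a plain reduction, every intermediate value is stored.
    static int[] prefixSum(int[] a) {
        int[] out = new int[a.length];
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
            out[i] = s;
        }
        return out;
    }
}
```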

Tests:
JDK-8310891: C2 SuperWord tests: move platform requirements to IR rules
JDK-8310523: Add IR tests for nodes that have too few IR tests yet
~~JDK-8327671~~: C2 SuperWord: move all tests to test/hotspot/jtreg/compiler/autovectorization

IR Framework:
JDK-8320224: IR Framework: add MaxVectorSize to JTREG_WHITELIST_FLAGS
JDK-8309183: [IR Framework] Add UseKNLSetting to whitelist
JDK-8310533: [IR Framework] Add possibility to automatically verify that a test method always returns the same result

More Testing infrastructure:
~~JDK-8346106~~: Verify.checkEQ: testing utility for recursive value verification
~~JDK-8346107~~: Generators: testing utility for random value generation
~~JDK-8344942~~: Template-Based Testing Framework

------------------------------------ TODO MemorySegment ------------------------------------------------------------------

~~JDK-8330991~~: C2 SuperWord: refactor VPointer
JDK-8331576: C2 SuperWord: Unsafe access with long address that is a CastX2P does not vectorize

I'm first working on a more general MemPointer, which can also be used outside of loopopts.
My first target is the MergeMem optimization: ~~JDK-8335392~~: C2 MergeStores: enhanced pointer parsing

JDK-8327209: C2 MemorySegment: missing RCE and vectorization
JDK-8324751: C2 SuperWord: Aliasing Analysis
JDK-8329077: C2: MemorySegment double accesses don't vectorize
~~JDK-8330274~~: C2 SuperWord: VPointer invar: same sum with different addition order should be equal
~~JDK-8331659~~: C2 SuperWord: investigate failed vectorization in compiler/loopopts/superword/TestMemorySegment.java
JDK-8343536: C2 SuperWord / MergeStores: investigate missing optimizations in MemorySegment examples

------------------------------------ TODO COST MODELING ------------------------------------------------------------------
JDK-8340093: C2 SuperWord: implement cost model
Systematically estimate the cost of the scalar vs vector loop.
This would be a better profitability heuristic than what we have now.
It would make it easier to estimate if reductions are profitable.
And it would allow us to estimate if vectorization is profitable with shuffles / insert / extract nodes,
which are additional operations: is their extra work outweighed by the vectorization gains?
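To make the profitability question concrete, here is a toy sketch. It is not the C2 implementation and not the design proposed in JDK-8340093. It assumes, purely for illustration, that every operation costs one unit, and all names and parameters are hypothetical.

```java
public class ToyCostModel {
    // Compares the cost of processing `lanes` elements with scalar code
    // against one vector iteration that needs some extra shuffle /
    // insert / extract operations. All costs are hypothetical units.
    static boolean vectorIsProfitable(int opsPerIteration,
                                      int lanes,
                                      int extraOpsPerVectorIteration) {
        // Scalar code repeats every operation once per element.
        int scalarCost = opsPerIteration * lanes;
        // Vector code does each operation once, plus the extra work.
        int vectorCost = opsPerIteration + extraOpsPerVectorIteration;
        return vectorCost < scalarCost;
    }
}
```

With these made-up unit costs, 2 operations on 4 lanes with 3 extra operations would count as profitable (5 < 8), while 1 operation on 2 lanes with 2 extra operations would not (3 > 2). A real cost model would need per-node and per-platform costs.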

Below are some issues that are related to cost-modeling:
JDK-8307516: C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
(goal: replace heuristics with cost-model)

JDK-8336000: Long::bitCount does not auto-vectorize on AArch64
(actually reports an issue with 2-element reductions: they are marked as not profitable in SuperWord::implemented, which must be re-evaluated)

https://www.elastic.co/search-labs/blog/articles/Vector%20Similarity%20Computations%20-%20ludicrous%20speed
Can we do this with auto-vectorization?
The embarrassing thing here is: even a simple dot-product did not vectorize (example with bytes)
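A simple dot-product over bytes could look like the following sketch. This is an assumed shape for illustration, not code taken from the linked article, and the class name is hypothetical. Note that the byte inputs are widened to int before the multiplication, so the loop mixes a narrow load type with a wider accumulator type.

```java
public class ByteDotProduct {
    // Dot product of two byte arrays, assumed to have the same length.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            // Both operands are promoted from byte to int here.
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```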

JDK-8305717: SuperWord: Vectorization in opposite direction traversal cases
JDK-8305707: SuperWord should vectorize reverse-order reduction loops
(requires shuffles, and maybe reverse-order reductions in the backend?)
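The two loop shapes could look roughly like this. This is a sketch with hypothetical names, the exact loops in the issues may differ. In the first method the load index decreases while the store index increases, so a vector load would have to reverse its lanes before the store. In the second the reduction walks the array backward.

```java
public class ReverseOrderShapes {
    // Opposite direction traversal: read src backward, write dst forward.
    // Assumes dst is at least as long as src.
    static void reverseCopy(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[src.length - 1 - i];
        }
    }

    // Reverse-order reduction: the loop counts down.
    static int sumBackward(int[] a) {
        int s = 0;
        for (int i = a.length - 1; i >= 0; i--) {
            s += a[i];
        }
        return s;
    }
}
```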
-----------------------------------------------------------------------------------------------------

BIG GOAL

JDK-8347116: C2 SuperWord: If-Conversion
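As a conceptual sketch only: if-conversion turns control flow inside the loop body into data flow, so that the body becomes straight-line code. The example below uses hypothetical names and does not describe how JDK-8347116 would be implemented. The first method has a branch in the loop body, the second expresses the same computation as a per-element select, which is the form that a vector compare plus blend could represent.

```java
public class IfConversionShapes {
    // Loop body with control flow: the store happens on one of two paths.
    static void clampWithBranch(int[] a, int[] out, int limit) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > limit) {
                out[i] = limit;
            } else {
                out[i] = a[i];
            }
        }
    }

    // Same result written as a select: compute the condition,
    // pick one of two values, and store unconditionally.
    static void clampWithSelect(int[] a, int[] out, int limit) {
        for (int i = 0; i < a.length; i++) {
            int v = a[i];
            boolean tooLarge = v > limit;
            out[i] = tooLarge ? limit : v;
        }
    }
}
```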

------------------------------------ VALHALLA ------------------------------------------------------------------
JDK-8253160: C2's superword optimization should vectorize flat inline type array accesses

relates to

JDK-8325497 Investigate C2 issues identified by the "JVM Performance Comparison for JDK 21"

Open
