JDK-8317424

C2 SuperWord Umbrella: improvements

      Here is a list of RFEs and bugs related to SuperWord / auto-vectorization.

      ------------------------------------ FIXED BUGS ---------------------------------------------------------------------

      JDK-8332905: C2 SuperWord: bad AD file, with RotateRightV and first operand not a pack
      JDK-8330819: C2 SuperWord: bad dominance after pre-loop limit adjustment with base that has CastLL after pre-loop
      JDK-8316679: C2 SuperWord: wrong result, load should not be moved before store if not comparable
      JDK-8316594: C2 SuperWord: wrong result with hand unrolled loops
      JDK-8310190: C2 SuperWord: AlignVector is broken, generates misaligned packs
      (JDK-8311586, JDK-8309662, JDK-8303827)
      JDK-8314612: TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector
      JDK-8313720: C2 SuperWord: wrong result with -XX:+UseVectorCmov -XX:+UseCMoveUnconditionally
      JDK-8306302: C2 Superword fix: use VectorMaskCmp and VectorBlend instead of CMoveVF/D
      JDK-8298935: fix independence bug in create_pack logic in SuperWord::find_adjacent_refs
      JDK-8310130: C2: assert(false) failed: scalar_input is neither phi nor a matchin reduction
      JDK-8309268: C2: "assert(in_bb(n)) failed: must be" after JDK-8306302
      JDK-8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph
      JDK-8304042: C2 SuperWord: schedule must remove packs with cyclic dependencies
      JDK-8340010: Fix vectorization tests with compact headers
      JDK-8334431: C2 SuperWord: fix performance regression due to store-to-load-forwarding failures

      ------------------------------------ TODO BUGS ---------------------------------------------------------------------

      JDK-8323582: C2 SuperWord AlignVector: misaligned vector memory access with Unsafe.allocateMemory
      (Working on it. This will give us the infrastructure for aliasing analysis.)

      ------------------------------------ PROBABLY NEVER TO BE FIXED ------------------------------------------

      Lilliput collateral damage:
      JDK-8344424: C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector

      ------------------------------------ COMPLETED IMPROVEMENTS -----------------------------------------------
      JDK-8317572: C2 SuperWord: refactor/improve VectorizeDebugOption and TraceSuperWord
      JDK-8309267: C2 SuperWord: some tests fail on KNL machines - fail to vectorize
      JDK-8302652: [SuperWord] Reduction should happen after loop, when possible
      JDK-8308606: C2 SuperWord: remove alignment checks when not required
      JDK-8308917: C2 SuperWord::output: assert before bailout with CountedLoopReserveKit
      JDK-8260943: C2 SuperWord: Remove dead vectorization optimization added by 8076284
      JDK-8318703: C2 SuperWord: take reduction nodes into account in early unrolling analysis

      JDK-8325155: C2 SuperWord: remove alignment boundaries
      JDK-8325541: C2 SuperWord: refactor filter / split
      JDK-8326139: C2 SuperWord: split packs (match use/def packs, implemented, mutual independence)
      JDK-8332163: C2 SuperWord: refactor PacksetGraph and SuperWord::output into VTransformGraph

      Cleanup:
      JDK-8309204: Obsolete DoReserveCopyInSuperWord
      JDK-8323577: C2 SuperWord: remove AlignVector restrictions on IR tests added in JDK-8305055
      JDK-8325159: C2 SuperWord: measure time for CITime
      JDK-8335628: C2 SuperWord: cleanup: remove SuperWord::longer_type_for_conversion

      Testing / Benchmarking:
      JDK-8329273: C2 SuperWord: some basic MemorySegment IR tests
      JDK-8333647: C2 SuperWord: some additional PopulateIndex tests
      JDK-8310308: IR Framework: check for type and size of vector nodes
      JDK-8340272: C2 SuperWord: JMH benchmark for Reduction vectorization
      JDK-8344118: C2 SuperWord: add VectorThroughputForIterationCount benchmark
      JDK-8342387: C2 SuperWord: refactor and improve compiler/loopopts/superword/TestDependencyOffsets.java

      ------------------------------------ TODO VARIOUS IMPROVEMENTS ----------------------------------------------------------

      JDK-8309908: C2 SuperWord: IGVN commute swap_edges can prevent vectorization
      JDK-8308841: C2 SuperWord: implement vectorization of integer CMove
      JDK-8303113: [SuperWord] investigate if enabling _do_vector_loop by default creates speedup
      JDK-8307513: C2: intrinsify Math.max(long,long) and Math.min(long,long)
      JDK-8299808: ArrayFill should be preferred over unrolling
      JDK-8332878: C2 SuperWord: improve PopulateIndex detection for L/F/D

      JDK-8342095: Add autovectorizer support for subword vector casts

      JDK-8307084: C2: Vector atomic post loop is not executed for some small trip counts
      (Found by ARM, I hope they take this one up soon!)

      JDK-8344085: C2 SuperWord: improve vectorization for small loop iteration count


      JDK-8328678: C2: hand unrolled loops don't vectorize/unroll as well as loops unrolled by the compiler

      Reductions
      JDK-8343597: C2 SuperWord: RelaxedMath for faster float reductions
      JDK-8345044: Sum of array elements not vectorized
      (should be addressed by cost-model, see other comments below)
      JDK-8345107: C2 SuperWord: implement polynomial reductions (for hashing)
      More ideas: generalize to prefix-sum, scans, and even segmented scans. Probably this requires a cost-model. And maybe some prior transformations on the scalar graph?
      JDK-8345245: C2 SuperWord: further improve latency after PhaseIdealLoop::move_unordered_reduction_out_of_loop
      JDK-8345549: C2 SuperWord: prefix-sum
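
      For orientation, the scalar loop shapes that the reduction issues above refer to can be sketched as follows. This is only an illustration: the class and method names are made up here and are not taken from the linked issues, and whether C2 vectorizes any of these loops depends on the JDK build and the CPU.

```java
// Illustrative scalar loops only (hypothetical names, not taken from the issues).
final class ReductionShapes {
    // Plain sum reduction: a single accumulator is carried across iterations.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    // Polynomial reduction, the shape used by hash codes: h = 31 * h + a[i].
    static int polynomialHash(int[] a) {
        int h = 1;
        for (int i = 0; i < a.length; i++) {
            h = 31 * h + a[i];
        }
        return h;
    }

    // Inclusive prefix sum: every iteration stores the running total.
    static int[] prefixSum(int[] a) {
        int[] out = new int[a.length];
        int running = 0;
        for (int i = 0; i < a.length; i++) {
            running += a[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sum(data));
        System.out.println(polynomialHash(data));
        System.out.println(java.util.Arrays.toString(prefixSum(data)));
    }
}
```

      The sum only needs its final value, while the prefix sum needs every intermediate value, which is one reason the latter is listed as a separate, harder item.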

      Tests:
      JDK-8310891: C2 SuperWord tests: move platform requirements to IR rules
      JDK-8310523: Add IR tests for nodes that have too few IR tests yet
      JDK-8327671: C2 SuperWord: move all tests to test/hotspot/jtreg/compiler/autovectorization

      IR Framework:
      JDK-8320224: IR Framework: add MaxVectorSize to JTREG_WHITELIST_FLAGS
      JDK-8309183: [IR Framework] Add UseKNLSetting to whitelist
      JDK-8310533: [IR Framework] Add possibility to automatically verify that a test method always returns the same result

      More Testing infrastructure:
      JDK-8346106: Verify.checkEQ: testing utility for recursive value verification
      JDK-8346107: Generators: testing utility for random value generation
      JDK-8344942: Template-Based Testing Framework

      ------------------------------------ TODO MemorySegment ------------------------------------------------------------------

      JDK-8330991: C2 SuperWord: refactor VPointer
      JDK-8331576: C2 SuperWord: Unsafe access with long address that is a CastX2P does not vectorize

      I'm first working on a more general MemPointer, which can also be used outside of loopopts.
      My first target is the MergeStores optimization: JDK-8335392: C2 MergeStores: enhanced pointer parsing

      JDK-8327209: C2 MemorySegment: missing RCE and vectorization
      JDK-8324751: C2 SuperWord: Aliasing Analysis
      JDK-8329077: C2: MemorySegment double accesses don't vectorize
      JDK-8330274: C2 SuperWord: VPointer invar: same sum with different addition order should be equal
      JDK-8331659: C2 SuperWord: investigate failed vectorization in compiler/loopopts/superword/TestMemorySegment.java
      JDK-8343536: C2 SuperWord / MergeStores: investigate missing optimizations in MemorySegment examples

      ------------------------------------ TODO COST MODELING ------------------------------------------------------------------
      JDK-8340093: C2 SuperWord: implement cost model
      Systematically estimate the cost of the scalar vs vector loop.
      This would be a better profitability heuristic than what we have now.
      It would make it easier to estimate if reductions are profitable.
      And it would allow us to estimate if vectorization is profitable with shuffles / insert / extract nodes,
      which are additional operations: is their extra work outweighed by the vectorization gains?
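
      To make the comparison concrete, here is a toy sketch of the kind of scalar-vs-vector estimate described above. Everything in it is an assumption made for illustration: the class name, the method names and the unit costs are hypothetical and say nothing about how C2 does or will implement a cost model.

```java
// Toy profitability estimate. All names and unit costs are hypothetical.
final class ToyCostModel {
    // Assumed per-operation costs in arbitrary units, not measured on any CPU.
    static final double SCALAR_OP = 1.0;
    static final double VECTOR_OP = 1.0;
    static final double SHUFFLE_OP = 2.0; // shuffle / insert / extract

    // Cost of processing `lanes` elements with scalar code.
    static double scalarCost(int opsPerElement, int lanes) {
        return opsPerElement * lanes * SCALAR_OP;
    }

    // Cost of processing the same elements with one vector iteration,
    // including the extra shuffle-like operations vectorization requires.
    static double vectorCost(int opsPerElement, int extraShuffles) {
        return opsPerElement * VECTOR_OP + extraShuffles * SHUFFLE_OP;
    }

    // Vectorization pays off only if the vector estimate is strictly cheaper.
    static boolean isProfitable(int opsPerElement, int lanes, int extraShuffles) {
        return vectorCost(opsPerElement, extraShuffles) < scalarCost(opsPerElement, lanes);
    }

    public static void main(String[] args) {
        System.out.println(isProfitable(3, 8, 2)); // wide vector, few shuffles
        System.out.println(isProfitable(1, 2, 1)); // 2 lanes, one shuffle
    }
}
```

      With these made-up numbers, an 8-lane loop with two extra shuffles comes out profitable, while a 2-lane loop with one extra shuffle does not. The point is only that both outcomes fall out of a single comparison instead of separate heuristics.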

      Below are some issues that are related to cost-modeling:
      JDK-8307516: C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction
      (goal: replace heuristics with cost-model)

      JDK-8336000: Long::bitCount does not auto-vectorize on AArch64
      (actually reports an issue with 2-element reductions: they are marked as not profitable in SuperWord::implemented, which must be re-evaluated)

      https://www.elastic.co/search-labs/blog/articles/Vector%20Similarity%20Computations%20-%20ludicrous%20speed
      Can we do this with auto-vectorization?
      The embarrassing thing here is: even a simple dot-product did not vectorize (example with bytes)
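
      A byte dot-product of the kind mentioned here might look like the sketch below. The code is not taken from the linked article; the class and method names are made up, and both arrays are assumed to have the same length.

```java
// Illustrative byte dot-product (hypothetical names, not from the article).
final class ByteDotProduct {
    // Each byte is widened to int before the multiply, and the int
    // accumulator forms a reduction across iterations.
    // Assumes a.length == b.length.
    static int dot(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {4, 5, 6};
        System.out.println(dot(a, b));
    }
}
```

      The loop mixes a subword input type (byte) with an int reduction, so it touches both the mixed-type and the reduction items on this list.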


      JDK-8305717: SuperWord: Vectorization in opposite direction traversal cases
      JDK-8305707: SuperWord should vectorize reverse-order reduction loops
      (requires shuffles, and maybe reverse-order reductions in the backend?)
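
      As a guess at the loop shapes these two titles describe (the examples are not taken from the issues, and the names are hypothetical), consider a copy that reads one array forwards and writes the other backwards, and a reduction that walks its input from the last element to the first:

```java
// Illustrative reverse-order loops (hypothetical names, not from the issues).
final class ReverseOrderLoops {
    // Opposite-direction traversal: a is read with increasing index while
    // out is written with decreasing index, so a vector load would have to
    // be reversed (shuffled) before the vector store.
    static int[] reversedCopy(int[] a) {
        int[] out = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            out[a.length - 1 - i] = a[i];
        }
        return out;
    }

    // Reverse-order reduction: the accumulator consumes elements from the
    // end of the array towards the beginning.
    static int reverseSum(int[] a) {
        int s = 0;
        for (int i = a.length - 1; i >= 0; i--) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(java.util.Arrays.toString(reversedCopy(data)));
        System.out.println(reverseSum(data));
    }
}
```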
      -----------------------------------------------------------------------------------------------------

            epeter Emanuel Peter