JDK / JDK-8344424

C2 SuperWord: mixed type loops do not vectorize with UseCompactObjectHeaders and AlignVector


      I'm filing this as a bug, not an RFE, because it could cause a performance regression with UseCompactObjectHeaders if that flag leaves experimental status or becomes the default. The regression would only affect machines that require strict alignment (see AlignVector and Matcher::misaligned_vectors_ok).

      ------------------------------------------------------------------------------

      JDK-8305895 added UseCompactObjectHeaders, which changed the offset from object base to array payload:

      -XX:-UseCompactObjectHeaders
      UNSAFE.ARRAY_BYTE_BASE_OFFSET = 16
      UNSAFE.ARRAY_SHORT_BASE_OFFSET = 16
      UNSAFE.ARRAY_CHAR_BASE_OFFSET = 16
      UNSAFE.ARRAY_INT_BASE_OFFSET = 16
      UNSAFE.ARRAY_LONG_BASE_OFFSET = 16
      UNSAFE.ARRAY_FLOAT_BASE_OFFSET = 16
      UNSAFE.ARRAY_DOUBLE_BASE_OFFSET = 16

      -XX:+UseCompactObjectHeaders
      UNSAFE.ARRAY_BYTE_BASE_OFFSET = 12
      UNSAFE.ARRAY_SHORT_BASE_OFFSET = 12
      UNSAFE.ARRAY_CHAR_BASE_OFFSET = 12
      UNSAFE.ARRAY_INT_BASE_OFFSET = 12
      UNSAFE.ARRAY_LONG_BASE_OFFSET = 16
      UNSAFE.ARRAY_FLOAT_BASE_OFFSET = 12
      UNSAFE.ARRAY_DOUBLE_BASE_OFFSET = 16

      ---------------------------------------------------------------------

      On platforms that require strict alignment, we require 8-byte alignment for all vector loads/stores. One might think that alignment to the full vector width is required, but it turns out that 8-byte alignment is sufficient. Relevant code:

      src/hotspot/share/opto/vectorization.hpp: static bool vectors_should_be_aligned() { return !Matcher::misaligned_vectors_ok() || AlignVector; }



      src/hotspot/cpu/x86/matcher_x86.hpp:
        // x86 supports misaligned vectors store/load.
        static constexpr bool misaligned_vectors_ok() {
          return true;
        }

      src/hotspot/cpu/ppc/matcher_ppc.hpp:
        // PPC implementation uses VSX load/store instructions (if
        // SuperwordUseVSX) which support 4 byte but not arbitrary alignment
        static constexpr bool misaligned_vectors_ok() {
          return false;
        }

      src/hotspot/cpu/aarch64/matcher_aarch64.hpp:
        // aarch64 supports misaligned vectors store/load.
        static constexpr bool misaligned_vectors_ok() {
          return true;
        }

      src/hotspot/cpu/s390/matcher_s390.hpp:
        // z/Architecture does support misaligned store/load at minimal extra cost.
        static constexpr bool misaligned_vectors_ok() {
          return true;
        }

      src/hotspot/cpu/arm/matcher_arm.hpp:
        // ARM doesn't support misaligned vectors store/load.
        static constexpr bool misaligned_vectors_ok() {
          return false;
        }

      src/hotspot/cpu/riscv/matcher_riscv.hpp:
        // riscv supports misaligned vectors store/load.
        static constexpr bool misaligned_vectors_ok() {
          return true;
        }

      There are some exceptions, where AlignVector gets enabled for specific CPUs even though misaligned_vectors_ok() returns true, for example on x86 and aarch64:

      x86:
      src/hotspot/cpu/x86/vm_version_x86.cpp: AlignVector = !UseUnalignedLoadStores;

            if (supports_sse4_2()) { // new ZX cpus
              if (FLAG_IS_DEFAULT(UseUnalignedLoadStores)) {
                UseUnalignedLoadStores = true; // use movdqu on newest ZX cpus
              }
            }
      So I suppose some older platforms may be affected, though I have not seen one yet. They would have to be missing the unaligned movdqu instructions.

      aarch64:
      src/hotspot/cpu/aarch64/vm_version_aarch64.cpp: AlignVector = AvoidUnalignedAccesses;

        // Ampere eMAG
        if (_cpu == CPU_AMCC && (_model == CPU_MODEL_EMAG) && (_variant == 0x3)) {
          if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
            FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
          }
      and

        // ThunderX
        if (_cpu == CPU_CAVIUM && (_model == 0xA1)) {
          guarantee(_variant != 0, "Pre-release hardware no longer supported.");
          if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
            FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
          }
      and

        // ThunderX2
        if ((_cpu == CPU_CAVIUM && (_model == 0xAF)) ||
            (_cpu == CPU_BROADCOM && (_model == 0x516))) {
          if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
            FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
          }
      and

        // HiSilicon TSV110
        if (_cpu == CPU_HISILICON && _model == 0xd01) {
          if (FLAG_IS_DEFAULT(AvoidUnalignedAccesses)) {
            FLAG_SET_DEFAULT(AvoidUnalignedAccesses, true);
          }

      --------------------------------------------------------------

      If we do not require strict alignment, then we can use unaligned memory accesses, such as vmovdqu.

      With a strict alignment requirement (i.e. 8-byte alignment) / AlignVector, we need to make sure that every vector load/store address satisfies:
      adr % 8 = 0

      Of course all object bases are aligned to ObjectAlignmentInBytes = 8, so only the offset from the object base matters.

      ---------------------------

      Now let's try to get that 8-byte alignment in an example:

          public short[] convertFloatToShort() {
              short[] res = new short[SIZE];
              for (int i = 0; i < SIZE; i++) {
                  res[i] = (short) floats[i];
              }
              return res;
          }

      Let's look at the two addresses with UseCompactObjectHeaders=false, where we can vectorize:

      F_adr = base + 16 + 4 * i
      -> aligned for: i % 2 = 0
      S_adr = base + 16 + 2 * i
      -> aligned for: i % 4 = 0

      -> solution for both: i % 4 = 0, i.e. we have alignment for both vector accesses every 4th iteration.


      Let's look at the two addresses with UseCompactObjectHeaders=true, where we cannot vectorize:

      F_adr = base + 12 + 4 * i
      -> aligned for: i % 2 = 1
      S_adr = base + 12 + 2 * i
      -> aligned for: i % 4 = 2

      -> There is no solution to satisfy both alignment constraints!
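The two cases above can be checked by brute force. The following is only a sketch of the arithmetic, not how C2 computes its alignment solution; the names (AlignmentSolver, aligned, solutions) are hypothetical. It assumes the object base is 8-byte aligned, so only the payload offset and the element scale matter, and it uses the fact that the alignment pattern repeats every 8 iterations.

```java
import java.util.ArrayList;
import java.util.List;

public class AlignmentSolver {
    static final int ALIGN = 8; // required alignment in bytes

    // True if (base + offset + scale * i) is 8-byte aligned,
    // assuming base itself is 8-byte aligned.
    static boolean aligned(int offset, int scale, int i) {
        return (offset + scale * i) % ALIGN == 0;
    }

    // All i in [0, ALIGN) for which every access with the given element
    // scales is aligned. Since scale * (i + 8) and scale * i agree mod 8,
    // checking one period of 8 iterations is sufficient.
    static List<Integer> solutions(int offset, int... scales) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < ALIGN; i++) {
            boolean all = true;
            for (int scale : scales) {
                all &= aligned(offset, scale, i);
            }
            if (all) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // float (scale 4) and short (scale 2) accesses, as in convertFloatToShort
        System.out.println(solutions(16, 4, 2)); // prints [0, 4]
        System.out.println(solutions(12, 4, 2)); // prints []
    }
}
```

With offset 16 the solutions are exactly the iterations with i % 4 = 0. With offset 12 the float access needs an odd i while the short access needs an even i, so the list is empty.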

      ----------------------------

      Of course this is not strictly due to UseCompactObjectHeaders. There are other flags that affect the distance from object base to payload, such as UseCompressedClassPointers, which I think everyone has enabled now. But the question is whether we are ok with the consequences of enabling UseCompactObjectHeaders: some mixed type (e.g. conversion) loops can no longer vectorize, due to impossible alignment constraints.

      -------------------------------------

      If you are interested in more detail on how we currently compute the alignment solution for AlignVector, please see:
      JDK-8310190
      https://github.com/openjdk/jdk/pull/14785

            Assignee: Unassigned
            Reporter: Emanuel Peter (epeter)