JDK-8343773

Superword/auto vectorization of fill pattern is slow on Aarch64

      The class `jdk.internal.foreign.SegmentBulkOperations` contains a method `fill()`. That method is hand-written to use long -> int -> short -> byte operations, so that each store covers the largest possible unit during segment traversal.
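      The long -> int -> short -> byte idea can be illustrated with public API only. The following is a minimal sketch, not the actual `SegmentBulkOperations.fill()` code: the class `FillSketch` and the method `fillDescending` are hypothetical names, and the sketch assumes Java 22 or later, where the `java.lang.foreign` API is final.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public final class FillSketch {

    // Hypothetical helper for illustration; not the JDK-internal implementation.
    static void fillDescending(MemorySegment dst, byte value) {
        final long len = dst.byteSize();
        // Replicate the byte into all eight byte lanes of a long.
        final long pattern = (value & 0xFFL) * 0x0101010101010101L;

        long offset = 0;
        // Main part: write 8 bytes per store.
        for (; offset + Long.BYTES <= len; offset += Long.BYTES) {
            dst.set(ValueLayout.JAVA_LONG_UNALIGNED, offset, pattern);
        }

        // Tail: at most 7 bytes remain, handled with one int, one short and one byte store.
        final long remaining = len - offset;
        if ((remaining & 4) != 0) {
            dst.set(ValueLayout.JAVA_INT_UNALIGNED, offset, (int) pattern);
            offset += Integer.BYTES;
        }
        if ((remaining & 2) != 0) {
            dst.set(ValueLayout.JAVA_SHORT_UNALIGNED, offset, (short) pattern);
            offset += Short.BYTES;
        }
        if ((remaining & 1) != 0) {
            dst.set(ValueLayout.JAVA_BYTE, offset, value);
        }
    }
}
```

      The unaligned layouts are used so that the sketch also works for heap segments backed by a `byte[]`. Because every byte of the pattern is identical, the byte order of the layouts does not matter here.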

      It would be tempting to replace that method with something like this:

                  final int end = (int) dst.length;
                  // Rely on aligned auto vectorization
                  for (int i = 0 ; i < end; i++) {
                      SCOPED_MEMORY_ACCESS.putByte(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + i, value);
                  }
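      The snippet above relies on JDK-internal classes and only compiles inside `java.base`. For experiments outside the JDK, a loop of the same shape can be written with public API. This is a rough sketch under assumptions: `ByteLoopFill` and `fillByteLoop` are hypothetical names, and the public `MemorySegment.set` accessor performs its own checks, so the code C2 generates for it may differ from what the internal `SCOPED_MEMORY_ACCESS` path produces.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public final class ByteLoopFill {

    // Hypothetical helper for illustration; mirrors the shape of the loop above.
    static void fillByteLoop(MemorySegment dst, byte value) {
        // As in the original snippet, the int cast assumes a segment smaller than 2 GiB.
        final int end = (int) dst.byteSize();
        // Plain counted loop with an int induction variable, one byte per store.
        for (int i = 0; i < end; i++) {
            dst.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    public static void main(String[] args) {
        MemorySegment expected = MemorySegment.ofArray(new byte[1024]).fill((byte) 42);
        MemorySegment actual = MemorySegment.ofArray(new byte[1024]);
        fillByteLoop(actual, (byte) 42);
        // mismatch() returns -1 when both segments have identical contents.
        System.out.println(expected.mismatch(actual));
    }
}
```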

      With such a loop, C2/Graal could generate even more optimized code, for example by using superwords and loop unrolling. However, at least on macOS (Apple M1), the code C2 generates for the loop is slower than or on par with the hand-written version:


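      package org.openjdk.bench.java.lang.foreign;

      import java.lang.foreign.Arena;
      import java.lang.foreign.MemorySegment;
      import java.nio.ByteBuffer;
      import java.util.concurrent.TimeUnit;

      import org.openjdk.jmh.annotations.Benchmark;
      import org.openjdk.jmh.annotations.BenchmarkMode;
      import org.openjdk.jmh.annotations.Fork;
      import org.openjdk.jmh.annotations.Measurement;
      import org.openjdk.jmh.annotations.Mode;
      import org.openjdk.jmh.annotations.OutputTimeUnit;
      import org.openjdk.jmh.annotations.Param;
      import org.openjdk.jmh.annotations.Scope;
      import org.openjdk.jmh.annotations.Setup;
      import org.openjdk.jmh.annotations.State;
      import org.openjdk.jmh.annotations.Warmup;

      @BenchmarkMode(Mode.AverageTime)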
      @Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
      @Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
      @State(Scope.Thread)
      @OutputTimeUnit(TimeUnit.NANOSECONDS)
      @Fork(value = 3)
      public class SegmentBulk2Fill {

          @Param({"8", "32", "512", "2048", "32768"})
          public int ELEM_SIZE;

          byte[] array;
          MemorySegment heapSegment;
          MemorySegment nativeSegment;
          ByteBuffer buffer;

          @Setup
          public void setup() {
              array = new byte[ELEM_SIZE];
              heapSegment = MemorySegment.ofArray(array);
              nativeSegment = Arena.ofAuto().allocate(ELEM_SIZE, 8);
              buffer = ByteBuffer.wrap(array);
          }


          @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
          @Benchmark
          public void heapSegmentFillJava() {
              heapSegment.fill((byte) 0);
          }

          @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
          @Benchmark
          public void nativeSegmentFillJava() {
              nativeSegment.fill((byte) 0);
          }

      }

      $ make test TEST="micro:java.lang.foreign.SegmentBulk2Fill" MICRO="OPTIONS=-p ELEM_SIZE=65536"

      Base
      Benchmark                               (ELEM_SIZE)  Mode  Cnt    Score    Error  Units
      SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30  662.179 ± 23.403  ns/op
      SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30  650.022 ± 11.491  ns/op

      Loop

      Benchmark                               (ELEM_SIZE)  Mode  Cnt     Score       Error  Units
      SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30  7314.986 ± 11931.163  ns/op
      SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30   658.273 ±    16.371  ns/op
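      To put the scores in perspective, the time per operation can be converted into throughput. The small sketch below divides the 65536-byte segment size by the reported ns/op, which gives bytes per nanosecond (numerically equal to GB/s, with 1 GB = 10^9 bytes). The class and method names are illustrative and not part of the benchmark.

```java
public final class FillThroughput {

    // Bytes written per nanosecond; numerically the same as GB/s (10^9 bytes per second).
    static double bytesPerNanosecond(long bytes, double nanosPerOp) {
        return bytes / nanosPerOp;
    }

    public static void main(String[] args) {
        final long size = 65_536; // ELEM_SIZE used in the runs above

        // Scores copied from the "Base" and "Loop" tables above.
        System.out.printf("Base heap:   %.1f B/ns%n", bytesPerNanosecond(size, 662.179));
        System.out.printf("Base native: %.1f B/ns%n", bytesPerNanosecond(size, 650.022));
        System.out.printf("Loop heap:   %.1f B/ns%n", bytesPerNanosecond(size, 7314.986));
        System.out.printf("Loop native: %.1f B/ns%n", bytesPerNanosecond(size, 658.273));
    }
}
```

      Note that the error reported for the heap variant of the loop (± 11931.163 ns/op) is larger than the score itself, so any throughput derived from that row is only a rough indication.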

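      C2 vectorizes the loop with 16-byte superword stores (`str q16`) and unrolls it 16 times, so each iteration of the main loop writes 256 bytes (`add w4, w4, #0x100`):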

      $ sudo make test TEST="micro:java.lang.foreign.SegmentBulk2Fill" MICRO="VM_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:CompileCommand=TraceAutoVectorization,*Bulk2Fill.heapSegmentFillJava,ALL;OPTIONS=-prof dtraceasm -p ELEM_SIZE=65536" CONF=macosx-aarch64-debug

      ....[Hottest Region 1]..............................................................................
      c2, level 4, org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub, version 5, compile id 717

                   0x00000001162887ec: mov w8, #0xe800 // #59392
                   0x00000001162887f0: movk w8, #0x3, lsl #16
                   0x00000001162887f4: cmp w13, w8
                   0x00000001162887f8: csel w11, w12, w13, hi // hi = pmore
                   0x00000001162887fc: add w13, w11, w4 ;*getstatic SCOPED_MEMORY_ACCESS {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill@46 (line 75)
                                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                             ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
                  ;; B42: # out( B42 B43 ) <- in( B41 B42 ) Loop( B42-B42 inner main of N374 strip mined) Freq: 6.42119e+14
         0.04% ↗ 0x0000000116288800: add x11, x21, w4, sxtw
                │ 0x0000000116288804: str q16, [x11]
                │ 0x0000000116288808: str q16, [x11, #16]
                │ 0x000000011628880c: str q16, [x11, #32]
                │ 0x0000000116288810: str q16, [x11, #48]
                │ 0x0000000116288814: str q16, [x11, #64]
         0.08% │ 0x0000000116288818: str q16, [x11, #80]
        15.41% │ 0x000000011628881c: str q16, [x11, #96]
         7.20% │ 0x0000000116288820: str q16, [x11, #112]
         0.02% │ 0x0000000116288824: str q16, [x11, #128]
                │ 0x0000000116288828: str q16, [x11, #144]
        10.72% │ 0x000000011628882c: str q16, [x11, #160]
         0.96% │ 0x0000000116288830: str q16, [x11, #176]
                │ 0x0000000116288834: str q16, [x11, #192]
                │ 0x0000000116288838: str q16, [x11, #208]
        24.22% │ 0x000000011628883c: str q16, [x11, #224]
        34.94% │ 0x0000000116288840: str q16, [x11, #240] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
                │ ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
                │ ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
                │ ; - jdk.internal.foreign.SegmentBulkOperations::fill@65 (line 75)
                │ ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                │ ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                │ ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
                │ 0x0000000116288844: add w4, w4, #0x100 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                │ ; - jdk.internal.foreign.SegmentBulkOperations::fill@68 (line 74)
                │ ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                │ ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                │ ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
                │ 0x0000000116288848: cmp w4, w13
                ╰ 0x000000011628884c: b.lt 0x0000000116288800 // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill@43 (line 74)
                                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                             ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
                  ;; B43: # out( B41 B44 ) <- in( B42 ) Freq: 6.81265e+09
         0.55% 0x0000000116288850: ldr x6, [x28, #48] ; ImmutableOopMap {r14=Oop r16=Oop c_rarg2=Oop c_rarg5=Derived_oop_c_rarg2 r19=Oop }
                                                                             ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                                             ; - (reexecute) jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
                                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
                                                                             ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
                                                                             ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
                   0x0000000116288854: ldr wzr, [x6] ; {poll}
         0.08% 0x0000000116288858: ldrb w8, [x28, #1184]
                   0x000000011628885c: cbz x8, 0x0000000116288874
                  ;; 0x104DAB6FC
                   0x0000000116288860: mov x8, #0xb6fc // #46844
                                                                             ; {runtime_call JavaThread::verify_cross_modify_fence_failure(JavaThread*)}
                   0x0000000116288864: movk x8, #0x4da, lsl #16
                   0x0000000116288868: movk x8, #0x1, lsl #32
                   0x000000011628886c: mov x0, x28
                   0x0000000116288870: blr x8 ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                             ; - jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
                                                                             ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
      ....................................................................................................

        1. trace.txt
          1.08 MB
          Per-Ake Minborg

            epeter Emanuel Peter
            pminborg Per-Ake Minborg