Type: Enhancement
Resolution: Unresolved
Priority: P3
24
The class `jdk.internal.foreign.SegmentBulkOperations` contains a method `fill()`. The method is hand-written to store in progressively smaller units (long, then int, then short, then byte) so that each write covers as many bytes as possible while traversing the segment.
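As an illustration of the long -> int -> short -> byte idea, here is a minimal sketch. It is not the JDK implementation: it uses a plain `ByteBuffer` as a stand-in for the segment so that it is self-contained, and the class name `WideFillSketch` is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not the code in SegmentBulkOperations.
public final class WideFillSketch {
    private WideFillSketch() {}

    /** Fills bytes [0, limit) of dst with value, using the widest store that still fits. */
    public static void fill(ByteBuffer dst, byte value) {
        // Work on a duplicate so the caller's position, limit and byte order are untouched.
        ByteBuffer buf = dst.duplicate().order(ByteOrder.nativeOrder());
        int len = buf.limit();
        // Replicate the byte into all eight lanes of a long (byte order is irrelevant here).
        long v = (value & 0xFFL) * 0x0101010101010101L;
        int offset = 0;
        // Bulk part: eight bytes per store.
        for (; offset + Long.BYTES <= len; offset += Long.BYTES) {
            buf.putLong(offset, v);
        }
        // Tail part: at most one int, one short and one byte store (0..7 bytes remain).
        int remaining = len - offset;
        if ((remaining & 4) != 0) {
            buf.putInt(offset, (int) v);
            offset += Integer.BYTES;
        }
        if ((remaining & 2) != 0) {
            buf.putShort(offset, (short) v);
            offset += Short.BYTES;
        }
        if ((remaining & 1) != 0) {
            buf.put(offset, value);
        }
    }
}
```

Calling `WideFillSketch.fill(ByteBuffer.wrap(new byte[13]), (byte) 1)` would issue one long store, one int store and one byte store. The tail never needs more than three stores, whatever the length.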
It would be tempting to replace that method with something like this:
final int end = (int) dst.length;
// Rely on aligned auto vectorization
for (int i = 0; i < end; i++) {
    SCOPED_MEMORY_ACCESS.putByte(dst.sessionImpl(), dst.unsafeGetBase(), dst.unsafeGetOffset() + i, value);
}
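C2/Graal would then be free to generate even more optimized code, for example by using superword vectorization and loop unrolling. However, at least on macOS M1, the C2-compiled loop is no faster than the hand-written version: the native-segment case is on par, and the heap-segment case is slower with very high variance: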
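package org.openjdk.bench.java.lang.foreign;

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(value = 3)
public class SegmentBulk2Fill {

    @Param({"8", "32", "512", "2048", "32768"})
    public int ELEM_SIZE;

    byte[] array;
    MemorySegment heapSegment;
    MemorySegment nativeSegment;
    ByteBuffer buffer;

    @Setup
    public void setup() {
        array = new byte[ELEM_SIZE];
        heapSegment = MemorySegment.ofArray(array);
        nativeSegment = Arena.ofAuto().allocate(ELEM_SIZE, 8);
        buffer = ByteBuffer.wrap(array);
    }

    // threshold.power.fill=31 keeps fill() on the Java path for all benchmarked sizes
    @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
    @Benchmark
    public void heapSegmentFillJava() {
        heapSegment.fill((byte) 0);
    }

    @Fork(value = 3, jvmArgs = {"-Djava.lang.foreign.native.threshold.power.fill=31"})
    @Benchmark
    public void nativeSegmentFillJava() {
        nativeSegment.fill((byte) 0);
    }
}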
$ make test TEST="micro:java.lang.foreign.SegmentBulk2Fill" MICRO="OPTIONS=-p ELEM_SIZE=65536"
Base
Benchmark                               (ELEM_SIZE)  Mode  Cnt     Score       Error  Units
SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30   662.179 ±    23.403  ns/op
SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30   650.022 ±    11.491  ns/op
Loop
Benchmark                               (ELEM_SIZE)  Mode  Cnt     Score       Error  Units
SegmentBulk2Fill.heapSegmentFillJava          65536  avgt   30  7314.986 ± 11931.163  ns/op
SegmentBulk2Fill.nativeSegmentFillJava        65536  avgt   30   658.273 ±    16.371  ns/op
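To put the scores in perspective, they can be converted to a fill rate. The helper below is only a sketch of that arithmetic (the class and method names are hypothetical); it divides the segment size by the reported average time per operation.

```java
// Hypothetical helper for interpreting the JMH scores above.
public final class FillThroughput {
    private FillThroughput() {}

    /** Bytes written per nanosecond for a fill of sizeBytes that takes nsPerOp on average. */
    public static double bytesPerNanosecond(long sizeBytes, double nsPerOp) {
        if (nsPerOp <= 0) {
            throw new IllegalArgumentException("nsPerOp must be positive");
        }
        return sizeBytes / nsPerOp;
    }
}
```

With ELEM_SIZE=65536, the Base scores and the Loop native-segment score all work out to roughly 100 bytes/ns, while the Loop heap-segment score of 7314.986 ns/op corresponds to about 9 bytes/ns. The error on that last measurement is larger than the score itself, so it is best read as "unstable" rather than as a precise slowdown factor.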
C2 does vectorize the loop: it uses superword with 16-byte vector stores and unrolls the loop body 16 times:
$ sudo make test TEST="micro:java.lang.foreign.SegmentBulk2Fill" MICRO="VM_OPTIONS=-XX:+UnlockDiagnosticVMOptions -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:CompileCommand=TraceAutoVectorization,*Bulk2Fill.heapSegmentFillJava,ALL;OPTIONS=-prof dtraceasm -p ELEM_SIZE=65536" CONF=macosx-aarch64-debug
....[Hottest Region 1]..............................................................................
c2, level 4, org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub, version 5, compile id 717
0x00000001162887ec: mov w8, #0xe800 // #59392
0x00000001162887f0: movk w8, #0x3, lsl #16
0x00000001162887f4: cmp w13, w8
0x00000001162887f8: csel w11, w12, w13, hi // hi = pmore
0x00000001162887fc: add w13, w11, w4 ;*getstatic SCOPED_MEMORY_ACCESS {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.foreign.SegmentBulkOperations::fill@46 (line 75)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
;; B42: # out( B42 B43 ) <- in( B41 B42 ) Loop( B42-B42 inner main of N374 strip mined) Freq: 6.42119e+14
0.04% ↗ 0x0000000116288800: add x11, x21, w4, sxtw
│ 0x0000000116288804: str q16, [x11]
│ 0x0000000116288808: str q16, [x11, #16]
│ 0x000000011628880c: str q16, [x11, #32]
│ 0x0000000116288810: str q16, [x11, #48]
│ 0x0000000116288814: str q16, [x11, #64]
0.08% │ 0x0000000116288818: str q16, [x11, #80]
15.41% │ 0x000000011628881c: str q16, [x11, #96]
7.20% │ 0x0000000116288820: str q16, [x11, #112]
0.02% │ 0x0000000116288824: str q16, [x11, #128]
│ 0x0000000116288828: str q16, [x11, #144]
10.72% │ 0x000000011628882c: str q16, [x11, #160]
0.96% │ 0x0000000116288830: str q16, [x11, #176]
│ 0x0000000116288834: str q16, [x11, #192]
│ 0x0000000116288838: str q16, [x11, #208]
24.22% │ 0x000000011628883c: str q16, [x11, #224]
34.94% │ 0x0000000116288840: str q16, [x11, #240] ;*invokevirtual putByte {reexecute=0 rethrow=0 return_oop=0}
│ ; - jdk.internal.misc.ScopedMemoryAccess::putByteInternal@15 (line 534)
│ ; - jdk.internal.misc.ScopedMemoryAccess::putByte@6 (line 522)
│ ; - jdk.internal.foreign.SegmentBulkOperations::fill@65 (line 75)
│ ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
│ ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
│ ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
│ 0x0000000116288844: add w4, w4, #0x100 ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - jdk.internal.foreign.SegmentBulkOperations::fill@68 (line 74)
│ ; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
│ ; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
│ ; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
│ 0x0000000116288848: cmp w4, w13
╰ 0x000000011628884c: b.lt 0x0000000116288800 // b.tstop;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.foreign.SegmentBulkOperations::fill@43 (line 74)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
;; B43: # out( B41 B44 ) <- in( B42 ) Freq: 6.81265e+09
0.55% 0x0000000116288850: ldr x6, [x28, #48] ; ImmutableOopMap {r14=Oop r16=Oop c_rarg2=Oop c_rarg5=Derived_oop_c_rarg2 r19=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
; - org.openjdk.bench.java.lang.foreign.SegmentBulk2Fill::nativeSegmentFillJava@5 (line 96)
; - org.openjdk.bench.java.lang.foreign.jmh_generated.SegmentBulk2Fill_nativeSegmentFillJava_jmhTest::nativeSegmentFillJava_avgt_jmhStub@15 (line 190)
0x0000000116288854: ldr wzr, [x6] ; {poll}
0.08% 0x0000000116288858: ldrb w8, [x28, #1184]
0x000000011628885c: cbz x8, 0x0000000116288874
;; 0x104DAB6FC
0x0000000116288860: mov x8, #0xb6fc // #46844
; {runtime_call JavaThread::verify_cross_modify_fence_failure(JavaThread*)}
0x0000000116288864: movk x8, #0x4da, lsl #16
0x0000000116288868: movk x8, #0x1, lsl #32
0x000000011628886c: mov x0, x28
0x0000000116288870: blr x8 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.foreign.SegmentBulkOperations::fill@71 (line 74)
; - jdk.internal.foreign.AbstractMemorySegmentImpl::fill@2 (line 184)
....................................................................................................
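The main loop in the listing above consists of 16 `str q16` stores of 16 bytes each, followed by `add w4, w4, #0x100`, so each iteration advances by 256 bytes. The sketch below only spells out that arithmetic; the class and method names are hypothetical, and it deliberately ignores any pre-loop or post-loop that C2 emits around the main loop, so the iteration count is an upper bound.

```java
// Hypothetical arithmetic sketch derived from the listing above; not part of the JDK.
public final class UnrolledStrideSketch {
    /** Width of one q-register store (str q16, [...]). */
    static final int VECTOR_BYTES = 16;
    /** Number of vector stores per main-loop iteration in the listing. */
    static final int UNROLL = 16;

    private UnrolledStrideSketch() {}

    /** Bytes covered by one main-loop iteration: 16 * 16 = 256 = 0x100. */
    public static int bytesPerIteration() {
        return VECTOR_BYTES * UNROLL;
    }

    /** Upper bound on main-loop iterations for a segment of the given size. */
    public static long maxMainLoopIterations(long sizeBytes) {
        return sizeBytes / bytesPerIteration();
    }

    /** Bytes that cannot be covered by full main-loop iterations. */
    public static long leftoverBytes(long sizeBytes) {
        return sizeBytes % bytesPerIteration();
    }
}
```

For the benchmarked size of 65536 bytes this gives at most 256 main-loop iterations with no leftover, which matches the `#0x100` increment of the induction variable `w4`.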
Relates to: JDK-8343844 Add benchmarks for superword/autovectorization in FFM BulkOperations (Resolved)