- Bug
- Resolution: Fixed
- P2
- 21, 22
- b25
- Verified
The attached benchmark performs very poorly when using the MemorySegment API: memory segments are 2x slower than semantically equivalent Unsafe code.
While the generated code in the two cases is very similar, the memory segment version has the following sequence of instructions at the end of the inner loop:
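For readers without the attachment, the kind of comparison described above can be sketched roughly as follows. This is not the attached benchmark: the class name `SegmentVsUnsafeSearch`, the helper `demo`, and the 100-byte data set are all hypothetical, and the sketch assumes a JDK where `java.lang.foreign` is final (22 or later). It only illustrates what "semantically equivalent" means here: the same binary search, once through bounds-checked `MemorySegment::get` and once through raw `Unsafe::getByte`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.reflect.Field;
import sun.misc.Unsafe;

// Hypothetical illustration only; not the benchmark attached to this issue.
public class SegmentVsUnsafeSearch {

    private static final Unsafe UNSAFE = loadUnsafe();

    // sun.misc.Unsafe has no public factory, so read its singleton field reflectively.
    private static Unsafe loadUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Binary search over sorted bytes; every access is bounds-checked by the segment.
    public static long searchSegment(MemorySegment segment, byte key) {
        long low = 0;
        long high = segment.byteSize() - 1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            byte value = segment.get(ValueLayout.JAVA_BYTE, mid);
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    // Same search over a raw address; no bounds or liveness checks are performed.
    public static long searchUnsafe(long address, long size, byte key) {
        long low = 0;
        long high = size - 1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            byte value = UNSAFE.getByte(address + mid);
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    // Fills a native segment with the sorted bytes 0..99 and runs both searches on it.
    // Returns { segment result, Unsafe result }.
    public static long[] demo(byte key) {
        final int size = 100;
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(size);
            for (int i = 0; i < size; i++) {
                segment.set(ValueLayout.JAVA_BYTE, i, (byte) i);
            }
            long viaSegment = searchSegment(segment, key);
            long viaUnsafe = searchUnsafe(segment.address(), size, key);
            return new long[] { viaSegment, viaUnsafe };
        }
    }

    public static void main(String[] args) {
        long[] result = demo((byte) 42);
        System.out.println("segment: " + result[0] + ", unsafe: " + result[1]);
    }
}
```

The sketch checks only that both variants return the same index; measuring the performance delta requires a proper JMH harness such as the one in the attachment. Recent JDKs may warn about the `sun.misc.Unsafe` memory-access methods.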
```
1.38% 0x00007fcaa44bbaca: mov $0x80,%r11d
0x00007fcaa44bbad0: cmova %r11d,%r9d ; {no_reloc}
4.66% 0x00007fcaa44bbad4: lea 0x1(%rsi),%rbp
0x00007fcaa44bbad8: movslq %r9d,%r10
4.03% 0x00007fcaa44bbadb: lea 0x1(%rsi,%r10,1),%r11
4.42% 0x00007fcaa44bbae0: cmp %rbp,%r11
1.69% 0x00007fcaa44bbae3: movabs $0x7fffffffffffffff,%rax
0x00007fcaa44bbaed: cmovl %rax,%r11
8.53% 0x00007fcaa44bbaf1: cmp %rax,%r11
2.19% 0x00007fcaa44bbaf4: cmovge %rax,%r11
4.52% 0x00007fcaa44bbaf8: mov %rbp,%rax
0x00007fcaa44bbafb: xor %r13d,%r13d
0x00007fcaa44bbafe: test %rbp,%rbp
0x00007fcaa44bbb01: cmovl %r13,%rax
0.02% 0x00007fcaa44bbb05: sub %rax,%rsi
0x00007fcaa44bbb08: cmp %r11,%rax
4.62% 0x00007fcaa44bbb0b: mov %rax,%r13
0x00007fcaa44bbb0e: cmovl %r11,%r13
4.20% 0x00007fcaa44bbb12: sub %rax,%r13
5.38% 0x00007fcaa44bbb15: mov %esi,%eax
0x00007fcaa44bbb17: mov %r13d,%esi
0x00007fcaa44bbb1a: inc %eax
0x00007fcaa44bbb1c: cmp %esi,%eax
2.85% 0x00007fcaa44bbb1e: jae 0x00007fcaa44bbcec
0x00007fcaa44bbb24: vmovq %xmm2,%r11
0x00007fcaa44bbb29: add %r8,%r11
0x00007fcaa44bbb2c: mov %r14,%rbp
0x00007fcaa44bbb2f: add %r8,%rbp
0x00007fcaa44bbb32: movsbl 0x10(%r11),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0}
; - org.openjdk.bench.java.lang.foreign.BinarySearch::binarySearch_panama@121 (line 98)
; - org.openjdk.bench.java.lang.foreign.jmh_generated.BinarySearch_binarySearch_panama_jmhTest::binarySearch_panama_avgt_jmhStub@17 (line 186)
0x00007fcaa44bbb37: movsbl 0x1(%rbp),%r13d ;*invokevirtual getByte {reexecute=0 rethrow=0 return_oop=0}
; - jdk.internal.misc.ScopedMemoryAccess::getByteInternal@13 (line 528)
; - jdk.internal.misc.ScopedMemoryAccess::getByte@4 (line 516)
```
This sequence is the likely cause of the performance delta.
This issue was reported here:
https://git.openjdk.org/panama-foreign/pull/844
And then further discussed here:
https://mail.openjdk.org/pipermail/panama-dev/2023-July/019369.html