JDK-8311932

Suboptimal compiled code of nested loop over memory segment


      The attached benchmark exhibits very poor performance when using the MemorySegment API: memory segment access is about 2x slower than semantically equivalent Unsafe code.
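The attached benchmark is not reproduced in this report. As a rough illustration of the kind of nested loop involved, here is a minimal sketch: an outer binary search over fixed-width records with an inner byte-by-byte comparison against a key array. The class name, `RECORD_SIZE`, and the record layout are assumptions made for illustration, not the benchmark's actual code, and the sketch assumes a JDK where `java.lang.foreign` is final (JDK 22 or later).

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class SegmentSearchSketch {
    // Hypothetical record width; the attached benchmark's layout is not shown in this report.
    static final int RECORD_SIZE = 8;

    // Binary search over sorted fixed-width records stored in a memory segment.
    // Records are ordered by signed, byte-wise lexicographic comparison.
    // Returns the index of the matching record, or -1 if there is none.
    static int search(MemorySegment data, byte[] key) {
        int low = 0;
        int high = (int) (data.byteSize() / RECORD_SIZE) - 1;
        while (low <= high) {                                     // outer loop
            int mid = (low + high) >>> 1;
            long base = (long) mid * RECORD_SIZE;
            int cmp = 0;
            for (int i = 0; i < RECORD_SIZE && cmp == 0; i++) {   // inner loop
                byte k = key[i];                                    // plain array load
                byte v = data.get(ValueLayout.JAVA_BYTE, base + i); // bounds-checked segment access
                cmp = Byte.compare(v, k);
            }
            if (cmp < 0) {
                low = mid + 1;
            } else if (cmp > 0) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

Each `data.get(...)` call carries a bounds check against the segment size, and how the JIT handles that check inside the inner loop is what this issue is about.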

      While the generated code in the two cases is very similar, the memory segment version has the following sequence of instructions at the end of the inner loop:

      ```
        1.38% 0x00007fcaa44bbaca: mov $0x80,%r11d
                  0x00007fcaa44bbad0: cmova %r11d,%r9d ; {no_reloc}
        4.66% 0x00007fcaa44bbad4: lea 0x1(%rsi),%rbp
                  0x00007fcaa44bbad8: movslq %r9d,%r10
        4.03% 0x00007fcaa44bbadb: lea 0x1(%rsi,%r10,1),%r11
        4.42% 0x00007fcaa44bbae0: cmp %rbp,%r11
        1.69% 0x00007fcaa44bbae3: movabs $0x7fffffffffffffff,%rax
                  0x00007fcaa44bbaed: cmovl %rax,%r11
        8.53% 0x00007fcaa44bbaf1: cmp %rax,%r11
        2.19% 0x00007fcaa44bbaf4: cmovge %rax,%r11
        4.52% 0x00007fcaa44bbaf8: mov %rbp,%rax
                  0x00007fcaa44bbafb: xor %r13d,%r13d
                  0x00007fcaa44bbafe: test %rbp,%rbp
                  0x00007fcaa44bbb01: cmovl %r13,%rax
        0.02% 0x00007fcaa44bbb05: sub %rax,%rsi
                  0x00007fcaa44bbb08: cmp %r11,%rax
        4.62% 0x00007fcaa44bbb0b: mov %rax,%r13
                  0x00007fcaa44bbb0e: cmovl %r11,%r13
        4.20% 0x00007fcaa44bbb12: sub %rax,%r13
        5.38% 0x00007fcaa44bbb15: mov %esi,%eax
                  0x00007fcaa44bbb17: mov %r13d,%esi
                  0x00007fcaa44bbb1a: inc %eax
                  0x00007fcaa44bbb1c: cmp %esi,%eax
        2.85% 0x00007fcaa44bbb1e: jae 0x00007fcaa44bbcec
                  0x00007fcaa44bbb24: vmovq %xmm2,%r11
                  0x00007fcaa44bbb29: add %r8,%r11
                  0x00007fcaa44bbb2c: mov %r14,%rbp
                  0x00007fcaa44bbb2f: add %r8,%rbp
                  0x00007fcaa44bbb32: movsbl 0x10(%r11),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0}
                                                                            ; - org.openjdk.bench.java.lang.foreign.BinarySearch::binarySearch_panama@121 (line 98)
                                                                            ; - org.openjdk.bench.java.lang.foreign.jmh_generated.BinarySearch_binarySearch_panama_jmhTest::binarySearch_panama_avgt_jmhStub@17 (line 186)
                  0x00007fcaa44bbb37: movsbl 0x1(%rbp),%r13d ;*invokevirtual getByte {reexecute=0 rethrow=0 return_oop=0}
                                                                            ; - jdk.internal.misc.ScopedMemoryAccess::getByteInternal@13 (line 528)
                                                                            ; - jdk.internal.misc.ScopedMemoryAccess::getByte@4 (line 516)
      ```

      This sequence is the likely cause of the performance delta.
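To make the listing easier to follow, the instructions from `lea 0x1(%rsi),%rbp` through `jae` can be transliterated into Java. This is only one reading of the disassembly, not compiler source: the class, method, and variable names are invented, `i` stands for `%rsi`, and `n` stands for `%r9d` after the `cmova` clamp to 0x80.

```java
final class ClampSketch {
    // Returns true when the `jae` at the end of the sequence would be taken.
    static boolean branchTaken(long i, int n) {
        long start = i + 1;                            // lea 0x1(%rsi),%rbp
        long end = i + (long) n + 1;                   // movslq %r9d,%r10 ; lea 0x1(%rsi,%r10,1),%r11
        if (end < start) end = Long.MAX_VALUE;         // cmp %rbp,%r11 ; cmovl %rax,%r11
        if (end >= Long.MAX_VALUE) end = Long.MAX_VALUE; // cmp %rax,%r11 ; cmovge %rax,%r11
        long lo = start < 0 ? 0 : start;               // test %rbp,%rbp ; cmovl %r13,%rax
        long rel = i - lo;                             // sub %rax,%rsi
        long hi = lo < end ? end : lo;                 // cmp %r11,%rax ; cmovl %r11,%r13
        long span = hi - lo;                           // sub %rax,%r13
        int idx = (int) rel + 1;                       // mov %esi,%eax ; inc %eax
        int limit = (int) span;                        // mov %r13d,%esi
        return Integer.compareUnsigned(idx, limit) >= 0; // cmp %esi,%eax ; jae
    }
}
```

Under this reading, every inner-loop iteration recomputes a clamped `[lo, hi)` range with four conditional moves before a single unsigned comparison, and the `cmovge` against `Long.MAX_VALUE` cannot change the value it guards.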

      This issue was reported here:

      https://git.openjdk.org/panama-foreign/pull/844

      And then further discussed here:

      https://mail.openjdk.org/pipermail/panama-dev/2023-July/019369.html

      Attachments (uploaded by Maurizio Cimadamore):

        1. BinarySearch.java (25 kB)
        2. BinarySearchInstance.java (29 kB)
        3. BinarySearchMini.java (27 kB)

            roland Roland Westrelin
            mcimadamore Maurizio Cimadamore