JDK-8148929: Suboptimal code generated when setting sysroot include with Solaris Studio


    • Type: Bug
    • Resolution: Fixed
    • Priority: P2
    • Affects Version/s: 9
    • Fix Version/s: 9
    • Component/s: infrastructure
    • Resolved In Build: b105

        While implementing some performance-sensitive logic in HotSpot, I noticed that the generated C++ code on Solaris-x64 did not perform as well as it should. Deeper analysis showed that the cause is a failure to inline/intrinsify various memcpy calls. I tracked this down to whether or not the "sysroot" include path is explicitly passed when compiling the C++ file(s).

        Specifically, here's a small reproducer:

        ----
        #include <stdlib.h>
        #include <stdint.h>
        #include <string.h>

        uint64_t read_unaligned(void* src) {
          uint64_t tmp;

          memcpy(&tmp, src, sizeof(uint64_t));

          return tmp;
        }
        ----

        When compiled without an explicit sysroot include path, like so:

        SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -xO4 -o libfoo.so unaligned_read.cpp

        The resulting assembly looks like this:

        0000000000000bf0 <__1cOread_unaligned6Fpv_L_>:
         bf0: 55 push %rbp
         bf1: 48 8b ec mov %rsp,%rbp
         bf4: 48 8b 07 mov (%rdi),%rax
         bf7: 48 89 45 f8 mov %rax,-0x8(%rbp)
         bfb: 48 8b 45 f8 mov -0x8(%rbp),%rax
         bff: c9 leaveq
         c00: c3 retq

        That is, the compiler has "inlined" memcpy and is just reading the value using a normal mov.

        However, when the code is compiled *with* an explicit sysroot include like so:

        SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -I/opt/jprt/products/P1/SS12u4-Solaris11u1/SS12u4-Solaris11u1/sysroot/usr/include -xO4 -o libfoo.so unaligned_read.cpp

        The resulting code looks like this:

        0000000000000c40 <__1cOread_unaligned6Fpv_L_>:
         c40: 55 push %rbp
         c41: 48 8b ec mov %rsp,%rbp
         c44: 48 83 ec 10 sub $0x10,%rsp
         c48: 48 8b f7 mov %rdi,%rsi
         c4b: 48 8d 45 f8 lea -0x8(%rbp),%rax
         c4f: 48 8b f8 mov %rax,%rdi
         c52: 48 c7 c2 08 00 00 00 mov $0x8,%rdx
         c59: e8 8a ff ff ff callq be8 <memcpy@plt>
         c5e: 48 8b 45 f8 mov -0x8(%rbp),%rax
         c62: c9 leaveq
         c63: c3 retq

        That is, the memcpy has not been inlined: the function still makes an out-of-line call to memcpy through the PLT.

        The performance difference here is significant, especially if the code in question happens to be in a hot loop.
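To make the hot-loop case concrete, here is a minimal sketch of the kind of loop where the difference would show up. It is not taken from HotSpot: `sum_unaligned` and `make_unaligned_buffer` are hypothetical helpers introduced only for illustration, and `read_unaligned` is the reproducer adapted to take a `const void*`. The code sticks to the C headers used by the reproducer, so it should not need any C++11 support.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <vector>

// Same idiom as the reproducer: copy 8 bytes out of a possibly
// unaligned address into a properly aligned local.
static inline uint64_t read_unaligned(const void* src) {
  uint64_t tmp;
  memcpy(&tmp, src, sizeof(uint64_t));
  return tmp;
}

// Hypothetical hot loop: sums `count` consecutive 64-bit values starting
// at `buf`, which may sit at any byte offset. If memcpy is not inlined,
// every iteration pays for a call through the PLT.
uint64_t sum_unaligned(const unsigned char* buf, size_t count) {
  uint64_t sum = 0;
  for (size_t i = 0; i < count; ++i) {
    sum += read_unaligned(buf + i * sizeof(uint64_t));
  }
  return sum;
}

// Hypothetical helper: stores the values 1..count starting at byte
// offset 1, so that every element is misaligned for uint64_t.
std::vector<unsigned char> make_unaligned_buffer(size_t count) {
  std::vector<unsigned char> buf(1 + count * sizeof(uint64_t));
  for (size_t i = 0; i < count; ++i) {
    uint64_t v = i + 1;
    memcpy(&buf[1 + i * sizeof(uint64_t)], &v, sizeof(uint64_t));
  }
  return buf;
}
```

Building this with each of the two command lines above and disassembling `sum_unaligned` should show whether the loop body contains a plain mov or a call to memcpy. Timing the loop over a large buffer would then put a number on the difference.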

        Attachments:

          1. unaligned_read.cpp (0.2 kB, Mikael Vidstedt)
          2. unaligned-read-nosysroot.txt (5 kB, Mikael Vidstedt)
          3. unaligned-read-sysroot.txt (5 kB, Mikael Vidstedt)

        Assignee: Erik Joelsson (erikj)
        Reporter: Mikael Vidstedt (mikael)