- Bug
- Resolution: Fixed
- P2
- 9
- b105
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8246850 | emb-8u261 | Erik Joelsson | P2 | Resolved | Fixed | team |
While implementing some performance-sensitive logic in HotSpot, I noticed that the generated C++ code on Solaris-x64 performed worse than it should. A deeper analysis showed that the cause is a problem with inlining/intrinsifying various memcpy calls. I tracked this down to whether or not the "sysroot" include path is passed explicitly when compiling the C++ file(s).
Specifically, here's a small reproducer:
----
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
uint64_t read_unaligned(void* src) {
  uint64_t tmp;
  memcpy(&tmp, src, sizeof(uint64_t));
  return tmp;
}
----
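As background on why the reproducer goes through memcpy at all: casting a misaligned address to `uint64_t*` and dereferencing it is undefined behavior in C++, so copying the bytes with memcpy is the usual portable idiom, and it relies on the compiler turning the fixed-size copy into a plain load. The sketch below exercises the reproducer at a deliberately odd offset. The helper name `round_trip_at_odd_offset` is hypothetical and is not part of the original report.

```cpp
#include <stdint.h>
#include <string.h>

// Same function as in the reproducer.
uint64_t read_unaligned(void* src) {
  uint64_t tmp;
  memcpy(&tmp, src, sizeof(uint64_t));
  return tmp;
}

// Hypothetical helper (illustration only): stores `value` at byte offset 1
// of a local buffer, which is generally not 8-byte aligned, and reads it
// back through read_unaligned().
uint64_t round_trip_at_odd_offset(uint64_t value) {
  unsigned char buf[sizeof(uint64_t) + 1] = {0};
  memcpy(buf + 1, &value, sizeof(value));  // write the bytes at a misaligned address
  return read_unaligned(buf + 1);          // read the same bytes back
}
```

The result is the same whether or not the compiler inlines the memcpy; only the generated code differs.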
When compiled without an explicit sysroot include path like so:
SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -xO4 -o libfoo.so unaligned_read.cpp
The resulting assembly looks like this:
0000000000000bf0 <__1cOread_unaligned6Fpv_L_>:
 bf0:   55                      push   %rbp
 bf1:   48 8b ec                mov    %rsp,%rbp
 bf4:   48 8b 07                mov    (%rdi),%rax
 bf7:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 bfb:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 bff:   c9                      leaveq
 c00:   c3                      retq
That is, the compiler has "inlined" memcpy and is just reading the value using a normal mov.
However, when the code is compiled *with* an explicit sysroot include like so:
SS12u4-Solaris11u1/SS12u4/bin/CC -m64 -G -I/opt/jprt/products/P1/SS12u4-Solaris11u1/SS12u4-Solaris11u1/sysroot/usr/include -xO4 -o libfoo.so unaligned_read.cpp
The resulting code looks like this:
0000000000000c40 <__1cOread_unaligned6Fpv_L_>:
 c40:   55                      push   %rbp
 c41:   48 8b ec                mov    %rsp,%rbp
 c44:   48 83 ec 10             sub    $0x10,%rsp
 c48:   48 8b f7                mov    %rdi,%rsi
 c4b:   48 8d 45 f8             lea    -0x8(%rbp),%rax
 c4f:   48 8b f8                mov    %rax,%rdi
 c52:   48 c7 c2 08 00 00 00    mov    $0x8,%rdx
 c59:   e8 8a ff ff ff          callq  be8 <memcpy@plt>
 c5e:   48 8b 45 f8             mov    -0x8(%rbp),%rax
 c62:   c9                      leaveq
 c63:   c3                      retq
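That is, the memcpy call is still there: the function sets up the three arguments and calls memcpy through the PLT instead of doing a single load.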
The performance difference here is significant, especially if the code in question happens to be in a hot loop.
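To make the hot-loop case concrete, here is a minimal sketch of a caller that reads many unaligned 64-bit values in a row. The function `sum_unaligned` and its parameters are hypothetical and only illustrate the shape of such a loop; they are not taken from the HotSpot code that prompted this report. With the inlined form each iteration is essentially a load and an add, whereas with the out-of-line form each iteration also pays for argument setup and a call through the PLT.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Same function as in the reproducer.
uint64_t read_unaligned(void* src) {
  uint64_t tmp;
  memcpy(&tmp, src, sizeof(uint64_t));
  return tmp;
}

// Hypothetical hot loop (illustration only): sums `count` 64-bit values
// stored back to back starting at `base`, which may be misaligned.
uint64_t sum_unaligned(unsigned char* base, size_t count) {
  uint64_t sum = 0;
  for (size_t i = 0; i < count; i++) {
    // One read_unaligned() per element; whether this becomes a single mov
    // or a memcpy call depends on how the file was compiled.
    sum += read_unaligned(base + i * sizeof(uint64_t));
  }
  return sum;
}
```

Disassembling a loop like this under both command lines shown above would be one way to check which form the compiler emitted for a given build configuration.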
- backported by: JDK-8246850 Suboptimal code generated when setting sysroot include with Solaris Studio (Resolved)