- Type: Bug
- Resolution: Fixed
- Priority: P3
- Fix Version: 20
- Resolved In Build: b25
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8327524 | 17.0.12-oracle | Joe Cherian | P3 | Resolved | Fixed | b01 |
JDK-8317304 | 17.0.10 | Goetz Lindenmaier | P3 | Resolved | Fixed | b01 |
While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.
The `CURRENT_PC` and `CALLER_PC` macros are used to create `NativeCallStack` objects out of thin air:
```
#define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
NativeCallStack(0) : NativeCallStack::empty_stack())
#define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
NativeCallStack(1) : NativeCallStack::empty_stack())
```
and feed them to a callee routine, which usually has the argument defined via const reference, e.g. os::malloc:
```
void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
```
In `CURRENT_PC` and `CALLER_PC`, the left-hand side of the ':' in the conditional expression handles detail mode, where we actually do collect a stack. In that case, the stack sits on the thread stack as an automatic anonymous variable and is filled by the stack walker. That's all fine.
The right-hand side of the ':' handles the case where we don't want a stack. There, the intent is to hand down a reference to a pre-created "empty stack" singleton (`NativeCallStack::empty_stack()`).
However, that does not work as intended. The C++ compiler (at least gcc on Linux) generates code that laboriously copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions, two 16-byte loads and two 16-byte stores (an NMT stack is four pointer-sized slots containing PCs):
```
0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
...
# Load tracking level
cb9a77: 48 8d 1d 02 35 78 00 lea 0x783502(%rip),%rbx # 143cf80 <_ZN10MemTracker15_tracking_levelE>
cb9a7e: 8b 03 mov (%rbx),%eax
# detail (3) tracking?
cb9a80: 83 f8 03 cmp $0x3,%eax
# yes: go and collect callstack
cb9a83: 0f 84 57 01 00 00 je cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
# no: copy the content of NativeCallStack::_empty_stack to the local stack, in 16 byte intervals:
cb9a89: 48 8d 05 30 44 78 00 lea 0x784430(%rip),%rax # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
cb9a90: f3 0f 6f 00 movdqu (%rax),%xmm0
cb9a94: f3 0f 6f 48 10 movdqu 0x10(%rax),%xmm1
cb9a99: 0f 11 45 c0 movups %xmm0,-0x40(%rbp)
cb9a9d: 0f 11 4d d0 movups %xmm1,-0x30(%rbp)
...
# do the actual malloc:
cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc@plt>
# call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
cb9b0f: 48 8d 4d c0 lea -0x40(%rbp),%rcx
...
cb9b19: e8 f2 b7 f3 ff callq bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
```
This is completely unnecessary: if the NMT mode is not detail, the stack is never used. The copy happens at every call site that uses these macros, so we pay for it even if NMT is disabled.
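The copy follows from the C++ rules for the conditional operator: when one operand is a temporary and the other is an lvalue of the same class type, the whole expression is a prvalue, so the lvalue operand has to be copied into a fresh temporary. The following is a minimal sketch that reproduces this outside HotSpot. `Stack`, `detail_mode`, `CURRENT_STACK`, `CURRENT_STACK_CHEAP` and the `Cheap` tag are hypothetical stand-ins invented for illustration, not HotSpot names. The second macro shows one possible way to avoid the copy (both branches are temporaries, and the non-detail branch uses a constructor that does no work); it is not necessarily what the committed fix does, and the zero-copy result assumes C++17 semantics.

```cpp
#include <type_traits>

// Hypothetical stand-in for NativeCallStack: four pointer-sized slots.
struct Stack {
  enum class Cheap { marker };   // hypothetical tag type for a do-nothing constructor

  void* frames[4];
  static int copies;             // number of copy-constructor calls so far

  explicit Stack(int /*frames_to_skip*/) : frames{} {}  // a real one would walk the stack
  explicit Stack(Cheap) {}       // deliberately leaves frames uninitialized
  Stack(const Stack& other) {
    for (int i = 0; i < 4; i++) frames[i] = other.frames[i];
    ++copies;
  }

  static const Stack& empty_stack() { return _empty; }

 private:
  Stack() : frames{} {}
  static const Stack _empty;     // the pre-created "empty stack" singleton
};

int Stack::copies = 0;
const Stack Stack::_empty;

// Stand-in for (MemTracker::tracking_level() == NMT_detail).
static bool detail_mode = false;

// Same shape as the macros in this report: temporary on the left, lvalue on the right.
#define CURRENT_STACK       (detail_mode ? Stack(0) : Stack::empty_stack())
// Possible alternative: both branches are temporaries, so nothing needs copying.
#define CURRENT_STACK_CHEAP (detail_mode ? Stack(0) : Stack(Stack::Cheap::marker))

// The conditional expression is a prvalue, not a reference to the singleton.
static_assert(!std::is_reference<decltype(CURRENT_STACK)>::value,
              "conditional expression should be a prvalue");

static bool saw_singleton = false;

// Callee takes the stack by const reference, like os::malloc does.
static void callee(const Stack& s) {
  saw_singleton = (&s == &Stack::empty_stack());
}

// True if the most recent callee() call received the singleton itself.
bool callee_saw_singleton() { return saw_singleton; }

// Copy-constructor calls caused by one call through the original-style macro.
int copies_with_singleton_macro() {
  const int before = Stack::copies;
  callee(CURRENT_STACK);
  return Stack::copies - before;
}

// Copy-constructor calls caused by one call through the alternative macro.
int copies_with_cheap_macro() {
  const int before = Stack::copies;
  callee(CURRENT_STACK_CHEAP);
  return Stack::copies - before;
}
```

With `detail_mode` left at `false`, `copies_with_singleton_macro()` reports one copy per call and the callee never sees the singleton's address, which matches the load/store sequence in the disassembly above. `copies_with_cheap_macro()` reports no copies, at the price of handing the callee a stack whose slots must never be read outside detail mode.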
- backported by
  - JDK-8317304 NMT incurs costs if disabled (Resolved)
  - JDK-8327524 NMT incurs costs if disabled (Resolved)
- relates to
  - JDK-8296436 NMT level does not need to be volatile (Closed)
- links to
  - Commit openjdk/jdk17u-dev/174c3291
  - Commit openjdk/jdk/9f8b6d2a
  - Review openjdk/jdk17u-dev/1808
  - Review openjdk/jdk/11040