JDK-8296437

NMT incurs costs if disabled


    • b25

        While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.

        The CURRENT_PC and CALLER_PC macros are used to create `NativeCallStack` objects out of thin air:

        ```
        #define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
                            NativeCallStack(0) : NativeCallStack::empty_stack())
        #define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
                            NativeCallStack(1) : NativeCallStack::empty_stack())
        ```

        and feed them to a callee routine, which usually has the argument defined via const reference, e.g. os::malloc:

        ```
        void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
        ```

        In both CURRENT_PC and CALLER_PC, the left-hand side of the ':' handles detail mode, when we actually do collect a stack. In that case, the stack sits on the thread stack as an anonymous automatic variable and is filled by the stack walker. That's all fine.

        The right-hand side of the ':' handles the case when we don't want a stack. In that case, the intent is to hand down a reference to a pre-created "empty stack" singleton (NativeCallStack::empty_stack()).

        However, that does not work as intended. The C++ compiler (at least gcc on Linux) generates code that laboriously copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions, two 16-byte loads and two 16-byte stores (an NMT stack is four pointer-sized slots containing PCs):

        ```
        0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
        ...
        # Load tracking level
          cb9a77: 48 8d 1d 02 35 78 00 lea 0x783502(%rip),%rbx # 143cf80 <_ZN10MemTracker15_tracking_levelE>
          cb9a7e: 8b 03 mov (%rbx),%eax
        # detail (3) tracking?
          cb9a80: 83 f8 03 cmp $0x3,%eax
        # yes: go and collect callstack
          cb9a83: 0f 84 57 01 00 00 je cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
        # no: copy the content of NativeCallStack::_empty_stack to the local stack, in 16 byte intervals:
          cb9a89: 48 8d 05 30 44 78 00 lea 0x784430(%rip),%rax # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
          cb9a90: f3 0f 6f 00 movdqu (%rax),%xmm0
          cb9a94: f3 0f 6f 48 10 movdqu 0x10(%rax),%xmm1
          cb9a99: 0f 11 45 c0 movups %xmm0,-0x40(%rbp)
          cb9a9d: 0f 11 4d d0 movups %xmm1,-0x30(%rbp)
          ...
        # do the actual malloc:
          cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc@plt>

        # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
          cb9b0f: 48 8d 4d c0 lea -0x40(%rbp),%rcx
          ...
          cb9b19: e8 f2 b7 f3 ff callq bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
        ```

        This is completely unnecessary: if the NMT mode is not detail, the stack is never used. This affects every call site where these macros are used, so we pay the cost even when NMT is disabled.
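
        The copy seen in the disassembly is consistent with how the C++ conditional operator is specified: when one operand is a prvalue (`NativeCallStack(0)`) and the other is an lvalue (the singleton), the result is a prvalue, so the lvalue operand gets copied into a temporary. The following is a minimal standalone sketch of that effect, not HotSpot code. `Stack`, `g_empty`, `callee` and the two helper functions are hypothetical stand-ins, and the `static_cast` variant is only one conceivable way to keep both operands lvalues; it is not necessarily the fix adopted for this issue.

        ```cpp
        #include <cassert>

        // Hypothetical stand-in for NativeCallStack: four pointer-sized slots.
        struct Stack {
          void* frames[4];
          static int copies;  // number of copy-constructions so far

          explicit Stack(int /*skip_frames*/) : frames{} {}
          Stack(const Stack& other) {
            for (int i = 0; i < 4; i++) frames[i] = other.frames[i];
            ++copies;
          }
        };
        int Stack::copies = 0;

        // Stand-in for the NativeCallStack::empty_stack() singleton accessor.
        static const Stack g_empty(0);
        const Stack& empty_stack() { return g_empty; }

        // Callee taking the stack by const reference, like os::malloc.
        const Stack* callee(const Stack& s) { return &s; }

        // Same shape as the macros above: prvalue operand vs. lvalue operand.
        // The conditional expression is a prvalue, so the singleton is copied.
        int copies_with_plain_ternary(bool detail) {
          Stack::copies = 0;
          callee(detail ? Stack(1) : empty_stack());
          return Stack::copies;
        }

        // Hypothetical variant: both operands are lvalues of type const Stack,
        // so the conditional expression is an lvalue and nothing is copied.
        // The temporary Stack(1) lives until the end of the full expression,
        // which covers the call to callee().
        int copies_with_lvalue_ternary(bool detail) {
          Stack::copies = 0;
          const Stack* seen =
              callee(detail ? static_cast<const Stack&>(Stack(1)) : empty_stack());
          assert(detail || seen == &g_empty);  // callee saw the singleton itself
          return Stack::copies;
        }
        ```

        With `detail == false`, the first helper reports one copy and the second reports none. In both variants the callee must not keep the reference beyond the call, because in detail mode it refers to a temporary.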

              stuefe Thomas Stuefe