JDK-8296437

NMT incurs costs if disabled

Details

    • b25

      Description

        While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.

        The CURRENT_PC and CALLER_PC macros are used to create `NativeCallStack` objects on the spot:

        ```
        #define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
                            NativeCallStack(0) : NativeCallStack::empty_stack())
        #define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
                            NativeCallStack(1) : NativeCallStack::empty_stack())
        ```

        The result is passed to a callee routine, which usually takes the argument by const reference, e.g. os::malloc:

        ```
        void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
        ```

        In CURRENT_PC and CALLER_PC, the left-hand side of the ':' handles detail mode, when we actually do collect a stack. In that case, the stack sits on the thread stack as an anonymous automatic variable and is filled by the stack walker. That's all fine.

        The right-hand side of the ':' handles the case when we don't want a stack. In that case, the intent is to hand down a reference to a pre-created "empty stack" singleton (NativeCallStack::empty_stack()).

        However, that does not work as intended. The C++ compiler - at least gcc on Linux - generates code that copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions: two 16-byte loads and two 16-byte stores (an NMT stack consists of 4 pointer-sized slots containing PCs, i.e. 32 bytes on a 64-bit platform):

        ```
        0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
        ...
        # Load tracking level
          cb9a77: 48 8d 1d 02 35 78 00 lea 0x783502(%rip),%rbx # 143cf80 <_ZN10MemTracker15_tracking_levelE>
          cb9a7e: 8b 03 mov (%rbx),%eax
        # detail (3) tracking?
          cb9a80: 83 f8 03 cmp $0x3,%eax
        # yes: go and collect callstack
          cb9a83: 0f 84 57 01 00 00 je cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
        # no: copy the content of NativeCallStack::_empty_stack to the local stack, 16 bytes at a time:
          cb9a89: 48 8d 05 30 44 78 00 lea 0x784430(%rip),%rax # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
          cb9a90: f3 0f 6f 00 movdqu (%rax),%xmm0
          cb9a94: f3 0f 6f 48 10 movdqu 0x10(%rax),%xmm1
          cb9a99: 0f 11 45 c0 movups %xmm0,-0x40(%rbp)
          cb9a9d: 0f 11 4d d0 movups %xmm1,-0x30(%rbp)
          ...
        # do the actual malloc:
          cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc@plt>

        # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
          cb9b0f: 48 8d 4d c0 lea -0x40(%rbp),%rcx
          ...
          cb9b19: e8 f2 b7 f3 ff callq bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
        ```
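
The copy seen in the disassembly is consistent with the C++ rules for the conditional operator: when one operand is a temporary (a prvalue) and the other is an lvalue of the same class type, the result is a prvalue, so the lvalue operand is copied into a fresh temporary before the reference parameter binds to it. The following is a minimal sketch that makes this copy observable. `Stack`, `callee` and `copies_for_macro_style_call` are hypothetical stand-ins invented for illustration, not HotSpot code, and the user-written copy constructor exists only to count copies (the real class would be copied as plain bytes).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for NativeCallStack: four pointer-sized slots.
struct Stack {
  static int copies;                  // number of copy-constructor calls so far
  const void* pcs[4];

  // Pretend to walk the stack; here we only zero the slots.
  explicit Stack(int /*frames_to_skip*/) : pcs{} {}

  // Instrumented copy constructor, only there to count copies.
  Stack(const Stack& other) {
    for (std::size_t i = 0; i < 4; i++) pcs[i] = other.pcs[i];
    ++copies;
  }

  // Hands out a reference to the pre-created singleton.
  static const Stack& empty_stack() { return _empty; }

 private:
  static const Stack _empty;
};

int Stack::copies = 0;
const Stack Stack::_empty(0);

// The callee takes the stack by const reference, as os::malloc does.
static const void* callee(const Stack& s) { return s.pcs[0]; }

// Same shape as CURRENT_PC / CALLER_PC: a temporary on the left of ':',
// a reference to the singleton on the right.
// Returns the number of copies this one call triggered.
int copies_for_macro_style_call(bool tracking_detail) {
  const int before = Stack::copies;
  callee(tracking_detail ? Stack(0) : Stack::empty_stack());
  return Stack::copies - before;
}
```

In this sketch, `copies_for_macro_style_call(false)` reports one copy: the singleton is copied into a temporary even though the callee only needs a reference.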

        This is completely unnecessary: if the NMT mode is not detail, the stack is never used. The copy happens at every call site that uses these macros, so we pay the cost even when NMT is disabled.
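
One conceivable way to avoid the copy is to branch explicitly instead of using the conditional operator, so that the reference parameter binds directly to the singleton. This is only a sketch under assumed names (`CallStack`, `record`, `last_seen` and `copies_for_branching_call` are hypothetical), and it is not necessarily how the issue was resolved in the JDK.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for NativeCallStack (not the HotSpot class).
struct CallStack {
  static int copies;                  // number of copy-constructor calls so far
  const void* pcs[4];

  explicit CallStack(int /*frames_to_skip*/) : pcs{} {}

  // Instrumented copy constructor, only there to count copies.
  CallStack(const CallStack& other) {
    for (std::size_t i = 0; i < 4; i++) pcs[i] = other.pcs[i];
    ++copies;
  }

  static const CallStack& empty_stack() { return _empty; }

 private:
  static const CallStack _empty;
};

int CallStack::copies = 0;
const CallStack CallStack::_empty(0);

// Remembers which object the callee was handed, for inspection.
static const CallStack* last_seen = nullptr;

// Callee taking the stack by const reference.
static void record(const CallStack& s) { last_seen = &s; }

// Explicit branch: each arm passes its argument straight to the callee,
// so no common result type has to be materialized.
// Returns the number of copies this one call triggered.
int copies_for_branching_call(bool tracking_detail) {
  const int before = CallStack::copies;
  if (tracking_detail) {
    record(CallStack(0));               // temporary bound directly to the reference
  } else {
    record(CallStack::empty_stack());   // reference binds to the singleton itself
  }
  return CallStack::copies - before;
}
```

With this shape, the non-detail path hands the callee the address of the singleton itself and performs no copy. The trade-off is that a plain expression macro can no longer be used as a function argument, so call sites would need a different structure.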

              People

                stuefe Thomas Stuefe