JDK-8213827: NUMA heap allocation does not respect process membind/interleave settings


    • Type: Enhancement
    • Resolution: Fixed
    • Priority: P4
    • Fix Version: 13
    • Affects Version: 12
    • Component: hotspot
    • Subcomponent: gc
    • Resolved in Build: b05
    • CPU: x86
    • OS: linux

      NUMA interleaving of old gen and survivor space memory (for parallel GC) tells the OS to interleave memory across all nodes of a NUMA system. However, the VM process may be configured to run on only a few nodes, which means that large parts of the heap will be located on foreign nodes. This can incur a large performance penalty.

      The proposed solution is to tell the OS to interleave memory only across the available nodes when enabling NUMA.

      The following describes the situation in more detail for the cases where the numactl --membind and --interleave options are used. Addresses from both the GC log and the process numa_maps output clearly show that the Java process is configured to access other memory nodes even though it should not do so.

      The two relevant scenarios are:

               Case 1. Using numactl --membind only

      To show: numa_maps shows these regions are INTERLEAVED ON ALL NODES instead of the specified NUMA memory nodes 0 and 1.

      $ numactl -m 0-1 -N 0-1 <command and its arguments>

      GC Log output
      -------------------

      GC Options: -Xlog:gc*=info,gc+heap=debug -XX:+UseParallelGC

      [602.180s][debug][gc,heap ] GC(20) Heap before GC invocations=21 (full 4): PSYoungGen total 120604672K, used 11907587K [0x00002afc4b200000, 0x00002b198b200000, 0x00002b198b200000)
      [602.180s][debug][gc,heap ] GC(20) eden space 118525952K, 8% used [0x00002afc4b200000,0x00002b0bb1b376e0,0x00002b188d600000)
      [602.180s][debug][gc,heap ] GC(20) lgrp 0 space 59262976K, 8% used [0x00002afc4b200000,0x00002afd89bef450,0x00002b0a6c400000)
      [602.180s][debug][gc,heap ] GC(20) lgrp 1 space 59262976K, 8% used [0x00002b0a6c400000,0x00002b0bb1b376e0,0x00002b188d600000)
      [602.180s][debug][gc,heap ] GC(20) from space 2078720K, 65% used [0x00002b190c400000,0x00002b195ef5a0d0,0x00002b198b200000)
      [602.180s][debug][gc,heap ] GC(20) to space 2078720K, 0% used [0x00002b188d600000,0x00002b188d600000,0x00002b190c400000)
      [602.180s][debug][gc,heap ] GC(20) ParOldGen total 2097152K, used 226685K [0x00002afbcb200000, 0x00002afc4b200000, 0x00002afc4b200000)
      [602.180s][debug][gc,heap ] GC(20) object space 2097152K, 10% used [0x00002afbcb200000,0x00002afbd8f5f460,0x00002afc4b200000)
      [602.180s][debug][gc,heap ] GC(20) Metaspace used 28462K, capacity 29008K, committed 29184K, reserved 30720K

      Operating system output:
      ---------------------------------

      $ cat /proc/4947/status

      Cpus_allowed: 00000000,0000ffff,00000000,0000ffff
      Cpus_allowed_list: 0-15,64-79
      Mems_allowed:
      00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff
      Mems_allowed_list: 0-7

      That is, only two CPU nodes are active, but all eight memory nodes are allowed.

      $ cat /proc/4947/numa_maps

          =======> The following addresses are interleaved on all nodes.

      2afbb4f4c000 interleave:0-7 anon=16 dirty=16 N0=2 N1=2 N2=2 N3=2 N4=2 N5=2 N6=2 N7=2 kernelpagesize_kB=4
      2afbb4f6c000 interleave:0-7
      2afbb7e88000 interleave:0-7 anon=50 dirty=50 N0=7 N1=7 N2=6 N3=6 N4=6 N5=6 N6=6 N7=6 kernelpagesize_kB=4
      2afbbc000000 interleave:0-7 anon=8704 dirty=8704 N0=1600 N1=1088 N2=1088 N3=576 N4=1088 N5=1088 N6=1088 N7=1088 kernelpagesize_kB=4
      2afbc3be6000 interleave:0-7 anon=6682 dirty=6682 N0=1027 N1=1027 N2=515 N3=515 N4=515 N5=1027 N6=1028 N7=1028 kernelpagesize_kB=4
      2afbcb000000 interleave:0-7 anon=50 dirty=50 N0=7 N1=7 N2=6 N3=6 N4=6 N5=6 N6=6 N7=6 kernelpagesize_kB=4
      2afbcb200000 interleave:0-7 anon=524288 dirty=524288 N0=65536 N1=65536 N2=65536 N3=65536 N4=65536 N5=65536 N6=65536 N7=65536 kernelpagesize_kB=4

                  ==> OLD GEN Address

      2afc4b200000 prefer:0 anon=1536 dirty=1536 N0=1536 kernelpagesize_kB=4
      2b0a6c400000 prefer:1 anon=512 dirty=512 N1=512 kernelpagesize_kB=4
      2b188d600000 interleave:0-7 anon=1040384 dirty=1040384 N0=130048 N1=130048 N2=130048 N3=130048 N4=130048 N5=130048 N6=130048 N7=130048 kernelpagesize_kB=4

                  ==> Survivor Region

      2b198b600000 interleave:0-7 anon=60929 dirty=60929 N0=7233 N1=7744 N2=7232 N3=7744 N4=7744 N5=7744 N6=7744 N7=7744 kernelpagesize_kB=4

      ------------------------------------------------------------------------------------------------------------------

            Case 2. Using numactl --interleave only:

      $ numactl -i 0-1 -N 0-1 <command and its arguments>

      To show: the numa_maps output below shows memory interleaved on all nodes instead of the specified NUMA memory nodes 0 and 1.

      GC log output:
      -------------------

      GC Options: -Xlog:gc*=info,gc+heap=debug -XX:+UseParallelGC

      [2216.439s][debug][gc,heap ] GC(159) Heap before GC invocations=160 (full 9): PSYoungGen total 120604672K, used 30143454K [0x00002b9d47c00000, 0x00002bba87c00000, 0x00002bba87c00000)
      [2216.439s][debug][gc,heap ] GC(159) eden space 118525952K, 24% used [0x00002b9d47c00000,0x00002ba458400000,0x00002bb98a000000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 0 space 14815232K, 98% used [0x00002b9d47c00000,0x00002ba0beb87c90,0x00002ba0d0000000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 1 space 14815232K, 100% used [0x00002ba0d0000000,0x00002ba458400000,0x00002ba458400000)

             ==> Memory allocated on the following nodes is unused.

      [2216.439s][debug][gc,heap ] GC(159) lgrp 2 space 14815232K, 0% used [0x00002ba458400000,0x00002ba458400000,0x00002ba7e0800000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 3 space 14815232K, 0% used [0x00002ba7e0800000,0x00002ba7e0800000,0x00002bab68c00000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 4 space 14815232K, 0% used [0x00002bab68c00000,0x00002bab68c00000,0x00002baef1000000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 5 space 14815232K, 0% used [0x00002baef1000000,0x00002baef1000000,0x00002bb279400000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 6 space 14815232K, 0% used [0x00002bb279400000,0x00002bb279400000,0x00002bb601800000)
      [2216.439s][debug][gc,heap ] GC(159) lgrp 7 space 14819328K, 0% used [0x00002bb601800000,0x00002bb601800000,0x00002bb98a000000)
      [2216.439s][debug][gc,heap ] GC(159) from space 2078720K, 38% used [0x00002bba08e00000,0x00002bba3976fb70,0x00002bba87c00000)
      [2216.439s][debug][gc,heap ] GC(159) to space 2078720K, 0% used [0x00002bb98a000000,0x00002bb98a000000,0x00002bba08e00000)
      [2216.439s][debug][gc,heap ] GC(159) ParOldGen total 2097152K, used 685229K [0x00002b9cc7c00000, 0x00002b9d47c00000, 0x00002b9d47c00000)
      [2216.439s][debug][gc,heap ] GC(159) object space 2097152K, 32% used [0x00002b9cc7c00000,0x00002b9cf192b6e8,0x00002b9d47c00000)
      [2216.439s][debug][gc,heap ] GC(159) Metaspace used 28753K, capacity 29257K, committed 29440K, reserved 30720K

      Operating system output:
      ---------------------------------

      $ cat /proc/<pid>/status

      Cpus_allowed: 00000000,0000ffff,00000000,0000ffff
      Cpus_allowed_list: 0-15,64-79
      Mems_allowed:
      00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff
      Mems_allowed_list: 0-7

      (Please note that “Mems_allowed” and “Mems_allowed_list” also show an incorrect range. This could be a libnuma issue in the membind case.)

      $ cat /proc/<pid>/numa_maps

         ==> The following addresses are interleaved on all nodes.
      2b9cb1992000 interleave:0-7 anon=16 dirty=16 N0=2 N1=2 N2=2 N3=2 N4=2 N5=2 N6=2 N7=2 kernelpagesize_kB=4
      2b9cb19b2000 interleave:0-7
      2b9cb3e65000 interleave:0-7 anon=50 dirty=50 N0=6 N1=6 N2=6 N3=6 N4=6 N5=7 N6=7 N7=6 kernelpagesize_kB=4
      2b9cb8a69000 interleave:0-7 anon=8599 dirty=8599 N0=626 N1=1139 N2=1139 N3=1139 N4=1139 N5=1139 N6=1139 N7=1139 kernelpagesize_kB=4
      2b9cc064f000 interleave:0-7 anon=6577 dirty=6577 N0=566 N1=566 N2=566 N3=1078 N4=1078 N5=1078 N6=1078 N7=567 kernelpagesize_kB=4
      2b9cc7a69000 interleave:0-7 anon=50 dirty=50 N0=6 N1=7 N2=7 N3=6 N4=6 N5=6 N6=6 N7=6 kernelpagesize_kB=4
      2b9cc7c00000 interleave:0-7 anon=524288 dirty=524288 N0=65536 N1=65536 N2=65536 N3=65536 N4=65536 N5=65536 N6=65536 N7=65536 kernelpagesize_kB=4
      2b9d47c00000 prefer:0 anon=2560 dirty=2560 N0=2560 kernelpagesize_kB=4

         ==> Logical group 1

      2ba0d0000000 prefer:1

        ==> Logical group 2

      2ba458400000 prefer:2

        ==> This entry and all those below are unnecessary and leave memory unused.

      2ba7e0800000 prefer:3
      2bab68c00000 prefer:4
      2baef1000000 prefer:5
      2bb279400000 prefer:6
      2bb601800000 prefer:7

      2bb98a000000 interleave:0-7 anon=1040384 dirty=1040384 N0=130048 N1=130048 N2=130048 N3=130048 N4=130048 N5=130048 N6=130048 N7=130048 kernelpagesize_kB=4
      2bba88000000 interleave:0-7 anon=60929 dirty=60929 N0=7745 N1=7744 N2=7744 N3=7744 N4=7744 N5=7232 N6=7744 N7=7232 kernelpagesize_kB=4


      Proposed patch:

      The patch obtains the allowed CPU/memory nodes by calling the following APIs (the man page definitions for these functions are also given below).

       1. For membind: use numa_get_membind to get the membind bitmask (already used in the code).

          "numa_get_membind() returns the mask of nodes from which memory can currently be allocated. If the returned mask is equal to numa_all_nodes, then memory allocation is allowed from all nodes."

       2. For interleave: use numa_get_interleave_mask to get the interleave mask (currently not used/called in the JDK).

           "numa_get_interleave_mask() returns the current interleave mask if the task's memory allocation policy is page interleaved. Otherwise, this function returns an empty mask."

      The patch retrieves the node bitmasks from both functions to determine the current "mode": if the interleave mask is empty, membind has been configured; otherwise, interleave.

      A call to “numa_interleave_memory” (made indirectly through the “numa_make_global” function) with the correct bitmask identified above fixes this issue.
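      For illustration, the mode detection and the corrected interleaving call can be sketched as a small standalone libnuma program. This is only a sketch of the approach, not the actual HotSpot change: the helper name choose_numa_mask and the mmap'd dummy region are made up for the example, and numa_bitmask_weight is assumed to be available (libnuma 2.x).

      #include <numa.h>        // libnuma; compile and link with -lnuma
      #include <sys/mman.h>
      #include <cstdio>

      // Pick the bitmask that numa_interleave_memory() should receive:
      // a non-empty interleave mask means the process was started under
      // "numactl --interleave"; otherwise fall back to the membind mask
      // ("numactl --membind" or the default policy).
      static struct bitmask* choose_numa_mask() {
        struct bitmask* interleave = numa_get_interleave_mask();
        if (interleave != NULL && numa_bitmask_weight(interleave) > 0) {
          return interleave;               // --interleave case
        }
        return numa_get_membind();         // --membind / default case
      }

      int main() {
        if (numa_available() == -1) {
          std::fprintf(stderr, "libnuma not available\n");
          return 1;
        }

        struct bitmask* mask = choose_numa_mask();
        for (int node = 0; node <= numa_max_node(); node++) {
          if (numa_bitmask_isbitset(mask, node)) {
            std::printf("allowed node: %d\n", node);
          }
        }

        // Interleave a dummy region only across the allowed nodes, which is
        // what the heap regions handled by numa_make_global() should get.
        size_t size = 64 * 1024 * 1024;
        void* mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED) return 1;
        numa_interleave_memory(mem, size, mask);
        return 0;
      }

      With the mask restricted in this way, the numa_maps entries for such a region would be expected to show, e.g., interleave:0-1 rather than interleave:0-7 for the runs above.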

      Improvements:

      This patch was tested on EPYC with the SPECjbb benchmark; the score improvements are given below.

      1. With numactl --membind:

          Max-jOPS improved by 5-12% and Critical-jOPS by 2-6%.

      2. With numactl --interleave (this patch fixes memory usage when invoked with numactl -i):

          Max-jOPS improved by 5-15% and Critical-jOPS by 11-100%.

      3. With this fix, turning the UseNUMAInterleaving flag on or off has no effect when memory is externally interleaved through numactl.

      4. The score with the UseNUMA flag improved by ~14% when it is enabled for a single NUMA node. Currently, NUMA support gets disabled when the process is externally bound to a single node.

      Contributed by A. Pawar (http://mail.openjdk.java.net/pipermail/hotspot-dev/2018-November/035138.html)

            Assignee: Thomas Schatzl (tschatzl)
            Reporter: David Holmes (dholmes)
