ZGC: NUMA-Affinity for Worker Threads in the Relocation Phase

    • Type: Bug
    • Resolution: Unresolved
    • Priority: P4
    • Fix Version/s: 26
    • Affects Version/s: 26
    • Component/s: hotspot / gc

      We have observed a performance regression after JDK-8359683 in SPECjbb2015 when running in a small environment with a relatively small heap on a NUMA machine with two NUMA nodes. There is roughly a 1% regression with a 16GB heap (-Xms16G -Xmx16G) on 16 cores spread evenly across the two NUMA nodes. The regression is still noticeable, somewhere between 0.5% and 1%, with 32 cores spread evenly across two NUMA nodes.

      We have narrowed the regression down to two contributing factors introduced by JDK-8359683. First, since we now scale the number of relocation targets (the pages that objects are relocated to in the relocation phase) with the number of NUMA nodes on the system, we need to allocate more pages. This increases the allocation pressure, which in turn negatively impacts the application's throughput. The effect is especially noticeable with small heaps, where the headroom for the GC is also small.

      The other contributing factor is that when running on a relatively low number of cores, the number of GC workers chosen is also low. The heuristics choose 25% of the available cores, so 4 workers when running on 16 cores. With few worker threads we are not guaranteed to have threads running on CPUs belonging to all NUMA nodes on the system, so some threads end up working on remote memory, which is particularly slow. For most Major collections in the observed runs we only choose a single thread for the Old collection, which is guaranteed to perform poorly when memory lives on multiple NUMA nodes.
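
      As a rough illustration of the arithmetic above, a minimal sketch of a plain 25%-of-cores rule (the actual HotSpot heuristics take more inputs into account; worker_count() below is a hypothetical stand-in):

      #include <algorithm>
      #include <cstdio>
      #include <thread>

      // Choose the number of GC workers as 25% of the available cores, but at least one.
      static unsigned worker_count(unsigned cpus) {
        return std::max(1u, cpus / 4);
      }

      int main() {
        unsigned cpus = std::thread::hardware_concurrency();
        std::printf("%u cpus -> %u workers\n", cpus, worker_count(cpus)); // e.g. 16 -> 4
        return 0;
      }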

      We can't do much about the increased allocation pressure from the larger set of relocation targets, as it is part of the core design. However, there is an opportunity to influence where the operating system schedules our worker threads. libnuma provides an API for binding threads to the CPUs belonging to a specific NUMA node. Using this API, we can set the affinity of worker threads while they relocate pages, so that they always work on NUMA-local memory and get the best possible memory access times. The approach is limited to the relocation phase: before the relocation phase completes, the workers have their affinity "stripped" again, so they can be scheduled on any CPU the JVM is allowed to run on. This fixes the regression and also brings an extra performance boost, resulting in a 2% speedup with the regression as the baseline, or roughly a 1% improvement over the numbers from before JDK-8359683.
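
      A minimal sketch of the set-then-strip mechanism described above, assuming the public libnuma calls numa_run_on_node()/numa_run_on_node(-1) (the actual HotSpot change goes through the VM's own NUMA abstractions; relocate_pages_on_node() is a hypothetical placeholder for the per-node relocation work):

      #include <numa.h>
      #include <cstdio>

      // Hypothetical placeholder for relocating the pages that belong to one NUMA node.
      static void relocate_pages_on_node(int node) {
        std::printf("relocating pages on node %d\n", node);
      }

      int main() {
        if (numa_available() < 0) {
          std::fprintf(stderr, "libnuma not available\n");
          return 1;
        }
        const int nodes = numa_num_configured_nodes();
        for (int node = 0; node < nodes; node++) {
          // Restrict the calling thread to CPUs belonging to 'node' so the
          // relocation work only touches NUMA-local memory.
          if (numa_run_on_node(node) != 0) {
            std::perror("numa_run_on_node");
            continue;
          }
          relocate_pages_on_node(node);
        }
        // "Strip" the affinity again: -1 lets the kernel schedule the thread
        // on any CPU the process is allowed to run on.
        numa_run_on_node(-1);
        return 0;
      }

      Compiled with -lnuma, this binds the calling thread to one node at a time and then releases the binding, mirroring the idea of setting affinity only for the duration of the relocation work.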

            Assignee: Joel Sikström
            Reporter: Joel Sikström
            Votes: 0
            Watchers: 3

              Created:
              Updated: