JDK-8351500: G1: NUMA migrations cause crashes in region allocation

    • gc
    • b15
    • linux

        (Note: This bug manifests on JDK 21 and 17; we don't see crashes or asserts on mainline JDK, but I argue that the underlying root issue also exists in mainline JDK and would best be fixed there.)

        One of our customers found that NUMA migrations (more precisely, the OS task getting scheduled to a different NUMA node) can cause G1 to crash if they happen at exactly the wrong moment.

        The JVM runs with +UseNUMA and +UseNUMAInterleaving, G1 GC with a 4 TB heap, on machines with two or four NUMA nodes, with about 5000 application threads and 159 GC worker threads. The JVM crashes rarely, about once every four hours or so.

        The call stacks are wildly different, e.g.:

        ```
            Stack: [0x00007e506733f000,0x00007e5067540000], sp=0x00007e506753cf10, free space=2039k
            Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
            V [libjvm.so+0xf32422] Symbol::as_klass_external_name() const+0x12 (symbol.hpp:140)
            V [libjvm.so+0xda71ff] SharedRuntime::generate_class_cast_message(Klass*, Klass*, Symbol*)+0x1f (sharedRuntime.cpp:2179)
            V [libjvm.so+0xda99c4] SharedRuntime::generate_class_cast_message(JavaThread*, Klass*)+0xd4 (sharedRuntime.cpp:2171)
            V [libjvm.so+0x578e2c] Runtime1::throw_class_cast_exception(JavaThread*, oopDesc*)+0x13c (c1_Runtime1.cpp:735)
        ```

        In some crashes, it looks like we load a zero from the heap where no zero should be (e.g. as the narrow Klass ID from an oop header).

        However, if you run a debug JVM, you usually see an assert either in `G1Allocator` or in `CollectedHeap`, for example:

        ```
          Current thread (0x00007fb770087b70): JavaThread "Thread-33" [_thread_in_vm, id=123345, stack(0x00007fb7a86d7000,0x00007fb7a87d8000) (1028K)]

          Stack: [0x00007fb7a86d7000,0x00007fb7a87d8000], sp=0x00007fb7a87d62f0, free space=1020k
          Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
          V [libjvm.so+0x9fdd6b] CollectedHeap::fill_with_object_impl(HeapWordImpl**, unsigned long, bool) [clone .part.0]+0x2b (collectedHeap.cpp:470)
          V [libjvm.so+0x9fff1d] CollectedHeap::fill_with_object(HeapWordImpl**, unsigned long, bool)+0x39d (arrayOop.hpp:58)
          V [libjvm.so+0xc5009f] G1AllocRegion::fill_up_remaining_space(HeapRegion*)+0x1ef (g1AllocRegion.cpp:79)
          V [libjvm.so+0xc5027c] G1AllocRegion::retire_internal(HeapRegion*, bool)+0x6c (g1AllocRegion.cpp:106)
          V [libjvm.so+0xc51347] MutatorAllocRegion::retire(bool)+0xb7 (g1AllocRegion.cpp:300)
          V [libjvm.so+0xc50ed9] G1AllocRegion::new_alloc_region_and_allocate(unsigned long, bool)+0x59 (g1AllocRegion.cpp:139)
          V [libjvm.so+0xc9b140] G1CollectedHeap::attempt_allocation_slow(unsigned long)+0x6d0 (g1AllocRegion.inline.hpp:120)
          V [libjvm.so+0xc9e4ff] G1CollectedHeap::attempt_allocation(unsigned long, unsigned long, unsigned long*)+0x39f (g1CollectedHeap.cpp:643)
          V [libjvm.so+0xc9bd4f] G1CollectedHeap::mem_allocate(unsigned long, bool*)+0x5f (g1CollectedHeap.cpp:401)
          V [libjvm.so+0x13b9b6d] MemAllocator::mem_allocate_slow(MemAllocator::Allocation&) const+0x5d (memAllocator.cpp:240)
          V [libjvm.so+0x13b9ca1] MemAllocator::allocate() const+0xa1 (memAllocator.cpp:357)
        ```

        The problem is in `G1Allocator`: its `G1AllocRegion` objects are tied to NUMA nodes. For most actions involving the `G1Allocator`, we determine the NUMA node of the current thread and then redirect the action to that node's `G1AllocRegion`. However, due to OS scheduling, the thread-to-node association can change arbitrarily. That means consecutive calls into `G1Allocator` are not guaranteed to hit the same `G1AllocRegion` object.
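
        To illustrate the dispatch pattern, here is a minimal, hypothetical sketch (shortened names, stubbed types, invented constants; not the HotSpot sources): the allocator keeps one mutator allocation region per NUMA node and re-resolves the calling thread's node index on every entry point.

        ```
        #include <cstddef>

        // Hypothetical, simplified model of the per-node dispatch in G1Allocator.
        // query_current_node() stands in for asking the OS which NUMA node the calling
        // thread currently runs on; the scheduler may change the answer at any time.
        unsigned query_current_node();

        struct AllocRegionModel {
          // Per-node allocation state (current HeapRegion, retained region, ...).
          void* attempt_allocation_locked(size_t word_size);
          void* attempt_allocation_force(size_t word_size);
        };

        struct AllocatorModel {
          static constexpr unsigned kNumNodes = 4;   // invented; the real count comes from the OS
          AllocRegionModel _mutator_alloc_regions[kNumNodes];

          // Every entry point re-resolves the node index independently, so two
          // consecutive calls are not guaranteed to address the same object.
          AllocRegionModel* current_alloc_region() {
            return &_mutator_alloc_regions[query_current_node() % kNumNodes];
          }

          void* attempt_allocation_locked(size_t word_size) {
            return current_alloc_region()->attempt_allocation_locked(word_size);
          }

          void* attempt_allocation_force(size_t word_size) {
            return current_alloc_region()->attempt_allocation_force(word_size);
          }
        };
        ```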

        Now, we have control flows that assume that we work with the same `G1AllocRegion` object for their whole duration, since we build up state in the `G1AllocRegion`. The affected JDK 21 control flow is:

        ```
        - `G1CollectedHeap::attempt_allocation_slow`
          - `G1Allocator::attempt_allocation_locked` (A)
            - `G1AllocRegion::attempt_allocation_locked`
              - `G1AllocRegion::attempt_allocation` (tries again to allocate from the current HeapRegion, now under lock protection); failing that:
              - `G1AllocRegion::attempt_allocation_using_new_region`
                - `G1AllocRegion::retire` (retires the current allocation region; may keep it as the retained region)
                - `G1AllocRegion::new_alloc_region_and_allocate` (allocates a new HeapRegion and sets it as the current allocation region; failing that, sets the dummy region); if all of this fails:
          - `G1Allocator::attempt_allocation_force` (B)
            - `G1AllocRegion::attempt_allocation_force`
              - `G1AllocRegion::new_alloc_region_and_allocate`
        ```

        Here, if the thread changes NUMA nodes between (A) and (B), the two calls address different `G1AllocRegion` objects. But `G1AllocRegion::attempt_allocation_force` assumes that the current allocation region of its object has already been retired by the preceding `G1AllocRegion::attempt_allocation_locked`; that retirement, however, happened on a different `G1AllocRegion` object.
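
        A condensed, hypothetical view of that sequence (again not the literal sources), reusing the model sketched above:

        ```
        // Hypothetical condensation of G1CollectedHeap::attempt_allocation_slow, reusing
        // the AllocatorModel sketch from above; the comments mark where the mismatch arises.
        void* attempt_allocation_slow_sketch(AllocatorModel* allocator, size_t word_size) {
          // (A) Thread runs on node N1: retires N1's current allocation region and tries
          //     to install a new one (on failure, N1's object points at the dummy region).
          void* result = allocator->attempt_allocation_locked(word_size);
          if (result != nullptr) {
            return result;
          }

          // <-- If the OS migrates the thread to node N2 here ...

          // (B) ... this call resolves node N2 and assumes that N2's current allocation
          //     region has just been retired by (A). It was not: (A) retired N1's region,
          //     so N2's still-active region is handled under the wrong assumptions.
          return allocator->attempt_allocation_force(word_size);
        }
        ```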

        This causes us to abandon the current allocation region; it won't be added to the collection set. On debug JVMs, we hit one of two asserts: either we complain that the current allocation region is not the dummy region at the entrance of `new_alloc_region_and_allocate`, or (in JDK 17) we assert when retiring the wrong region because it is emptier than expected. The effect can also be delayed and only surface on the next retire, since the mixup can affect the retained region.
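
        As a hedged sketch of the first of those checks (the exact assert text and placement differ between releases), the precondition amounts to:

        ```
        #include <cassert>

        // Sketch of the failing precondition: new_alloc_region_and_allocate() expects the
        // caller to have retired the current allocation region first, i.e. to have reset
        // it to the shared dummy region. After a node switch between (A) and (B), the other
        // node's object never saw that retire, so its current region is still live and the
        // check trips on debug builds.
        struct AllocRegionStateModel {
          const void* _alloc_region;   // currently installed region of this node's object
          const void* _dummy_region;   // shared "no region installed" marker

          void check_new_alloc_region_precondition() const {
            assert(_alloc_region == _dummy_region && "pre-condition");
          }
        };
        ```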

        ----

        Reproduction and Regression testing

        Reproducing the bug is difficult. I did not have a NUMA machine at hand, and even if I had, NUMA task-to-node migrations are very rare. Therefore, I built something like a "FakeNUMA" mode, which essentially interposes the OS NUMA calls and fakes a NUMA system of 8 nodes. I also added a "FakeNUMAStressMigrations" mode mimicking frequent node migrations. With these simple tools, I could reproduce the customer problem (with gc/TestJNICriticalStressTest, slightly modified to increase the number of JNICritical threads). I plan to bring the FakeNUMA mode upstream, but have no time at the moment to polish it up.
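
        A minimal sketch of the idea behind the stress mode (names, constants and the integration point are made up for illustration; the actual patch interposes HotSpot's OS-level NUMA queries):

        ```
        #include <cstdlib>

        // Illustrative sketch of the FakeNUMA/FakeNUMAStressMigrations idea: wherever the
        // VM would ask the OS for the NUMA node of the current thread, answer from a fake
        // 8-node topology and occasionally change the answer to mimic a task-to-node
        // migration.
        static const unsigned kFakeNumaNodes = 8;
        static const bool kStressMigrations = true;

        unsigned fake_numa_node_of_current_thread() {
          // Each thread remembers its fake home node ...
          thread_local unsigned node = static_cast<unsigned>(std::rand()) % kFakeNumaNodes;
          // ... and with a small probability per query the thread "migrates", so the node
          // index can change between any two lookups, just like a real scheduler migration.
          if (kStressMigrations && (std::rand() % 64) == 0) {
            node = static_cast<unsigned>(std::rand()) % kFakeNumaNodes;
          }
          return node;
        }
        ```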
