JDK-8330275

Crash in XMark::follow_array

    • gc
    • b22

        TL;DR:

        On Arm64, with a heap larger than 1TB on a machine with a 48-bit address space, XAddressOffsetBits can be 45. This can cause subsequent crashes during marking because the code implicitly expects XAddressOffsetBits <= 44.

        Details:

        We have multiple reports of crashes in ZGC (in both generational and non-generational mode) running on Java 21.

        Non-generational mode crash:

        Current thread (0x0000ffff9809cfc0): WorkerThread "XWorker#13" [id=176319, stack(0x0000ffff0e9ca000,0x0000ffff0ebc8000) (2040K)]

        Stack: [0x0000ffff0e9ca000,0x0000ffff0ebc8000], sp=0x0000ffff0ebc0550, free space=2009k
        Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
        V [libjvm.so+0xeabf5c] XMark::follow_array(unsigned long, unsigned long, bool) [clone .part.0]+0x6c
        V [libjvm.so+0xead76c] XMark::work_without_timeout(XMarkContext*)+0xac
        V [libjvm.so+0xeae378] XMark::work(unsigned long)+0xb8
        V [libjvm.so+0xed2148] XTask::Task::work(unsigned int)+0x28
        V [libjvm.so+0xe931ec] WorkerThread::run()+0xac
        V [libjvm.so+0xde927c] Thread::call_run()+0xbc
        V [libjvm.so+0xbe3c3c] thread_native_entry(Thread*)+0xdc
        C [libc.so.6+0x82a38] start_thread+0x2d4

        siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x000022a2379b4000

        -----------------------------------

        Generational mode crash:

        Current thread (0x0000ffff941315e0): WorkerThread "ZWorkerYoung#4" [id=87563, stack(0x0000fffea1d22000,0x0000fffea1f20000) (2040K)]

        Stack: [0x0000fffea1d22000,0x0000fffea1f20000], sp=0x0000fffea1f18350, free space=2008k
        Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
        V [libjvm.so+0xf08858] mark_barrier_on_oop_array(zpointer volatile*, unsigned long, bool, bool)+0x48
        V [libjvm.so+0xf0aed8] ZMark::drain(ZMarkContext*)+0xd8
        V [libjvm.so+0xf0b070] ZMark::follow_work(bool)+0xfc
        V [libjvm.so+0xf35214] ZRememberedScanMarkFollowTask::work_inner()+0xb4
        V [libjvm.so+0xf35064] ZRememberedScanMarkFollowTask::work()+0x24
        V [libjvm.so+0xe931ec] WorkerThread::run()+0xac
        V [libjvm.so+0xde927c] Thread::call_run()+0xbc
        V [libjvm.so+0xbe3c3c] thread_native_entry(Thread*)+0xdc
        C [libc.so.6+0x82a38] start_thread+0x2d4

        siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x0000229851cd4000

        ---------------------------------------

        With a fastdebug build, the same problem shows up as an assertion failure:

        #
        # A fatal error has been detected by the Java Runtime Environment:
        #
        # Internal Error (/builddir/build/BUILD/java-21-openjdk-21.0.2.0.13-1.el7_9.aarch64/jdk-21.0.2+13/src/hotspot/share/gc/x/xBitField.hpp:76), pid=1517706, tid=1517826
        # assert(((ContainerType)value & (FieldMask << ValueShift)) == (ContainerType)value) failed: Invalid value
        #
        # JRE version: OpenJDK Runtime Environment (Red_Hat-21.0.2.0.13-1) (21.0.2+13) (fastdebug build 21.0.2+13-LTS)
        # Java VM: OpenJDK 64-Bit Server VM (Red_Hat-21.0.2.0.13-1) (fastdebug 21.0.2+13-LTS, mixed mode, tiered, compressed class ptrs, z gc, linux-aarch64)
        # Problematic frame:
        # V [libjvm.so+0x195a8bc] XMark::push_partial_array(unsigned long, unsigned long, bool)+0x24c

        Current thread (0x0000ffff940bb320): WorkerThread "XWorker#30" [id=1517826, stack(0x0000ffff084ce000,0x0000ffff086cc000) (2040K)]
         
        Stack: [0x0000ffff084ce000,0x0000ffff086cc000], sp=0x0000ffff086c44c0, free space=2009k
        Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
        V [libjvm.so+0x195a8bc] XMark::push_partial_array(unsigned long, unsigned long, bool)+0x24c (xBitField.hpp:76)
        V [libjvm.so+0x195b034] XMark::follow_large_array(unsigned long, unsigned long, bool)+0x110
        V [libjvm.so+0x195b740] XMark::mark_and_follow(XMarkContext*, XMarkStackEntry)+0x300
        V [libjvm.so+0x195bf6c] XMark::work_without_timeout(XMarkContext*)+0xcc
        V [libjvm.so+0x195c2e8] XMark::work(unsigned long)+0x164
        V [libjvm.so+0x1998bf8] XTask::Task::work(unsigned int)+0x28
        V [libjvm.so+0x19213a0] WorkerThread::run()+0xac
        V [libjvm.so+0x17ea604] Thread::call_run()+0xb0
        V [libjvm.so+0x141dc68] thread_native_entry(Thread*)+0x138
        C [libc.so.6+0x82a38] start_thread+0x2d4

        -----------------------------------
        Some other command line flags used are:

          -Xmx1200G -Xms1200G -XX:+UseLargePages -XX:+UseTransparentHugePages -XX:SoftMaxHeapSize=840G -XX:+AlwaysPreTouch

        The stack traces indicate that the problem happens during marking of partial arrays.
        Looking at the assertion failure in the debug build, it appears that the "value" being encoded has more bits set than expected.

          static ContainerType encode(ValueType value) {
            assert(((ContainerType)value & (FieldMask << ValueShift)) == (ContainerType)value, "Invalid value");
            return ((ContainerType)value >> ValueShift) << FieldShift;
          }

        The "value" parameter is the address of the partial array being pushed to the mark stack in XMark::push_partial_array():

        void XMark::push_partial_array(uintptr_t addr, size_t size, bool finalizable) {
          ...
          const uintptr_t offset = XAddress::offset(addr) >> XMarkPartialArrayMinSizeShift;
          const uintptr_t length = size / oopSize;
          const XMarkStackEntry entry(offset, length, finalizable);
          ...
        }

        XAddress::offset(value) returns (value & XAddressOffsetMask).

        XAddressOffsetMask is a platform dependent value calculated in XAddress::initialize() as:

          XAddressOffsetBits = XPlatformAddressOffsetBits();
          XAddressOffsetMask = (((uintptr_t)1 << XAddressOffsetBits) - 1) << XAddressOffsetShift;

        XPlatformAddressOffsetBits() for aarch64 in this case returns 45. See the calculations below:

        size_t XPlatformAddressOffsetBits() {
          const static size_t valid_max_address_offset_bits = probe_valid_max_address_bit() + 1; // 47 + 1 = 48 (value of probe_valid_max_address_bit() is present in error logs)
          const size_t max_address_offset_bits = valid_max_address_offset_bits - 3; // 48 - 3 = 45
          const size_t min_address_offset_bits = max_address_offset_bits - 2; // 45 - 2 = 43
          const size_t address_offset = round_up_power_of_2(MaxHeapSize * XVirtualToPhysicalRatio); // MaxHeapSize = 1200GB, XVirtualToPhysicalRatio = 16 so address_offset = 2^45
          const size_t address_offset_bits = log2i_exact(address_offset); // address_offset_bits = 45
          return clamp(address_offset_bits, min_address_offset_bits, max_address_offset_bits); // returns min(max(address_offset_bits, min_address_offset_bits), max_address_offset_bits) = 45
        }
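        The arithmetic in the comments above can be checked with a minimal standalone sketch. The helper names below (round_up_pow2, log2_exact, clamp_bits, offset_bits) are hypothetical stand-ins written for this illustration, not the HotSpot functions, and the probed maximum address bit (47) is passed in as a plain parameter.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the HotSpot helpers; illustration only.
constexpr std::uint64_t round_up_pow2(std::uint64_t v) {
  std::uint64_t p = 1;
  while (p < v) p <<= 1;
  return p;
}

constexpr std::size_t log2_exact(std::uint64_t v) {
  std::size_t n = 0;
  while (v > 1) { v >>= 1; ++n; }
  return n;
}

constexpr std::size_t clamp_bits(std::size_t v, std::size_t lo, std::size_t hi) {
  return v < lo ? lo : (v > hi ? hi : v);
}

// Mirrors the calculation quoted above, with the probe result and heap
// size supplied as arguments instead of being read from the VM.
constexpr std::size_t offset_bits(std::size_t max_address_bit,
                                  std::uint64_t max_heap_bytes,
                                  std::uint64_t virtual_to_physical_ratio) {
  const std::size_t valid_max_bits = max_address_bit + 1;
  const std::size_t max_bits = valid_max_bits - 3;
  const std::size_t min_bits = max_bits - 2;
  const std::uint64_t address_offset =
      round_up_pow2(max_heap_bytes * virtual_to_physical_ratio);
  return clamp_bits(log2_exact(address_offset), min_bits, max_bits);
}

constexpr std::uint64_t G = std::uint64_t{1} << 30;

// -Xmx1200G on a 48-bit address space yields 45 offset bits.
static_assert(offset_bits(47, 1200 * G, 16) == 45, "1200G heap -> 45 bits");
// A 1TB heap on the same machine stays at 44 bits.
static_assert(offset_bits(47, 1024 * G, 16) == 44, "1024G heap -> 44 bits");
```

        Under these assumptions, any heap above 1TB rounds up to 2^45 and reaches the upper clamp, which matches the configuration reported here.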

        So,
          XAddressOffsetBits = 45
          XAddressOffsetShift = 0
          which implies XAddressOffsetMask = 0x0000_1FFF_FFFF_FFFF

        So XAddress::offset(addr) returns the least significant 45 bits of the address.
        XMarkPartialArrayMinSizeShift is 12, therefore "offset" in XMark::push_partial_array() can be up to 33 bits wide (45 - 12 = 33).
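        A small sketch of the width calculation follows. The function names are hypothetical and only model the mask-and-shift arithmetic described here; the 32-bit field width is the one this report attributes to the partial array offset in XMarkStackEntry.

```cpp
#include <cstdint>

// Mask covering the low `bits` bits (XAddressOffsetShift is 0 here).
constexpr std::uint64_t offset_mask(unsigned bits) {
  return (std::uint64_t{1} << bits) - 1;
}

// Largest value that offset >> shift can take for a given offset width.
constexpr std::uint64_t max_shifted_offset(unsigned bits, unsigned shift) {
  return offset_mask(bits) >> shift;
}

// True if `value` can be stored in a field of `field_bits` bits.
constexpr bool fits_in_field(std::uint64_t value, unsigned field_bits) {
  return (value >> field_bits) == 0;
}

static_assert(offset_mask(45) == 0x00001FFFFFFFFFFFULL, "45-bit mask");
// With 45 offset bits the shifted offset needs 33 bits and overflows 32.
static_assert(max_shifted_offset(45, 12) == 0x1FFFFFFFFULL, "33 bits");
static_assert(!fits_in_field(max_shifted_offset(45, 12), 32), "overflow");
// With 44 offset bits it fits exactly.
static_assert(fits_in_field(max_shifted_offset(44, 12), 32), "fits");
```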

        But the encoding of the offset in XMarkStackEntry uses only 32 bits, which means the 33rd bit is discarded:

          typedef XBitField<uint64_t, size_t, 32, 32> field_partial_array_offset;

        If the partial array address happens to have the 45th offset bit set (bit 44 when counting from zero), then its encoding loses the MSB, and this can trigger the assertion we are seeing with the debug build.

        This can also explain the other crashes seen with the release build. Those crashes happen when a partial array is being marked.
        Because the partial array offset is encoded incorrectly (the MSB is lost), decoding it later returns an invalid address.
        Trying to dereference that address results in a crash.
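        The round trip can be modelled with a simplified sketch. It assumes the offset field occupies the upper 32 bits of a 64-bit entry and ignores the other fields of the mark stack entry; encode_offset and decode_offset are hypothetical names, and the sample offsets are made up for illustration rather than taken from the crash logs.

```cpp
#include <cstdint>

constexpr unsigned kFieldShift = 32;    // offset field starts at bit 32
constexpr unsigned kFieldBits = 32;     // offset field is 32 bits wide
constexpr unsigned kMinSizeShift = 12;  // XMarkPartialArrayMinSizeShift

// Release-build behaviour: shift into place with no range check, so any
// bit above the 32-bit field falls off the top of the 64-bit container.
constexpr std::uint64_t encode_offset(std::uint64_t offset) {
  return (offset >> kMinSizeShift) << kFieldShift;
}

// Extract the field and scale it back to a byte offset.
constexpr std::uint64_t decode_offset(std::uint64_t entry) {
  const std::uint64_t mask = (std::uint64_t{1} << kFieldBits) - 1;
  return ((entry >> kFieldShift) & mask) << kMinSizeShift;
}

// Hypothetical offsets: one that needs 44 bits, one that needs 45.
constexpr std::uint64_t kOffset44 = (std::uint64_t{1} << 43) | 0x1000;
constexpr std::uint64_t kOffset45 = (std::uint64_t{1} << 44) | 0x1000;

// A 44-bit offset survives the round trip.
static_assert(decode_offset(encode_offset(kOffset44)) == kOffset44, "ok");
// A 45-bit offset loses its top bit and decodes to a different offset.
static_assert(decode_offset(encode_offset(kOffset45)) == 0x1000, "MSB lost");
static_assert(decode_offset(encode_offset(kOffset45)) != kOffset45, "bad");
```

        In this model the decoded value differs from the original by exactly 2^44, so the marking code would follow an offset that does not belong to the array it was asked to scan.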

              asmehra Ashutosh Mehra