JDK-8135157

DMB elimination in AArch64 C2 synchronization implementation


    • Type: Enhancement
    • Resolution: Fixed
    • Priority: P4
    • Fix Version/s: 9
    • Affects Version/s: None
    • Component/s: hotspot
    • Labels: None
    • Resolved In Build: b84
    • CPU: aarch64
    • OS: generic

        In several places we emit DMB instructions unnecessarily.

        Wei Tang <wei.tang@linaro.org>:

        In the current AArch64 C2 compiler, a DMB instruction is inserted
        after lock acquire and before lock release: one DMB post-dominates
        the biased-lock, thin-lock, and inflated-lock acquisition blocks
        before entering the critical region, and another DMB dominates all
        the successor blocks of the critical region. On some paths the DMB
        is unnecessary and impacts performance.


        First of all, we think a CompareAndSwap implemented with
        load-acquire/store-release, as in the following sequence (already
        implemented in MacroAssembler::cmpxchgptr), is a sufficient barrier
        when used to acquire or release a lock, so no explicit DMB is
        needed. Please help review this point.


        cmpxchgptr(oldv, newv, addr) sequence:

        L_retry:
                ldaxr   tmp, addr
                cmp     tmp, oldv
                bne     L_nope
                stlxr   tmp, newv, addr
                cbzw    tmp, L_succeed
                b       L_retry
        L_nope:

        A similar code snippet can be found in the ARM® Architecture
        Reference Manual (DDI0487A_g, J10.3.1 "Acquiring a lock").
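
        The same pattern can be reproduced at the source level. The sketch
        below is a hypothetical illustration (not HotSpot code; the helper
        name is ours) of how GCC/Clang on AArch64 compile an
        acquire/release compare-and-swap down to exactly this LDAXR/STLXR
        loop, with no separate DMB:

                #include <cstdint>

                // Acquire/release CAS: with __ATOMIC_ACQ_REL the compiler
                // uses ldaxr for the exclusive load and stlxr for the
                // conditional store, i.e. the loop shown above.
                bool try_swap(volatile std::intptr_t* addr,
                              std::intptr_t oldv, std::intptr_t newv) {
                  std::intptr_t expected = oldv;
                  return __atomic_compare_exchange_n(addr, &expected, newv,
                                                     /*weak=*/false,
                                                     __ATOMIC_ACQ_REL,
                                                     __ATOMIC_ACQUIRE);
                }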



        *Paths 1 & 2 - biased locking/unlocking*

        Locking:

        Path 1 - When the lock object is biased to the current thread, the
        DMB is unnecessary, as the current thread already holds the lock.

        Path 2 - When the lock object is not biased to the current thread,
        rebias takes place:

        If UseOptoBiasInlining is true, rebias is implemented with
        StoreXConditional, which is mapped to aarch64_enc_cmpxchg in the
        aarch64.ad file.

        The ldxr instruction used in the CompareAndSwap sequence in
        aarch64_enc_cmpxchg has no barrier effect, so we create a new
        aarch64_enc_cmpxchg_acq. The change is the same as the DMB patch
        submitted recently (http://cr.openjdk.java.net/~adinn/8080293/webrev.00/):
        replace ldxr with ldaxr so the load serves as a barrier in the
        CompareAndSwap sequence.
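
        To illustrate the difference the _acq variant makes, here is a
        hedged source-level sketch (hypothetical helper names, not the
        actual aarch64.ad encodings): on AArch64, a relaxed CAS compiles
        to an ldxr-based loop with no barrier effect, while an acquire CAS
        compiles to an ldaxr-based loop:

                #include <cstdint>

                // Relaxed CAS: the exclusive load is a plain ldxr, so
                // there is no ordering guarantee.
                bool cas_relaxed(volatile std::intptr_t* a,
                                 std::intptr_t oldv, std::intptr_t newv) {
                  std::intptr_t e = oldv;
                  return __atomic_compare_exchange_n(a, &e, newv, false,
                                                     __ATOMIC_RELAXED,
                                                     __ATOMIC_RELAXED);
                }

                // Acquire CAS: the exclusive load becomes ldaxr, giving
                // the barrier effect aarch64_enc_cmpxchg_acq is after.
                bool cas_acquire(volatile std::intptr_t* a,
                                 std::intptr_t oldv, std::intptr_t newv) {
                  std::intptr_t e = oldv;
                  return __atomic_compare_exchange_n(a, &e, newv, false,
                                                     __ATOMIC_ACQUIRE,
                                                     __ATOMIC_ACQUIRE);
                }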

        If UseOptoBiasInlining is false, MacroAssembler::biased_locking_enter
        is invoked to acquire the lock. It already uses
        load-acquire/store-release and is safe without an explicit DMB.

        Unlocking:

        Biased unlocking is a no-op, so no special handling is needed.



        *Path 3 – Thin lock/unlock*

        Locking:

        Thin lock acquire is implemented in aarch64_enc_fast_lock. It uses
        a simple CAS sequence without generating any barrier, depending
        instead on the DMB generated by the membar_acquire_lock inserted in
        GraphKit::shared_lock. As described above, a
        load-acquire/store-release pair is sufficient to serve as a barrier
        in place of an explicit DMB, so we suggest using an ldaxr-stlxr
        pair as the following code shows:

        L_retry:
                ldaxr   r1, obj->markOOP        // change ldxr to ldaxr
                cmp     r1, unlocked_markword
                bne     thin_lock_fail
                stlxr   tmp, lock_record_address, obj->markOOP
                cbzw    tmp, L_cont
                b       L_retry
        L_cont:

        Unlocking:

        Thin lock release is implemented in aarch64_enc_fast_unlock, as the
        following code snippet shows. We think an ldxr-stlxr pair is enough
        for lock release, so no special handling is needed after removing
        the DMB.

        L_retry:
                ldxr    r1, obj->markOOP
                cmp     r1, lock_record_address
                bne     thin_lock_fail
                stlxr   tmp, disp_hdr, obj->markOOP
                cbnzw   tmp, L_retry
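
        As a hedged source-level sketch of this unlock path (hypothetical
        names, not the actual encoding): release ordering on the CAS keeps
        the exclusive load a plain ldxr while making the conditional store
        a stlxr, matching the snippet above:

                #include <cstdint>

                // Release CAS for unlock: compiles to an ldxr/stlxr loop
                // on AArch64, so no explicit DMB is required.
                bool release_cas(volatile std::intptr_t* markword,
                                 std::intptr_t lock_record,
                                 std::intptr_t disp_hdr) {
                  std::intptr_t e = lock_record;
                  return __atomic_compare_exchange_n(markword, &e, disp_hdr,
                                                     false, __ATOMIC_RELEASE,
                                                     __ATOMIC_RELAXED);
                }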


        *Path 4 - ObjectMonitor lock/unlock*

        Locking:

        For ObjectMonitor lock, C2 invokes the corresponding VM function
        SharedRuntime::complete_monitor_locking_C. Based on our
        investigation, all lock-acquire operations are achieved by
        Atomic::cmpxchg_ptr, which calls the native function
        __sync_val_compare_and_swap and expands to the following code. The
        load-acquire/store-release pair is sufficient to serve as a
        barrier.

        a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap

        .L3:
                ldaxr   x0, [x1]
                cmp     x0, x2
                bne     .L4
                stlxr   w4, x3, [x1]
                cbnz    w4, .L3
        .L4:
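
        For reference, a minimal sketch (function name ours) of the
        source-level form: the legacy __sync builtin is specified as a
        full barrier and, on AArch64, expands to the ldaxr/stlxr loop
        above:

                #include <cstdint>

                // Full-barrier CAS as used by Atomic::cmpxchg_ptr;
                // returns the value found at dest.
                std::intptr_t cmpxchg_ptr_sketch(std::intptr_t exchange_value,
                                                 volatile std::intptr_t* dest,
                                                 std::intptr_t compare_value) {
                  return __sync_val_compare_and_swap(dest, compare_value,
                                                     exchange_value);
                }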



        Unlocking:

        For ObjectMonitor unlock, C2 invokes the corresponding VM function
        SharedRuntime::complete_monitor_unlocking_C. All release operations
        are achieved by Atomic::cmpxchg_ptr and by
        OrderAccess::release_store_ptr + OrderAccess::storeload.
        Atomic::cmpxchg_ptr has been covered above. On AArch64,
        OrderAccess::release_store_ptr and OrderAccess::storeload map to
        stlr and DMB instructions respectively; these two are sufficient
        to serve as barriers during lock release.

        a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap

               Same as above

        b) OrderAccess::release_store_ptr calls __atomic_store

               stlr x1, [x0]

        c) OrderAccess::storeload() is the same as OrderAccess::fence();
        both call __sync_synchronize

               dmb ish
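
        A hedged sketch of these two primitives at the source level
        (wrapper names ours, assuming GCC/Clang on AArch64):

                #include <cstdint>

                // Store-release: compiles to stlr on AArch64.
                void release_store_ptr_sketch(volatile std::intptr_t* p,
                                              std::intptr_t v) {
                  __atomic_store_n(p, v, __ATOMIC_RELEASE);
                }

                // Full fence (storeload): compiles to dmb ish.
                void storeload_sketch() {
                  __sync_synchronize();
                }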

              Assignee: Andrew Haley (aph)
              Reporter: Andrew Haley (aph)