Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8141949 | emb-9 | Andrew Haley | P4 | Resolved | Fixed | team |
In several places we emit DMB instructions unnecessarily.
Wei Tang <wei.tang@linaro.org>:
In the current AArch64 C2 compiler, a DMB instruction is inserted after
lock acquire and before lock release.
One DMB post-dominates the biased lock, thin lock, and inflated lock
acquisition blocks before entering the critical region,
and another DMB dominates all the successor blocks of the critical region.
On some paths the DMB is unnecessary and hurts performance.
First of all, we think a CompareAndSwap implemented with
load-acquire/store-release, like the following sequence (already
implemented in MacroAssembler::cmpxchgptr), is sufficient to act as a
barrier when used to acquire or release a lock, so no explicit DMB is
needed. Please help review this point.
cmpxchgptr (oldv, newv, addr) sequence:
L_retry:
ldaxr tmp, addr
cmp tmp, oldv
bne L_nope
stlxr tmp, newv, addr
cbzw tmp, L_succeed
b L_retry
L_nope:
A similar code snippet can be found in the ARM® Architecture Reference
Manual (DDI0487A_g, J10.3.1 Acquiring a lock).
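As an illustration of this point, the same pattern can be written at the
C++ level with the GCC __atomic builtins. The sketch below is purely
illustrative (the SpinLock type and its members are hypothetical, not
HotSpot code); on AArch64 the acquire CAS typically compiles to an
ldaxr/stlxr loop and the release store to stlr, so the lock needs no
standalone DMB:

#include <cstdint>

struct SpinLock {
  intptr_t word = 0;                      // 0 = unlocked, 1 = locked

  void lock() {
    intptr_t expected = 0;
    // Acquire CAS: the acquire barrier comes from ldaxr, so no
    // trailing "dmb ish" is required after taking the lock.
    while (!__atomic_compare_exchange_n(&word, &expected, 1,
                                        /*weak=*/false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
      expected = 0;                       // a failed CAS overwrites 'expected'
    }
  }

  void unlock() {
    // Release store: orders all critical-section accesses before the
    // unlocked value becomes visible, again without a DMB.
    __atomic_store_n(&word, 0, __ATOMIC_RELEASE);
  }
};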
*Paths 1 & 2 - biased locking/unlocking*
Locking:
Path 1 - When the lock object is biased to the current thread, the DMB is
unnecessary as the current thread already holds the lock.
Path 2 - When the lock object is not biased to the current thread, rebias
takes place:
If UseOptoBiasInlining is true, rebias is implemented with
StoreXConditional, which is mapped to aarch64_enc_cmpxchg in the
aarch64.ad file. The ldxr instruction used in the CompareAndSwap sequence
of aarch64_enc_cmpxchg has no barrier effect, so we create a new
aarch64_enc_cmpxchg_acq. This change is the same as the DMB patch
submitted recently (http://cr.openjdk.java.net/~adinn/8080293/webrev.00/),
which replaces ldxr with ldaxr so that the CompareAndSwap sequence itself
serves as a barrier.
If UseOptoBiasInlining is false, MacroAssembler::biased_locking_enter is
invoked to acquire the lock. It already uses load-acquire/store-release
and is safe without an explicit DMB.
Unlocking:
Biased unlocking is a no-op, so no special handling is needed.
*Path 3 - Thin lock/unlock*
Locking:
Thin lock acquire is implemented in aarch64_enc_fast_lock; it uses a
simple CAS sequence that generates no barrier itself and relies on the DMB
generated by the membar_acquire_lock inserted in GraphKit::shared_lock. As
described above, a load-acquire/store-release pair is sufficient to serve
as a barrier instead of an explicit DMB, so we suggest using an
ldaxr/stlxr pair as the following code shows:
L_retry:
ldaxr r1, obj->markOOP // change ldxr to ldaxr
cmp r1, unlocked_markword
bne thin_lock_fail
stlxr tmp, lock_record_address, obj->markOOP
cbzw tmp, L_cont
b L_retry
L_cont:
Unlocking:
Thin lock release is implemented in aarch64_enc_fast_unlock, as the
following code snippet shows. We think an ldxr/stlxr pair is enough for
lock release and no special handling is needed after removing the DMB.
L_retry:
ldxr r1, obj->markOOP
cmp r1, lock_record_address
bne thin_lock_fail
stlxr tmp, disp_hdr, obj->markOOP // store-release the displaced header
cbnzw tmp, L_retry
*Path 4 - ObjectMonitor lock/unlock*
Locking:
ObjectMonitor lock invokes the corresponding VM function
SharedRuntime::complete_monitor_locking_C. Based on our investigation, all
lock acquire operations are achieved by Atomic::cmpxchg_ptr, which calls
the GCC builtin __sync_val_compare_and_swap and expands to the following
code. The load-acquire/store-release pair is sufficient to serve as a
barrier.
a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap
.L3:
ldaxr x0, [x1]
cmp x0, x2
bne .L4
stlxr w4, x3, [x1]
cbnz w4, .L3
.L4:
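For reference, a rough sketch of how a cmpxchg_ptr-style helper can sit
directly on top of this builtin is shown below; it is an illustration of
the pattern, not the actual HotSpot source, and the name
cmpxchg_ptr_sketch is hypothetical:

#include <cstdint>

// Returns the value previously held at *dest; the exchange succeeded only
// if that return value equals compare_value. With GCC on AArch64 this
// builtin expands to the ldaxr/stlxr loop shown above (.L3/.L4).
static inline intptr_t cmpxchg_ptr_sketch(intptr_t exchange_value,
                                          volatile intptr_t* dest,
                                          intptr_t compare_value) {
  return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
}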
Unlocking:
ObjectMonitor unlock invokes the corresponding VM function
SharedRuntime::complete_monitor_unlocking_C. All release operations are
achieved by Atomic::cmpxchg_ptr and by OrderAccess::release_store_ptr +
OrderAccess::storeload.
Atomic::cmpxchg_ptr has been covered above. On the AArch64 platform,
OrderAccess::release_store_ptr and OrderAccess::storeload map to stlr and
DMB instructions respectively. Those two are enough to serve as a barrier
during lock release.
a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap
Same as above
b) OrderAccess::release_store_ptr calls __atomic_store
stlr x1, [x0]
c) OrderAccess::storeload() is the same as OrderAccess::fence(); both call
__sync_synchronize
dmb ish
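The two release-side primitives can likewise be illustrated with the
builtins they map to. The sketch below is not HotSpot source (the *_sketch
names are hypothetical); it only mirrors items b) and c) above:

#include <cstdint>

// Release store: compiles to stlr on AArch64 (item b above).
static inline void release_store_ptr_sketch(volatile intptr_t* p, intptr_t v) {
  __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

// StoreLoad / full fence: compiles to "dmb ish" (item c above).
static inline void storeload_sketch() {
  __sync_synchronize();
}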
Backported by: JDK-8141949 DMB elimination in AArch64 C2 synchronization implementation (Resolved)