Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8141949 | emb-9 | Andrew Haley | P4 | Resolved | Fixed | team |
In several places we emit DMB instructions unnecessarily.
Wei Tang <wei.tang@linaro.org>:
In the current AArch64 C2 compiler, a DMB instruction is inserted after
lock acquire and before lock release.
One DMB post-dominates the biased lock, thin lock, and inflated lock
acquisition blocks before entering the critical region,
and another DMB dominates all the successor blocks of the critical region.
On some paths the DMB is unnecessary and hurts performance.
First of all, we think a CompareAndSwap implemented with
load-acquire/store-release, like the following sequence (already
implemented in MacroAssembler::cmpxchgptr), is sufficient to act as a
barrier when used to acquire or release a lock, so no explicit DMB is
needed. Please help review this point.
cmpxchgptr (oldv, newv, addr) sequence:
L_retry:
ldaxr tmp, addr
cmp tmp, oldv
bne L_nope
stlxr tmp, newv, addr
cbzw tmp, L_succeed
b L_retry
L_nope:
A similar code snippet can be found in the ARM® Architecture Reference
Manual (DDI0487A_g, J10.3.1 Acquiring a lock).
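As an illustration of this point, the same pattern can be written at the
C++ level with the GCC __atomic builtins. The sketch below is purely
illustrative (the SpinLock type and its members are hypothetical, not
HotSpot code); on AArch64 the acquire CAS typically compiles to an
ldaxr/stlxr loop and the release store to stlr, so the lock needs no
standalone DMB:

#include <cstdint>

struct SpinLock {
  intptr_t word = 0;                      // 0 = unlocked, 1 = locked

  void lock() {
    intptr_t expected = 0;
    // Acquire CAS: the acquire barrier comes from ldaxr, so no
    // trailing "dmb ish" is required after taking the lock.
    while (!__atomic_compare_exchange_n(&word, &expected, 1,
                                        /*weak=*/false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
      expected = 0;                       // a failed CAS overwrites 'expected'
    }
  }

  void unlock() {
    // Release store: orders all critical-section accesses before the
    // unlocked value becomes visible, again without a DMB.
    __atomic_store_n(&word, 0, __ATOMIC_RELEASE);
  }
};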
*Paths 1 & 2 - biased locking/unlocking*
Locking:
Path 1 - When the lock object is biased to the current thread, the DMB is
unnecessary as the current thread already holds the lock.
Path 2 - When the lock object is not biased to the current thread, rebias
takes place:
If UseOptoBiasInlining is true, rebias is implemented with
StoreXConditional, which is mapped to aarch64_enc_cmpxchg in the
aarch64.ad file. The ldxr instruction used in the CompareAndSwap sequence
of aarch64_enc_cmpxchg has no barrier effect, so we create a new
aarch64_enc_cmpxchg_acq. This change is the same as the DMB patch
submitted recently (http://cr.openjdk.java.net/~adinn/8080293/webrev.00/),
which replaces ldxr with ldaxr so that the CompareAndSwap sequence itself
serves as a barrier.
If UseOptoBiasInlining is false, MacroAssembler::biased_locking_enter is
invoked to acquire the lock. It already uses load-acquire/store-release
and is safe without an explicit DMB.
Unlocking:
Biased unlocking is a no-op, so no special handling is needed.
*Path 3 - Thin lock/unlock*
Locking:
Thin lock acquire is implemented in aarch64_enc_fast_lock; it uses a
simple CAS sequence that generates no barrier itself and relies on the DMB
generated by the membar_acquire_lock inserted in GraphKit::shared_lock. As
described above, a load-acquire/store-release pair is sufficient to serve
as a barrier instead of an explicit DMB, so we suggest using an
ldaxr/stlxr pair as the following code shows:
L_retry:
ldaxr r1, obj->markOOP // change ldxr to ldaxr
cmp r1, unlocked_markword
bne thin_lock_fail
stlxr tmp, lock_record_address, obj->markOOP
cbzw tmp, L_cont
b L_retry
L_cont:
Unlocking:
Thin lock release is implemented in aarch64_enc_fast_unlock, as the
following code snippet shows. We think an ldxr/stlxr pair is enough for
lock release and no special handling is needed after removing the DMB.
L_retry:
ldxr r1, obj->markOOP
cmp r1, lock_record_address
bne thin_lock_fail
stlxr tmp, disp_hdr, obj->markOOP // store-release the displaced header
cbnzw tmp, L_retry
*Path 4 - ObjectMonitor lock/unlock*
Locking:
ObjectMonitor lock invokes the corresponding VM function
SharedRuntime::complete_monitor_locking_C. Based on our investigation, all
lock acquire operations are achieved by Atomic::cmpxchg_ptr, which calls
the GCC builtin __sync_val_compare_and_swap and expands to the following
code. The load-acquire/store-release pair is sufficient to serve as a
barrier.
a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap
.L3:
ldaxr x0, [x1]
cmp x0, x2
bne .L4
stlxr w4, x3, [x1]
cbnz w4, .L3
.L4:
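For reference, a rough sketch of how a cmpxchg_ptr-style helper can sit
directly on top of this builtin is shown below; it is an illustration of
the pattern, not the actual HotSpot source, and the name
cmpxchg_ptr_sketch is hypothetical:

#include <cstdint>

// Returns the value previously held at *dest; the exchange succeeded only
// if that return value equals compare_value. With GCC on AArch64 this
// builtin expands to the ldaxr/stlxr loop shown above (.L3/.L4).
static inline intptr_t cmpxchg_ptr_sketch(intptr_t exchange_value,
                                          volatile intptr_t* dest,
                                          intptr_t compare_value) {
  return __sync_val_compare_and_swap(dest, compare_value, exchange_value);
}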
Unlocking:
ObjectMonitor unlock invokes the corresponding VM function
SharedRuntime::complete_monitor_unlocking_C. All release operations are
achieved by Atomic::cmpxchg_ptr and by OrderAccess::release_store_ptr +
OrderAccess::storeload.
Atomic::cmpxchg_ptr has been covered above. On the AArch64 platform,
OrderAccess::release_store_ptr and OrderAccess::storeload map to stlr and
DMB instructions respectively. Those two are enough to serve as a barrier
during lock release.
a) Atomic::cmpxchg_ptr calls __sync_val_compare_and_swap
Same as above
b) OrderAccess::release_store_ptr calls __atomic_store
stlr x1, [x0]
c) OrderAccess::storeload() is the same as OrderAccess::fence(); both call
__sync_synchronize
dmb ish
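The two release-side primitives can likewise be illustrated with the
builtins they map to. The sketch below is not HotSpot source (the *_sketch
names are hypothetical); it only mirrors items b) and c) above:

#include <cstdint>

// Release store: compiles to stlr on AArch64 (item b above).
static inline void release_store_ptr_sketch(volatile intptr_t* p, intptr_t v) {
  __atomic_store_n(p, v, __ATOMIC_RELEASE);
}

// StoreLoad / full fence: compiles to "dmb ish" (item c above).
static inline void storeload_sketch() {
  __sync_synchronize();
}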
Backported by: JDK-8141949 DMB elimination in AArch64 C2 synchronization implementation (Resolved)