- Enhancement
- Resolution: Fixed
- P3
- 11, 17
- b10
- aarch64
- generic
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8263876 | 11.0.12 | Andrew Haley | P3 | Resolved | Fixed | b01 |
Go back a few years, and there were simple atomic load/store exclusive
instructions on Arm. Say you want to do an atomic increment of a
counter. You'd do an atomic load to get the counter into your local cache
in exclusive state, increment that counter locally, then write that
incremented counter back to memory with an atomic store. All the time
that cache line was in exclusive state, so you're guaranteed that
no-one else changed anything on that cache line while you had it.
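
The load, increment, store-back sequence described above can be sketched in portable C++ as a retry loop. This is only an illustration, not the HotSpot code: `increment_with_retry` is a hypothetical name, and `compare_exchange_weak` stands in for the load-exclusive/store-exclusive pair (on AArch64 without LSE, compilers typically lower it to such a pair, but that is an assumption about the toolchain).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: increment a shared counter with a
// load / modify / conditional-store retry loop and return the new value.
std::uint64_t increment_with_retry(std::atomic<std::uint64_t>& counter) {
    std::uint64_t observed = counter.load(std::memory_order_relaxed);
    // The conditional store fails if another thread changed the counter
    // in the meantime (or spuriously). On failure `observed` is refreshed
    // with the current value, so the loop simply tries again.
    while (!counter.compare_exchange_weak(observed, observed + 1,
                                          std::memory_order_seq_cst,
                                          std::memory_order_relaxed)) {
    }
    return observed + 1;
}
```

Under contention, every failed attempt costs another round trip for the cache line, which is the scaling problem the description goes on to discuss.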
This is hard to scale on a very large system (e.g. Fugaku) because if
many processors are incrementing that counter you get a lot of cache
line ping-ponging between cores.
So, Arm decided to add a locked memory increment instruction that
works without needing to load an entire line into local cache. It's a
single instruction that loads, increments, and writes back. The secret
is to send a cache control message to whichever processor owns the
cache line containing the count, tell that processor to increment the
counter and return the incremented value. That way cache coherency
traffic is minimized. This new set of instructions is known as Large
System Extensions, or LSE.
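
For comparison, here is a minimal sketch of the same increment written as one atomic read-modify-write operation. The function name is hypothetical. When GCC or Clang targets Armv8.1-A or later (for example with `-march=armv8.1-a`), a `fetch_add` like this is typically compiled to a single instruction from the LSE `LDADD` family; that is an assumption about the compiler, not something stated in this issue.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: increment a shared counter with one atomic
// read-modify-write operation and return the new value.
std::uint64_t increment_single_op(std::atomic<std::uint64_t>& counter) {
    // fetch_add returns the value held *before* the addition,
    // so add 1 to report the incremented value.
    return counter.fetch_add(1, std::memory_order_seq_cst) + 1;
}
```

There is no retry loop in the source: the whole update is expressed as one operation that the hardware can carry out where the data lives.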
Unfortunately, on recent processors the "old" load/store exclusive
instructions sometimes perform very badly. Therefore, it's now
necessary for software to detect which version of Arm it's running
on, and use the "new" LSE instructions if they're available. Otherwise
performance can be very poor under heavy contention.
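
One way to do the detection on Linux is to ask the kernel which CPU features it reports. The sketch below assumes Linux on AArch64, where `getauxval(AT_HWCAP)` exposes an `HWCAP_ATOMICS` bit for LSE; `cpu_has_lse` is a hypothetical name, other platforms need their own query, and this is not necessarily how HotSpot performs the check.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ATOMICS
#endif

// Hypothetical helper: report whether the running CPU advertises the
// LSE atomic instructions.
bool cpu_has_lse() {
#if defined(__linux__) && defined(__aarch64__)
    // The kernel sets HWCAP_ATOMICS in the auxiliary vector when the
    // CPU implements LSE.
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    // Assumption for this sketch: anywhere else, report "not available"
    // so callers fall back to the conservative path.
    return false;
#endif
}
```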
GCC's -moutline-atomics does this by providing library calls which use
LSE if it's available, but this option is only provided on newer
versions of GCC. This is particularly problematic with older versions
of OpenJDK, which build using old GCC versions.
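
The general idea of routing atomics through calls that pick an implementation at run time can be sketched as follows. This is not GCC's outline-atomics implementation and not the HotSpot change itself. Every name here is hypothetical, and both implementations are portable stand-ins: in a real build one would be a load/store exclusive loop and the other an LSE instruction.

```cpp
#include <atomic>
#include <cstdint>

using FetchAddFn = std::uint64_t (*)(std::atomic<std::uint64_t>*, std::uint64_t);

// Stand-in for a stub built on load/store exclusive: a retry loop.
std::uint64_t fetch_add_retry_loop(std::atomic<std::uint64_t>* p, std::uint64_t v) {
    std::uint64_t old = p->load(std::memory_order_relaxed);
    while (!p->compare_exchange_weak(old, old + v)) {
        // `old` was refreshed with the current value; try again.
    }
    return old;  // value before the addition
}

// Stand-in for a stub built on an LSE instruction: one atomic operation.
std::uint64_t fetch_add_single_op(std::atomic<std::uint64_t>* p, std::uint64_t v) {
    return p->fetch_add(v);  // value before the addition
}

// Callers always go through this pointer. It starts at the conservative
// implementation, which is correct on every CPU.
FetchAddFn fetch_add_impl = fetch_add_retry_loop;

// Called once during startup with the result of CPU feature detection.
void init_atomics(bool has_lse) {
    fetch_add_impl = has_lse ? fetch_add_single_op : fetch_add_retry_loop;
}
```

Because the choice is made once and hidden behind a pointer, the calling code does not depend on the compiler version used for the build, which is the point for older toolchains.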
Also, I suspect that some other operating systems could use this.
Perhaps not macOS, given that all Apple CPUs support LSE, but
maybe Windows.
- backported by
  - JDK-8263876 AArch64: Support for LSE atomics C++ HotSpot code (Resolved)
- blocks
  - JDK-8261649 AArch64: Optimize LSE atomics in C++ code (Resolved)
- relates to
  - JDK-8261659 JDK-8261027 causes a Tier1 validate-source failure (Closed)
  - JDK-8263541 Potential race in 8261027: AArch64: Support for LSE atomics C++ HotSpot code (Closed)
  - JDK-8261649 AArch64: Optimize LSE atomics in C++ code (Resolved)
  - JDK-8261660 AArch64: Race condition in stub code generation for LSE Atomics (Closed)