Type: Bug
Resolution: Unresolved
Priority: P4
Fix Version/s: 27
Affects Version/s: 19
Component/s: hotspot
Labels:
Environment: Pre AMD Zen 3
Subcomponent: compiler
CPU: x86

// Received via hotspot-compiler-dev@openjdk.org

Hi,

today I stumbled upon a performance issue with the Long.compress/expand and
Integer.compress/expand intrinsics on certain AMD processors. I discovered
this while working on an optimized varint decoder where I was hoping to use
Long.compress() to speed up bit extraction. Instead, I found my "optimized"
version was slower than my naive loop-based implementation. After some
digging, I believe I understand what's happening.

**Background**

The compress and expand methods (added in JDK 19 via JDK-8283893 [1]) are
intrinsified by C2 to use the BMI2 PEXT and PDEP instructions when the CPU
reports BMI2 support.
This works great on Intel Haswell+ and AMD Zen 3+, where these instructions
execute in dedicated hardware with approximately 3-cycle latency.
However, AMD processors from Excavator up to, but not including, Zen 3
implement PEXT/PDEP via microcode emulation rather than native hardware.
This is confirmed by AMD's Software Optimization Guide for Family 19h
Processors [2], Section 2.10.2, which states that Zen 3 has native ALU
support for these instructions.
Wikipedia's page on x86 Bit Manipulation Instruction Sets [3] also
documents this behavior:

> AMD processors before Zen 3 that implement PDEP and PEXT do so in
> microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a
> result it is often faster to use other instructions on these processors.
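
For readers unfamiliar with the two operations, the following is a minimal bit-by-bit sketch of what compress (PEXT) and expand (PDEP) compute. The class and method names are made up for illustration. This is neither the JDK's software fallback nor the naive decoder loop mentioned above, only a reference model of the semantics.

```java
// Illustrative reference model only; names are hypothetical, not JDK code.
final class BitCompressSketch {

    // PEXT semantics: gather the bits of value selected by mask and pack
    // them contiguously, starting at the least significant bit.
    static long compress(long value, long mask) {
        long result = 0;
        int out = 0; // next free bit position in the result
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if (((mask >>> bit) & 1L) != 0) {
                result |= ((value >>> bit) & 1L) << out;
                out++;
            }
        }
        return result;
    }

    // PDEP semantics: scatter the low bits of value to the positions
    // selected by mask; all other result bits stay zero.
    static long expand(long value, long mask) {
        long result = 0;
        int in = 0; // next bit of value to consume
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if (((mask >>> bit) & 1L) != 0) {
                result |= ((value >>> in) & 1L) << bit;
                in++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long mask = 0xFF00FFF0L;
        System.out.println(Long.toHexString(compress(0xCAFEBABEL, mask))); // cabab
        System.out.println(Long.toHexString(expand(0xCABABL, mask)));      // ca00bab0
    }
}
```

On hardware with native support, each of these 64-iteration loops collapses into a single PEXT or PDEP instruction, which is why the intrinsic is normally a large win.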

**Reproducer**

Here is a JMH benchmark that demonstrates the issue by comparing the
intrinsified path against the software fallback using ControlIntrinsic
flags:

```
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Benchmark)
public class PextPdepPerformanceBug {
    // I'm not using constants to prevent constant folding
    private long longValue;
    private long longMask;
    private int intValue;
    private int intMask;

    @Setup(Level.Iteration)
    public void setup() {
        var rng = ThreadLocalRandom.current();
        longValue = rng.nextLong();
        longMask = rng.nextLong();
        intValue = rng.nextInt();
        intMask = rng.nextInt();
    }

    // Long.compress (PEXT 64-bit)

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=-_compress_l",
        "-Xcomp"
    })
    public long compressLongSoftware() {
        return Long.compress(longValue, longMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=+_compress_l",
        "-Xcomp"
    })
    public long compressLongIntrinsic() {
        return Long.compress(longValue, longMask);
    }

    // Long.expand (PDEP 64-bit)

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=-_expand_l",
        "-Xcomp"
    })
    public long expandLongSoftware() {
        return Long.expand(longValue, longMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=+_expand_l",
        "-Xcomp"
    })
    public long expandLongIntrinsic() {
        return Long.expand(longValue, longMask);
    }

    // Integer.compress (PEXT 32-bit)

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=-_compress_i",
        "-Xcomp"
    })
    public int compressIntSoftware() {
        return Integer.compress(intValue, intMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=+_compress_i",
        "-Xcomp"
    })
    public int compressIntIntrinsic() {
        return Integer.compress(intValue, intMask);
    }

    // Integer.expand (PDEP 32-bit)

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=-_expand_i",
        "-Xcomp"
    })
    public int expandIntSoftware() {
        return Integer.expand(intValue, intMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
        "-XX:+UnlockDiagnosticVMOptions",
        "-XX:ControlIntrinsic=+_expand_i",
        "-Xcomp"
    })
    public int expandIntIntrinsic() {
        return Integer.expand(intValue, intMask);
    }
}
```

Here are the results on an i7 9700K, which supports the BMI2 instruction
set and is not affected by this issue:
```
Benchmark                                     Mode  Cnt   Score    Error  Units
PextPdepPerformanceBug.compressIntIntrinsic   avgt   10   0,545 ±  0,002  ns/op
PextPdepPerformanceBug.compressIntSoftware    avgt   10  11,357 ±  0,033  ns/op
PextPdepPerformanceBug.compressLongIntrinsic  avgt   10   0,552 ±  0,012  ns/op
PextPdepPerformanceBug.compressLongSoftware   avgt   10  16,197 ±  0,203  ns/op
PextPdepPerformanceBug.expandIntIntrinsic     avgt   10   0,546 ±  0,006  ns/op
PextPdepPerformanceBug.expandIntSoftware      avgt   10  12,179 ±  0,457  ns/op
PextPdepPerformanceBug.expandLongIntrinsic    avgt   10   0,548 ±  0,018  ns/op
PextPdepPerformanceBug.expandLongSoftware     avgt   10  17,658 ±  0,534  ns/op
```

And here are the results on a Ryzen 7 2700, which also supports the BMI2
instruction set but is affected by this issue:
```
Benchmark                                     Mode  Cnt   Score    Error  Units
PextPdepPerformanceBug.compressIntIntrinsic   avgt   10  28.010 ±  9.929  ns/op
PextPdepPerformanceBug.compressIntSoftware    avgt   10  20.008 ±  2.129  ns/op
PextPdepPerformanceBug.compressLongIntrinsic  avgt   10  48.999 ±  8.468  ns/op
PextPdepPerformanceBug.compressLongSoftware   avgt   10  28.638 ±  5.336  ns/op
PextPdepPerformanceBug.expandIntIntrinsic     avgt   10  24.860 ±  6.784  ns/op
PextPdepPerformanceBug.expandIntSoftware      avgt   10  19.277 ±  1.719  ns/op
PextPdepPerformanceBug.expandLongIntrinsic    avgt   10  43.889 ± 10.575  ns/op
PextPdepPerformanceBug.expandLongSoftware     avgt   10  27.350 ±  1.898  ns/op
```

**Precedent and Scope**

A similar issue was reported in JDK-8334474 [4], where the compress/expand
intrinsics were disabled on RISC-V because the vectorized implementation
caused regressions compared to the pure-Java fallback.
This led me to investigate whether other JDK intrinsics relying on BMI2
instructions might be affected.
The good news is that, as stated before, PEXT and PDEP are the only BMI2
instructions that AMD implemented via microcode on pre-Zen 3 processors:
the others execute efficiently on all BMI2-capable hardware.
I also verified that no other JDK methods use PEXT/PDEP, so the four
methods covered in this report (Long.compress, Long.expand,
Integer.compress, Integer.expand) should be the only ones affected.
It's worth verifying this though as the JDK is very large and I could have
missed such examples.

**Mitigation**

The intrinsic selection logic should check both BMI2 support and CPU
vendor/family.
Specifically, disable these intrinsics when the CPU vendor is AMD and the
family is less than 0x19 (Zen 3).
I think this could be implemented in x86.ad [5], alongside the existing
BMI2 check, but I'm not familiar with C2's source code.
Still, I would be happy to work on a fix myself once the issue is
confirmed, if that is acceptable.
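
The proposed condition can be sketched as a small predicate. This is only a Java model of the decision logic, not HotSpot code: the class, method, and parameter names are hypothetical, and a real fix would live in the VM's C++/AD sources and rely on its own CPU feature detection.

```java
// Hypothetical model of the proposed check; not actual HotSpot code.
final class PextPdepPolicySketch {

    // AMD family 0x19 corresponds to Zen 3, per the report above.
    static final int AMD_ZEN3_FAMILY = 0x19;

    // Returns true when the PEXT/PDEP-based intrinsics should be used.
    static boolean useCompressExpandIntrinsics(boolean hasBmi2,
                                               boolean isAmd,
                                               int cpuFamily) {
        if (!hasBmi2) {
            return false; // no PEXT/PDEP at all: fall back to the Java code
        }
        if (isAmd && cpuFamily < AMD_ZEN3_FAMILY) {
            return false; // microcoded PEXT/PDEP: fall back to the Java code
        }
        return true;      // native PEXT/PDEP
    }

    public static void main(String[] args) {
        // Family 0x17 covers the Ryzen 7 2700 (Zen+) used in the benchmark.
        System.out.println(useCompressExpandIntrinsics(true, true, 0x17));  // false
        System.out.println(useCompressExpandIntrinsics(true, true, 0x19));  // true
        System.out.println(useCompressExpandIntrinsics(true, false, 0x06)); // true
    }
}
```

Keeping the BMI2 check first means the vendor/family test only narrows an already-supported case, so CPUs from other vendors are unaffected.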

caused by: JDK-8283894 Intrinsify compress and expand bits on x86 (Resolved)

relates to: JDK-8334474 RISC-V: verify perf of ExpandBits/CompressBits (rvv) (Resolved)

links to: Review(master) openjdk/jdk/29809
