PEXT/PDEP intrinsics cause performance regression on AMD pre-Zen 3 CPUs

      // Received via hotspot-compiler-dev@openjdk.org

      Hi,

      today I stumbled upon a performance issue with the Long.compress/expand and
      Integer.compress/expand intrinsics on certain AMD processors. I discovered
      this while working on an optimized varint decoder where I was hoping to use
      Long.compress() to speed up bit extraction. Instead, I found my "optimized"
      version was slower than my naive loop-based implementation. After some
      digging, I believe I understand what's happening.

      **Background**

      The compress and expand methods (added in JDK 19 via JDK-8283893 [1]) are
      intrinsified by C2 to use the BMI2 PEXT and PDEP instructions when the CPU
      reports BMI2 support.
      This works great on Intel Haswell+ and AMD Zen 3+, where these instructions
      execute in dedicated hardware with approximately 3-cycle latency.
      However, AMD processors from Excavator up to, but not including, Zen 3
      implement PEXT/PDEP in microcode rather than in dedicated hardware.
      This is confirmed by AMD's Software Optimization Guide for Family 19h
      Processors [2], Section 2.10.2, which states that Zen 3 has native ALU
      support for these instructions.
      Wikipedia's page on x86 Bit Manipulation Instruction Sets [3] also
      documents this behavior:

      > AMD processors before Zen 3 that implement PDEP and PEXT do so in
      > microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a
      > result it is often faster to use other instructions on these processors.
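
      For readers unfamiliar with the two operations, here is a minimal
      plain-Java sketch of what compress (PEXT) and expand (PDEP) compute,
      written as simple loops. The class and method names are made up for
      illustration, and this is not the JDK's actual fallback code. The sample
      values follow the example given in the Integer.compress/expand Javadoc.

      ```java
      final class BitGatherScatterSketch {
          /** PEXT-like: gathers the bits of value selected by mask into the low bits. */
          static long compressNaive(long value, long mask) {
              long result = 0;
              int out = 0; // next free bit position in the result
              for (int bit = 0; bit < 64; bit++) {
                  if (((mask >>> bit) & 1L) != 0) {
                      result |= ((value >>> bit) & 1L) << out;
                      out++;
                  }
              }
              return result;
          }

          /** PDEP-like: scatters the low bits of value to the positions set in mask. */
          static long expandNaive(long value, long mask) {
              long result = 0;
              int in = 0; // next bit of value to consume
              for (int bit = 0; bit < 64; bit++) {
                  if (((mask >>> bit) & 1L) != 0) {
                      result |= ((value >>> in) & 1L) << bit;
                      in++;
                  }
              }
              return result;
          }

          public static void main(String[] args) {
              long mask = 0xFF00FFF0L;
              // Prints "cabab": the masked nibbles C, A, B, A, B packed together.
              System.out.println(Long.toHexString(compressNaive(0xCAFEBABEL, mask)));
              // Prints "ca00bab0": the same nibbles put back at the mask positions.
              System.out.println(Long.toHexString(expandNaive(0xCABABL, mask)));
          }
      }
      ```

      On JDK 19+, Long.compress and Long.expand should return the same values
      for these inputs. The intrinsic replaces the whole loop with a single
      instruction.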

      **Reproducer**

      Here is a JMH benchmark that demonstrates the issue by comparing the
      intrinsified path against the software fallback using ControlIntrinsic
      flags:

      ```
      import org.openjdk.jmh.annotations.*;

      import java.util.concurrent.ThreadLocalRandom;
      import java.util.concurrent.TimeUnit;

      @BenchmarkMode(Mode.AverageTime)
      @OutputTimeUnit(TimeUnit.NANOSECONDS)
      @Warmup(iterations = 5, time = 1)
      @Measurement(iterations = 5, time = 1)
      @State(Scope.Benchmark)
      public class PextPdepPerformanceBug {
          // Non-constant fields, so that the JIT cannot constant-fold the calls
          private long longValue;
          private long longMask;
          private int intValue;
          private int intMask;

          @Setup(Level.Iteration)
          public void setup() {
              var rng = ThreadLocalRandom.current();
              longValue = rng.nextLong();
              longMask = rng.nextLong();
              intValue = rng.nextInt();
              intMask = rng.nextInt();
          }

          // Long.compress (PEXT 64-bit)

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=-_compress_l",
              "-Xcomp"
          })
          public long compressLongSoftware() {
              return Long.compress(longValue, longMask);
          }

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=+_compress_l",
              "-Xcomp"
          })
          public long compressLongIntrinsic() {
              return Long.compress(longValue, longMask);
          }

          // Long.expand (PDEP 64-bit)

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=-_expand_l",
              "-Xcomp"
          })
          public long expandLongSoftware() {
              return Long.expand(longValue, longMask);
          }

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=+_expand_l",
              "-Xcomp"
          })
          public long expandLongIntrinsic() {
              return Long.expand(longValue, longMask);
          }

          // Integer.compress (PEXT 32-bit)

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=-_compress_i",
              "-Xcomp"
          })
          public int compressIntSoftware() {
              return Integer.compress(intValue, intMask);
          }

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=+_compress_i",
              "-Xcomp"
          })
          public int compressIntIntrinsic() {
              return Integer.compress(intValue, intMask);
          }

          // Integer.expand (PDEP 32-bit)

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=-_expand_i",
              "-Xcomp"
          })
          public int expandIntSoftware() {
              return Integer.expand(intValue, intMask);
          }

          @Benchmark
          @Fork(value = 2, jvmArgsAppend = {
              "-XX:+UnlockDiagnosticVMOptions",
              "-XX:ControlIntrinsic=+_expand_i",
              "-Xcomp"
          })
          public int expandIntIntrinsic() {
              return Integer.expand(intValue, intMask);
          }
      }
      ```

      Here are the results on an i7 9700K, which supports the BMI2 instruction
      set and is not affected by this issue:
      ```
      Benchmark                                     Mode  Cnt   Score    Error  Units
      PextPdepPerformanceBug.compressIntIntrinsic   avgt   10   0.545 ±  0.002  ns/op
      PextPdepPerformanceBug.compressIntSoftware    avgt   10  11.357 ±  0.033  ns/op
      PextPdepPerformanceBug.compressLongIntrinsic  avgt   10   0.552 ±  0.012  ns/op
      PextPdepPerformanceBug.compressLongSoftware   avgt   10  16.197 ±  0.203  ns/op
      PextPdepPerformanceBug.expandIntIntrinsic     avgt   10   0.546 ±  0.006  ns/op
      PextPdepPerformanceBug.expandIntSoftware      avgt   10  12.179 ±  0.457  ns/op
      PextPdepPerformanceBug.expandLongIntrinsic    avgt   10   0.548 ±  0.018  ns/op
      PextPdepPerformanceBug.expandLongSoftware     avgt   10  17.658 ±  0.534  ns/op
      ```

      And here are the results on a Ryzen 7 2700, which also supports the BMI2
      instruction set but is affected by this issue:
      ```
      Benchmark                                     Mode  Cnt   Score    Error  Units
      PextPdepPerformanceBug.compressIntIntrinsic   avgt   10  28.010 ±  9.929  ns/op
      PextPdepPerformanceBug.compressIntSoftware    avgt   10  20.008 ±  2.129  ns/op
      PextPdepPerformanceBug.compressLongIntrinsic  avgt   10  48.999 ±  8.468  ns/op
      PextPdepPerformanceBug.compressLongSoftware   avgt   10  28.638 ±  5.336  ns/op
      PextPdepPerformanceBug.expandIntIntrinsic     avgt   10  24.860 ±  6.784  ns/op
      PextPdepPerformanceBug.expandIntSoftware      avgt   10  19.277 ±  1.719  ns/op
      PextPdepPerformanceBug.expandLongIntrinsic    avgt   10  43.889 ±  10.575  ns/op
      PextPdepPerformanceBug.expandLongSoftware     avgt   10  27.350 ±  1.898  ns/op
      ```
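
      To put the Ryzen numbers in proportion, the small sketch below divides
      each intrinsic score by its software counterpart, using only the scores
      reported above. The class and method names are hypothetical. With these
      inputs the intrinsic path comes out roughly 1.3x to 1.7x slower, although
      the wide error bars on the intrinsic runs mean the ratios are only
      approximate.

      ```java
      import java.util.Locale;

      final class SlowdownRatios {
          /** Intrinsic time divided by software time; above 1.0 means the intrinsic is slower. */
          static double ratio(double intrinsicNs, double softwareNs) {
              return intrinsicNs / softwareNs;
          }

          public static void main(String[] args) {
              // Scores (ns/op) copied from the Ryzen 7 2700 results above.
              String[] names = {"compressInt", "compressLong", "expandInt", "expandLong"};
              double[] intrinsic = {28.010, 48.999, 24.860, 43.889};
              double[] software = {20.008, 28.638, 19.277, 27.350};

              for (int i = 0; i < names.length; i++) {
                  System.out.println(String.format(Locale.ROOT, "%-13s %.2fx",
                          names[i], ratio(intrinsic[i], software[i])));
              }
          }
      }
      ```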

      **Precedent and Scope**

      A similar issue was reported in JDK-8334474 [4], where the compress/expand
      intrinsics were disabled on RISC-V because the vectorized implementation
      caused regressions compared to the pure-Java fallback.
      This led me to investigate whether other JDK intrinsics relying on BMI2
      instructions might be affected.
      The good news is that, as far as I can tell from the sources above, PEXT
      and PDEP are the only BMI2 instructions that AMD implemented in microcode
      on pre-Zen 3 processors; the others execute efficiently on all
      BMI2-capable hardware.
      I also verified that no other JDK methods use PEXT/PDEP, so the four
      methods covered in this report (Long.compress, Long.expand,
      Integer.compress, Integer.expand) should be the only ones affected.
      This is worth double-checking, though: the JDK is very large and I may
      have missed other uses.

      **Mitigation**

      The intrinsic selection logic should check both BMI2 support and CPU
      vendor/family.
      Specifically, disable these intrinsics when the CPU vendor is AMD and the
      family is less than 0x19 (Zen 3).
      I think this could be implemented in x86.ad [5], alongside the existing
      BMI2 check, but I'm not familiar with C2's source code.
      Still, I would be happy to work on a fix myself once the issue is
      confirmed, if that is acceptable.
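
      The decision I have in mind can be modeled as below. This is only a Java
      sketch of the logic, not HotSpot code: the real check would live in the
      VM's C++ sources, and every name here (PextPdepHeuristicSketch,
      displayFamily, shouldUsePextPdep) is hypothetical. The family decoding
      follows the usual CPUID leaf 1 rule (base family plus extended family
      when the base family is 0xF), and the EAX values in main are illustrative
      signatures only.

      ```java
      final class PextPdepHeuristicSketch {
          /** First AMD family assumed to have fast PEXT/PDEP (Family 19h, Zen 3). */
          static final int ZEN3_FAMILY = 0x19;

          /** Decodes the display family from the EAX value of CPUID leaf 1. */
          static int displayFamily(int cpuidEax) {
              int base = (cpuidEax >>> 8) & 0xF;
              int extended = (cpuidEax >>> 20) & 0xFF;
              return base == 0xF ? base + extended : base;
          }

          /** Returns true if the PEXT/PDEP intrinsics look worthwhile on this CPU. */
          static boolean shouldUsePextPdep(boolean hasBmi2, String vendor, int family) {
              if (!hasBmi2) {
                  return false; // no BMI2, nothing to intrinsify with
              }
              boolean microcodedAmd = "AuthenticAMD".equals(vendor) && family < ZEN3_FAMILY;
              return !microcodedAmd;
          }

          public static void main(String[] args) {
              int preZen3 = displayFamily(0x00800F82); // illustrative family 0x17 signature
              int zen3 = displayFamily(0x00A20F10);    // illustrative family 0x19 signature
              int intel = displayFamily(0x000906ED);   // illustrative family 6 signature

              System.out.println(shouldUsePextPdep(true, "AuthenticAMD", preZen3)); // false
              System.out.println(shouldUsePextPdep(true, "AuthenticAMD", zen3));    // true
              System.out.println(shouldUsePextPdep(true, "GenuineIntel", intel));   // true
          }
      }
      ```

      Until such a check exists, affected users can fall back to the software
      path with the same ControlIntrinsic flags used in the benchmark above.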

            Assignee: Unassigned
            Reporter: Galder Zamarreño