JDK-8323609

C2: Odd vectorization breakage with DBB.getLong loop


      Spotted this during related performance work. If you run the current ByteBuffer microbenchmarks, one of them stands out:

      ```
      % CONF=linux-x86_64-server-release make images test TEST="micro:ByteBuffers.testDirect.*Long" MICRO="FORK=1;OPTIONS=-p size=131072"

      ByteBuffers.testDirectLoopGetLong: 1904.220 +- 0.555 ns/op
      ByteBuffers.testDirectLoopGetLongRO: 1914.562 +- 7.225 ns/op
      ByteBuffers.testDirectLoopGetLongSwap: 4839.337 +- 2.398 ns/op <---- !!!
      ByteBuffers.testDirectLoopGetLongSwapRO: 1902.759 +- 0.812 ns/op
      ByteBuffers.testDirectLoopPutLong: 2068.266 +- 2.197 ns/op
      ByteBuffers.testDirectLoopPutLongSwap: 2104.532 +- 2.153 ns/op
      ```

      testDirectLoopGetLongSwap is way out of band, roughly 2.5x slower than the other variants.

      Perfasm shows that in the bad case the loop has not been auto-vectorized: there is a sequence of 8-byte reads and adds. The good cases are all auto-vectorized with 256-bit reads. Even funkier, the bad case gets "repaired" when one asks for the read-only (RO) version of it, see testDirectLoopGetLongSwapRO!

      (Note that "swap" is misleading: it "swaps" the default big-endian BB to little-endian, which matches x86.)
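For readers without the benchmark source at hand, the loop shape in question can be sketched roughly as below. This is not the actual ByteBuffers benchmark code: the class name `DirectGetLongSketch` and the helpers `sumLongs` and `filled` are made up for illustration, and the sketch only assumes that the benchmark sums absolute `getLong(i)` reads over a direct buffer. It shows the four variants (default order, "swap" to little-endian, and the read-only view of each), not their performance.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DirectGetLongSketch {

    // Hypothetical stand-in for the benchmark body: sum every long in the
    // buffer using absolute reads, stepping 8 bytes at a time.
    static long sumLongs(ByteBuffer bb) {
        long r = 0;
        for (int i = 0; i < bb.capacity(); i += Long.BYTES) {
            r += bb.getLong(i);
        }
        return r;
    }

    // Hypothetical helper: allocate a direct buffer with the given byte
    // order, store the value i at each offset i, and optionally return a
    // read-only view of it.
    static ByteBuffer filled(int size, ByteOrder order, boolean readOnly) {
        ByteBuffer bb = ByteBuffer.allocateDirect(size).order(order);
        for (int i = 0; i < size; i += Long.BYTES) {
            bb.putLong(i, i);
        }
        // asReadOnlyBuffer() yields a BIG_ENDIAN view, so the requested
        // order has to be applied again.
        return readOnly ? bb.asReadOnlyBuffer().order(order) : bb;
    }

    public static void main(String[] args) {
        int size = 131072; // same size as in the benchmark run above
        // Default big-endian buffer ("plain") and little-endian ("swap"),
        // each in writable and read-only form.
        System.out.println(sumLongs(filled(size, ByteOrder.BIG_ENDIAN, false)));
        System.out.println(sumLongs(filled(size, ByteOrder.BIG_ENDIAN, true)));
        System.out.println(sumLongs(filled(size, ByteOrder.LITTLE_ENDIAN, false)));
        System.out.println(sumLongs(filled(size, ByteOrder.LITTLE_ENDIAN, true)));
    }
}
```

All four calls compute the same sum, since each buffer is written and read with the same byte order. Timing differences between them would have to be measured with JMH, as in the run above, not with this sketch.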

      This reliably reproduces on a Xeon Platinum 8124M. I have not investigated deeply (at least not yet).

            epeter Emanuel Peter
            shade Aleksey Shipilev