Type: Bug
Resolution: Duplicate
Priority: P4
Fix Version/s: tbd
Affects Version/s: 23
Component/s: hotspot
Subcomponent: compiler

Spotted this during related performance work. If you run the current ByteBuffer microbenchmarks, one of them stands out:

```
% CONF=linux-x86_64-server-release make images test TEST="micro:ByteBuffers.testDirect.*Long" MICRO="FORK=1;OPTIONS=-p size=131072"

ByteBuffers.testDirectLoopGetLong: 1904.220 +- 0.555 ns/op
ByteBuffers.testDirectLoopGetLongRO: 1914.562 +- 7.225 ns/op
ByteBuffers.testDirectLoopGetLongSwap: 4839.337 +- 2.398 ns/op <---- !!!
ByteBuffers.testDirectLoopGetLongSwapRO: 1902.759 +- 0.812 ns/op
ByteBuffers.testDirectLoopPutLong: 2068.266 +- 2.197 ns/op
ByteBuffers.testDirectLoopPutLongSwap: 2104.532 +- 2.153 ns/op
```

testDirectLoopGetLongSwap is way out of band: at about 4839 ns/op it is roughly 2.5x slower than the other variants.

Perfasm shows that in the bad case the loop is not auto-vectorized: there is a sequence of 8-byte reads and adds. The good cases are all auto-vectorized with 256-bit reads. Even funkier, the bad case gets "repaired" when one asks for the read-only (RO) version of it; see testDirectLoopGetLongSwapRO!
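For readers without the benchmark source at hand, the loop shape in question is presumably a sum reduction over absolute getLong reads from a direct buffer. The following is a minimal sketch under that assumption, not the actual ByteBuffers benchmark code: the class name DirectLongSum and the helper sumLongs are hypothetical, and only the buffer size (131072) is taken from the command line above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the benchmark loops; not the real JMH code.
final class DirectLongSum {
    // Sums every 8-byte long in the buffer using absolute (index-based) reads.
    static long sumLongs(ByteBuffer bb) {
        long sum = 0;
        for (int i = 0; i + Long.BYTES <= bb.capacity(); i += Long.BYTES) {
            sum += bb.getLong(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Size taken from the "-p size=131072" option in the report.
        ByteBuffer direct = ByteBuffer.allocateDirect(131072);
        for (int i = 0; i < direct.capacity(); i += Long.BYTES) {
            direct.putLong(i, i);
        }

        // "Swap" variant: same memory, read as little-endian.
        // duplicate() and asReadOnlyBuffer() reset the order to BIG_ENDIAN,
        // so the order is set explicitly on each view.
        ByteBuffer swapped = direct.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer swappedRO = swapped.asReadOnlyBuffer().order(ByteOrder.LITTLE_ENDIAN);

        // Both views read the same bytes, so the sums must agree.
        System.out.println(sumLongs(swapped) == sumLongs(swappedRO));
    }
}
```

This only shows the shape of the loop and the relationship between the writable and read-only views. Whether C2 vectorizes a given variant cannot be seen from the Java source; that needs the perfasm output from the benchmark run.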

(Note that "swap" is misleading: it "swaps" the default big-endian BB to little-endian, which matches the native byte order on x86.)
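The byte-order point can be checked directly. This is a small sketch, with ByteOrderDemo and readAs as hypothetical names: it writes one long through a buffer in the default order and reads it back in the requested order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical demo class; illustrates ByteBuffer byte order only.
final class ByteOrderDemo {
    // Writes a fixed pattern in the default (big-endian) order,
    // then reads the same 8 bytes back in the requested order.
    static long readAs(ByteOrder order) {
        ByteBuffer bb = ByteBuffer.allocateDirect(Long.BYTES); // starts as BIG_ENDIAN
        bb.putLong(0, 0x0102030405060708L);
        return bb.order(order).getLong(0);
    }

    public static void main(String[] args) {
        System.out.println("default order: " + ByteBuffer.allocateDirect(Long.BYTES).order());
        System.out.println("native order:  " + ByteOrder.nativeOrder());
        System.out.printf("big-endian read:    %016x%n", readAs(ByteOrder.BIG_ENDIAN));
        System.out.printf("little-endian read: %016x%n", readAs(ByteOrder.LITTLE_ENDIAN));
    }
}
```

On x86 the native order is little-endian, so the "swapped" buffer is the one whose reads need no byte reversal, while the default big-endian buffer is the one that does.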

This reliably reproduces on a Xeon Platinum 8124M. I have not investigated deeply (at least not yet).


Attachments: Test1.java (1 kB, 2024-01-12 07:11)

Duplicates: JDK-8307516 C2 SuperWord: reconsider Reduction heuristic for UnorderedReduction (Open)

Assignee: Emanuel Peter
Reporter: Aleksey Shipilev
Votes: 0
Watchers: 3
Created: 2024-01-11 10:10
Updated: 2024-01-15 09:18
Resolved: 2024-01-14 23:37
