Type: Enhancement
Resolution: Fixed
Priority: P4
Fix Version/s: 27
Affects Version/s: 26
Component/s: hotspot
Labels:

Subcomponent: compiler
Resolved In Build: b03

First investigation into benchmarks done here:
https://github.com/openjdk/jdk/pull/26747#issuecomment-3269114783 / ~~JDK-8365290~~.

It seems to me that people are making decisions about fill and copy intrinsics based on benchmarks that are noisy and do not properly control for alignment. That can give us misleading results.

It turns out that we barely have any fill and copy benchmarks that really test automatic alignment.

We should also compare to auto-vectorization performance.

We should test Arrays.fill, System.arraycopy, and also some MemorySegment bulk operations. We should then compare these to naive loops, with the intrinsics both enabled and disabled (-XX:-OptimizeFill).
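
As a rough illustration of the variants to compare, the sketch below pairs each library call with an equivalent naive loop over an int[]. The class and method names (FillCopyVariants, fillLoop, copyLoop, and so on) are hypothetical and not taken from the JDK micro-benchmarks; a real benchmark would presumably wrap such methods in JMH.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the JDK micro-benchmarks.
final class FillCopyVariants {
    // Naive fill loop over the index range [from, to).
    static void fillLoop(int[] a, int from, int to, int value) {
        for (int i = from; i < to; i++) {
            a[i] = value;
        }
    }

    // Library fill over the same index range.
    static void fillLibrary(int[] a, int from, int to, int value) {
        Arrays.fill(a, from, to, value);
    }

    // Naive copy loop of len elements.
    static void copyLoop(int[] src, int srcPos, int[] dst, int dstPos, int len) {
        for (int i = 0; i < len; i++) {
            dst[dstPos + i] = src[srcPos + i];
        }
    }

    // Library copy of the same len elements.
    static void copyLibrary(int[] src, int srcPos, int[] dst, int dstPos, int len) {
        System.arraycopy(src, srcPos, dst, dstPos, len);
    }
}
```

Sweeping from, srcPos and dstPos over a small range (for example 0 to 15) shifts the start of the accessed region relative to the array base, which is one way to keep alignment from being an uncontrolled variable.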

Also look at JDK-8299808, and the discussion there.

We could take a similar approach as in ~~JDK-8355094~~ with:
test/micro/org/openjdk/bench/vm/compiler/VectorAutoAlignment.java

We should also go through the benchmarks mentioned in
https://github.com/openjdk/jdk/pull/26747#issuecomment-3269114783
and see if they still behave as the comments in them suggest:
- alignment assumptions
- performance assumptions / comparison with SuperWord, especially after ~~JDK-8324751~~.

This is also a really good way to better understand the performance of auto-vectorization (SuperWord) on small iteration counts. This is where the intrinsics are currently much better than auto-vectorization. See also JDK-8344085. But it is possible that auto-vectorization is actually faster with large iteration counts.

For MemorySegment, we already have:
- ./test/micro/org/openjdk/bench/java/lang/foreign/BulkOps.java
- ./test/micro/org/openjdk/bench/java/lang/foreign/SegmentBulkFill.java
- ./test/micro/org/openjdk/bench/java/lang/foreign/SegmentBulkCopy.java
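
For reference, the MemorySegment operations in question and their naive-loop counterparts might look roughly like the sketch below. It is not taken from the benchmarks listed above; the class SegmentBulkVariants and its helper heapSlice are hypothetical, and the code assumes a JDK with the finalized java.lang.foreign API (JDK 22 or later).

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical helper for illustration only; assumes JDK 22+ (final FFM API).
final class SegmentBulkVariants {
    // Naive byte-wise fill loop.
    static void fillLoop(MemorySegment seg, byte value) {
        for (long i = 0; i < seg.byteSize(); i++) {
            seg.set(ValueLayout.JAVA_BYTE, i, value);
        }
    }

    // Bulk fill through the MemorySegment API.
    static void fillBulk(MemorySegment seg, byte value) {
        seg.fill(value);
    }

    // Naive byte-wise copy loop of the first `bytes` bytes.
    static void copyLoop(MemorySegment src, MemorySegment dst, long bytes) {
        for (long i = 0; i < bytes; i++) {
            dst.set(ValueLayout.JAVA_BYTE, i, src.get(ValueLayout.JAVA_BYTE, i));
        }
    }

    // Bulk copy through the MemorySegment API.
    static void copyBulk(MemorySegment src, MemorySegment dst, long bytes) {
        MemorySegment.copy(src, 0, dst, 0, bytes);
    }

    // Heap-backed segment that starts `offset` bytes into a fresh byte[],
    // so the start of the segment can be shifted relative to the array base.
    static MemorySegment heapSlice(int offset, int size) {
        return MemorySegment.ofArray(new byte[offset + size]).asSlice(offset, size);
    }
}
```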

We should also make sure to check fill with zero separately, since some platforms are much faster when they zero out memory.

We should also check the impact of Lilliput / CompactObjectHeaders, as those change the alignment of some element types.

We should also benchmark Oop copy / fill. Auto-vectorization could pay off here too, though it would be harder because of GC barriers in the vectorized LoadP and StoreP.
./java -XX:CompileCommand=compileonly,TestOopCopy::copy* -XX:CompileCommand=printcompilation,TestOopCopy::copy* -Xbatch TestOopCopy.java
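
The attached TestOopCopy.java is not reproduced here. The sketch below is only a guess at the kind of reference-array loops one would compare, under the assumption that the loops of interest are plain element-wise copies and fills of Object[]; the class name OopCopySketch and all method names are hypothetical.

```java
// Hypothetical sketch; NOT the attached TestOopCopy.java.
final class OopCopySketch {
    // Naive reference copy: each iteration loads and stores one reference.
    static void copyLoop(Object[] src, Object[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = src[i];
        }
    }

    // Library copy of the same elements.
    static void copyLibrary(Object[] src, Object[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    // Naive reference fill with a single value.
    static void fillLoop(Object[] a, Object value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = value;
        }
    }
}
```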

arrays_linux_aarch64.png (2.45 MB, 2025-11-07 01:31)
arrays_linux_x64_oci.png (3.07 MB, 2025-11-07 01:31)
arrays_macosx_aarch64.png (2.65 MB, 2025-11-07 01:31)
arrays_macosx_x64_sandybridge.png (3.06 MB, 2025-11-07 01:31)
arrays_windows_x64_oci.png (3.14 MB, 2025-11-07 01:31)
TestOopCopy.java (1 kB, 2025-09-18 00:26)

relates to

JDK-8344085 C2 SuperWord: improve vectorization for small loop iteration count (Open)
JDK-8368061 C2 SuperWord: allow more control over loop unrolling and super-unrolling (Open)
JDK-8365290 [perf] x86 ArrayFill intrinsic generates SPLIT_STORE for unaligned arrays (Resolved)
JDK-8299808 C2 SuperWord: investigate performance difference to ArrayFill (Open)
JDK-8372544 Performance of bulk memory access intrinsics is inconsistent (Open)

links to

Commit(master) openjdk/jdk/650de99f


Assignee: Emanuel Peter
Reporter: Emanuel Peter
Votes: 0
Watchers: 2

Created: 2025-09-08 23:50
Updated: 2025-12-23 17:34
Resolved: 2025-12-11 23:21
