Type: Enhancement
Resolution: Unresolved
Priority: P4
Fix Version/s: tbd
Affects Version/s: 17, 21, 22
Component/s: hotspot
Labels:
- c2
- c2-regalloc
- dcsswa
- performance
- reproducer-no
- webbug

Subcomponent: compiler
CPU: x86_64
OS: linux_oracle_6.0

ADDITIONAL SYSTEM INFORMATION :
Red Hat Enterprise Linux 8.8 (Ootpa)
openjdk 17.0.7 2023-04-18 LTS
openjdk 21-ea 2023-09-19

A DESCRIPTION OF THE PROBLEM :
When running a simple loop like:

for (int i = 0; i < 512; i++) {
result[i] = a[i] + b[i];
}

where a, b, and result are arrays of size 64 filled with random values,
the JIT produces the following code (tested with various array sizes) for the main part of the loop (and similar code for the other parts):

L0002: vmovdqu32 0x10(%rax,%r11,4),%zmm0
0x00007f0c604ff4db: vaddps 0x10(%rdi,%r11,4),%zmm0,%zmm0
0x00007f0c604ff4e6: vmovdqu32 %zmm0,0x10(%rdx,%r11,4)
0x00007f0c604ff4f1: vmovdqu32 0x50(%rax,%r11,4),%zmm0
0x00007f0c604ff4fc: vaddps 0x50(%rdi,%r11,4),%zmm0,%zmm0
0x00007f0c604ff507: vmovdqu32 %zmm0,0x50(%rdx,%r11,4)
0x00007f0c604ff512: vmovdqu32 0x90(%rax,%r11,4),%zmm0
0x00007f0c604ff51d: vaddps 0x90(%rdi,%r11,4),%zmm0,%zmm0
0x00007f0c604ff528: vmovdqu32 %zmm0,0x90(%rdx,%r11,4)
0x00007f0c604ff533: vmovdqu32 0xd0(%rax,%r11,4),%zmm0
0x00007f0c604ff53e: vaddps 0xd0(%rdi,%r11,4),%zmm0,%zmm0
0x00007f0c604ff549: vmovdqu32 %zmm0,0xd0(%rdx,%r11,4) ;*fastore {reexecute=0 rethrow=0 return_oop=0}

As the listing shows, the loop uses only one SIMD register (zmm0) even though modern CPUs have plenty more available, which means it is not taking full advantage of instruction-level parallelism.
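
The loop described above could be exercised with a standalone program along the following lines. This is only a sketch, not the reporter's test case: the class name `VectorAddRepro`, the method name `add`, the warm-up count, and the random seed are all hypothetical. It also assumes float arrays (consistent with the `vaddps` and `fastore` in the listing) whose length equals the loop bound of 512.

```java
import java.util.Random;

// Hypothetical reproducer sketch; names and constants are illustrative,
// not taken from the original report.
public class VectorAddRepro {
    // Assumption: array length matches the loop bound used in the report.
    static final int SIZE = 512;

    // Element-wise float addition, the loop shape shown in the report.
    // The bound is result.length so the method works for any array size.
    static void add(float[] a, float[] b, float[] result) {
        for (int i = 0; i < result.length; i++) {
            result[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed, chosen arbitrarily
        float[] a = new float[SIZE];
        float[] b = new float[SIZE];
        float[] result = new float[SIZE];
        for (int i = 0; i < SIZE; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }

        // Call the method repeatedly so the JIT has a chance to compile it.
        for (int iter = 0; iter < 100_000; iter++) {
            add(a, b, result);
        }

        // Use the result so the work cannot be discarded as dead code.
        float checksum = 0f;
        for (float v : result) {
            checksum += v;
        }
        System.out.println("checksum = " + checksum);
    }
}
```

To inspect the generated code, one option is to run with `-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,VectorAddRepro::add`; this assumes an hsdis disassembler library is available to the JVM, and the exact register allocation seen may differ by JDK build and CPU.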

Assignee: Unassigned
Reporter: Praveen Narayanaswamy
Votes: 0
Watchers: 5

Created: 2023-07-10 04:55
Updated: 2023-08-04 22:55
