
Type: Enhancement
Resolution: Unresolved
Priority: P5
Fix Version/s: tbd
Affects Version/s: 22
Component/s: hotspot
Labels:
- c2-superword
- performance

Subcomponent: compiler
CPU: x86_64

- On legacy Xeons (Skylake, Cascade Lake) we saw significant performance degradation with AVX512 instructions. The causes were the frequency-level switchover penalty, the reduced core frequency, and a hysteresis effect that keeps the subsequent non-AVX512 instruction stream executing at the lower frequency.
- The problem became severe in workloads where short spikes of AVX512 instructions appear in an otherwise AVX2 instruction trace.
- Some of the x86 stubs (array copy / fill) already take AVX3Threshold into consideration, but not all of them do.
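For illustration, the kind of loop involved can be as simple as the element-wise add below, which C2's SuperWord pass may auto-vectorize with 512-bit vectors on an AVX512-capable CPU once the method is hot. The class and method names are hypothetical. When such a loop runs only briefly between long AVX2 or scalar phases, it produces the AVX512 spikes described above. Capping the vector width with `-XX:MaxVectorSize=32` avoids them, but only for every loop in the process at once.

```java
// Hypothetical kernel: an element-wise add that C2's SuperWord pass may
// auto-vectorize once the method is hot. On an AVX512 machine the main loop
// can use 512-bit vectors (16 ints per instruction) unless MaxVectorSize is capped.
class AddKernel {
    static void add(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + c[i];
        }
    }

    public static void main(String[] args) {
        int n = 1 << 10;
        int[] a = new int[n], b = new int[n], c = new int[n];
        for (int i = 0; i < n; i++) {
            b[i] = i;
            c[i] = 2 * i;
        }
        // Warm up so the loop gets compiled by C2 (exact thresholds vary by JVM).
        for (int iter = 0; iter < 20_000; iter++) {
            add(a, b, c);
        }
        System.out.println(a[n - 1]); // 1023 + 2046 = 3069
    }
}
```

Comparing `java AddKernel` with `java -XX:MaxVectorSize=32 AddKernel` (for example under a CPU frequency monitor) shows the difference between the two vector widths. The flag is process-wide, not per loop.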

There are multiple approaches to addressing this problem.
1) Multi-version loops with a guard on AVX3Threshold. However, this may cause code bloat, given that we already split the iteration space into pre, main, post (atomic vector), vector tail and scalar tail loops.
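A simplified model of approach 1, under stated assumptions: each loop version splits the iteration space into a scalar pre-loop (for alignment), an unrolled vector main loop, a single-vector tail and a scalar tail. A hypothetical guard then picks the 64-byte version only when the loop's data footprint reaches a threshold in bytes, mirroring how the x86 copy/fill stubs use AVX3Threshold. The 4096-byte value, the `LoopSplit` class and its methods are all assumptions for illustration, and the atomic vector post loop is omitted. Every extra version carries its own copy of all these loops, which is where the code-bloat concern comes from.

```java
// Hypothetical model of a multi-versioned loop; not C2's actual implementation.
class LoopSplit {
    // Assumed threshold in bytes; stands in for AVX3Threshold.
    static final int AVX3_THRESHOLD_BYTES = 4096;

    // Number of iterations executed by each loop of one version.
    record Split(int pre, int main, int vectorTail, int scalarTail) {}

    // elemBytes: element size; vectorBytes: 32 (AVX2) or 64 (AVX512);
    // unroll: vectors per main-loop iteration; preIters: scalar iterations
    // needed to reach alignment (assumed to be known here).
    static Split split(int n, int elemBytes, int vectorBytes, int unroll, int preIters) {
        int lanes = vectorBytes / elemBytes;
        int pre = Math.min(preIters, n);
        int rest = n - pre;
        int mainStride = lanes * unroll;
        int mainIters = rest / mainStride;   // unrolled vector iterations
        rest -= mainIters * mainStride;
        int tailIters = rest / lanes;        // single-vector iterations
        rest -= tailIters * lanes;
        return new Split(pre, mainIters, tailIters, rest); // rest = scalar iterations
    }

    // Multi-versioning guard: use the 64-byte version only for large footprints.
    static int chooseVectorBytes(int n, int elemBytes) {
        return (long) n * elemBytes >= AVX3_THRESHOLD_BYTES ? 64 : 32;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 5000}) {
            int vb = chooseVectorBytes(n, 4);
            System.out.println(n + " ints -> " + vb + "-byte version: " + split(n, 4, vb, 4, 3));
        }
    }
}
```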

2) Use the loop profiling information collected by C1 and the interpreter to dynamically adjust the VectorSize of each loop.

Such a pass could also take other factors, such as instruction costs and target feature availability, into account when adjusting the per-loop vector size. It may also address some of the issues discovered on KNL (https://bugs.openjdk.org/browse/JDK-8309267).
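Approach 2 could look roughly like the sketch below. It assumes the profile yields an average trip count per loop (for example derived from C1/interpreter counters) and that the target reports its widest supported vector. The `ProfiledLoop`, `Target` and `CostModel` types, the threshold, and the cost function are hypothetical placeholders, not existing C2 code. A loop gets 64-byte vectors only when three conditions hold: the CPU supports them, the profiled data footprint is large enough to amortize any frequency penalty, and the cost model does not rate them slower than 32-byte vectors.

```java
import java.util.List;

// Hypothetical per-loop vector size selection; names and thresholds are illustrative.
class VectorSizePolicy {
    // Profile-derived facts about one loop (assumed to be available to the pass).
    record ProfiledLoop(String name, long avgTripCount, int elemBytes) {}

    // Target description: widest supported vector, whether wide vectors incur a
    // frequency penalty (as on Skylake/Cascade Lake), and a footprint threshold.
    record Target(int maxVectorBytes, boolean wideVectorPenalty, long thresholdBytes) {}

    // Placeholder cost hook: estimated cycles per element for a vector width.
    interface CostModel {
        double cyclesPerElement(int vectorBytes, int elemBytes);
    }

    static int chooseVectorBytes(ProfiledLoop loop, Target t, CostModel cost) {
        int narrow = Math.min(32, t.maxVectorBytes());
        if (t.maxVectorBytes() < 64) {
            return narrow;                          // no 512-bit support
        }
        long footprint = loop.avgTripCount() * loop.elemBytes();
        if (t.wideVectorPenalty() && footprint < t.thresholdBytes()) {
            return narrow;                          // short bursts: avoid AVX512 spikes
        }
        boolean wideNotSlower = cost.cyclesPerElement(64, loop.elemBytes())
                <= cost.cyclesPerElement(32, loop.elemBytes());
        return wideNotSlower ? 64 : narrow;
    }

    public static void main(String[] args) {
        Target skx = new Target(64, true, 4096);
        CostModel ideal = (vb, eb) -> (double) eb / vb;  // wider is always cheaper here
        for (ProfiledLoop l : List.of(new ProfiledLoop("short", 64, 4),
                                      new ProfiledLoop("long", 100_000, 4))) {
            System.out.println(l.name() + " -> " + chooseVectorBytes(l, skx, ideal) + " bytes");
        }
    }
}
```

Keeping the cost model behind an interface is one way to let the same pass handle targets like KNL, where instruction costs rather than frequency drive the decision.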

FTR, from an x86 standpoint this is not a pressing issue, since the latest Xeons no longer have these frequency problems with AVX512.

relates to: JDK-8287697 Limit auto vectorization to 32-byte vector on Cascade Lake (Resolved)

Assignee: Jatin Bhateja
Reporter: Jatin Bhateja
Votes: 0
Watchers: 1
Created: 2023-07-11 22:27
Updated: 2023-08-21 03:47
