 - Type: Enhancement
 - Resolution: Unresolved
 - Priority: P4
 - Affects Version/s: 26
 - Component/s: hotspot
 - x86

On x86 with AVX2, the element-wise MaxV and MinV are implemented for long, but the corresponding min/max reduction is not. I think it should be possible to support the reduction as well.
It already works with AVX512.
I found this while working on JDK-8340093, where vectorizing the long min/max reduction would now be considered profitable.
See the tests in:
test/hotspot/jtreg/compiler/loopopts/superword/TestReductions.java
I attached a Reduction.java for demonstration.
The log below shows that the element-wise MaxV is vectorized (test2), but the max reduction is not (test1). The add reduction (test3) is vectorized, so all the necessary shuffling should already be available. That suggests we should be able to implement a MaxV reduction too.
We should also investigate whether the Vector API has the same issue.
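The attached Reduction.java is not reproduced in this issue. As a rough idea of the three loop shapes involved, here is a hypothetical reconstruction inferred from the log (a max reduction, an element-wise max, and an add reduction). The array size, repetition count, input values, and method bodies are assumptions, not the actual attachment.

```java
// Hypothetical reconstruction; the real attachment may differ in every detail.
final class Reduction {
    static final int SIZE = 10_000; // assumed array length
    static final int REPS = 10_000; // assumed number of warm-up repetitions

    // Max reduction: the shape that is rejected with UseAVX=2 in the log.
    static long test1(long[] a) {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            max = Long.max(max, a[i]);
        }
        return max;
    }

    // Element-wise max against a loop-invariant value: vectorizes as MaxV.
    static void test2(long[] a, long x) {
        for (int i = 0; i < a.length; i++) {
            a[i] = Long.max(a[i], x);
        }
    }

    // Add reduction: vectorizes as AddVL plus AddReductionVL.
    static long test3(long[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] a = new long[SIZE];
        for (int i = 0; i < a.length; i++) {
            a[i] = i * 31L - 5_000L; // arbitrary input values
        }
        long sink = 0; // keeps the results alive
        for (int rep = 0; rep < REPS; rep++) {
            sink += test1(a);
            test2(a, rep);
            sink += test3(a);
        }
        System.out.println(sink);
    }
}
```

The method names match the test* pattern used in the CompileCommand options of the command line, so a file of this shape could be run with the same flags.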
java -Xbatch -XX:CompileCommand=compileonly,Reduction::test* -XX:CompileCommand=printcompilation,Reduction::test* -XX:+TraceNewVectors -XX:UseAVX=2 -XX:CompileCommand=TraceAutoVectorization,Reduction::test*,SW_REJECTIONS Reduction.java
[empeter@emanuel bin]$ ./java -Xbatch -XX:CompileCommand=compileonly,Reduction::test* -XX:CompileCommand=printcompilation,Reduction::test* -XX:+TraceNewVectors -XX:UseAVX=2 -XX:CompileCommand=TraceAutoVectorization,Reduction::test*,SW_REJECTIONS Reduction.java
CompileCommand: compileonly Reduction.test* bool compileonly = true
CompileCommand: PrintCompilation Reduction.test* bool PrintCompilation = true
CompileCommand: TraceAutoVectorization Reduction.test* const char* TraceAutoVectorization = 'SW_REJECTIONS'
4018 97 % b 3 Reduction::test1 @ 4 (26 bytes)
4020 98 b 3 Reduction::test1 (26 bytes)
4021 99 % b 4 Reduction::test1 @ 4 (26 bytes)
SuperWord::transform_loop:
Loop: N562/N162 limit_check counted [int,int),+4 (10243 iters) main multiversion_fast has_sfpt strip_mined
562 CountedLoop === 562 275 162 [[ 557 561 562 271 565 566 476 236 ]] inner stride: 4 main of N562 strip mined multiversion_fast !orig=[473],[276],[245],[223] !jvms: Reduction::test1 @ bci:13 (line 18)
WARNING: Removed pack: not implemented at any smaller size:
0: 547 MaxL === _ 565 548 [[ 544 ]] !orig=463,225,199 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
1: 544 MaxL === _ 547 545 [[ 463 ]] !orig=225,199 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
2: 463 MaxL === _ 544 464 [[ 225 ]] !orig=225,199 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
3: 225 MaxL === _ 463 226 [[ 277 565 383 ]] !orig=199 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
WARNING: Removed pack: not profitable:
0: 548 LoadL === 394 7 549 [[ 547 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #long !orig=464,226,188 !jvms: Reduction::test1 @ bci:13 (line 18)
1: 545 LoadL === 394 7 546 [[ 544 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #long !orig=226,188 !jvms: Reduction::test1 @ bci:13 (line 18)
2: 464 LoadL === 394 7 465 [[ 463 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #long !orig=226,188 !jvms: Reduction::test1 @ bci:13 (line 18)
3: 226 LoadL === 394 7 227 [[ 225 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=6; #long !orig=188 !jvms: Reduction::test1 @ bci:13 (line 18)
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
4034 100 b 4 Reduction::test1 (26 bytes)
SuperWord::transform_loop:
Loop: N471/N170 limit_check counted [int,int),+4 (10243 iters) main has_sfpt strip_mined
471 CountedLoop === 471 186 170 [[ 471 182 474 477 ]] inner stride: 4 main of N471 strip mined !orig=[400],[187],[178],[116] !jvms: Reduction::test1 @ bci:10 (line 18)
WARNING: Removed pack: not implemented at any smaller size:
0: 461 MaxL === _ 474 462 [[ 460 ]] !orig=395,157,411 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
1: 460 MaxL === _ 461 464 [[ 395 ]] !orig=157,411 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
2: 395 MaxL === _ 460 396 [[ 157 ]] !orig=157,411 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
3: 157 MaxL === _ 395 225 [[ 474 188 332 ]] !orig=411 !jvms: Long::max @ bci:2 (line 1984) Reduction::test1 @ bci:14 (line 18)
WARNING: Removed pack: not profitable:
0: 462 LoadL === 352 7 463 [[ 461 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5; #long (does not depend only on test, unknown control) !orig=396,225,[146] !jvms: Reduction::test1 @ bci:13 (line 18)
1: 464 LoadL === 352 7 465 [[ 460 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5; #long (does not depend only on test, unknown control) !orig=225,[146] !jvms: Reduction::test1 @ bci:13 (line 18)
2: 396 LoadL === 352 7 397 [[ 395 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5; #long (does not depend only on test, unknown control) !orig=225,[146] !jvms: Reduction::test1 @ bci:13 (line 18)
3: 225 LoadL === 352 7 144 [[ 157 ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):exact+any *, idx=5; #long (does not depend only on test, unknown control) !orig=[146] !jvms: Reduction::test1 @ bci:13 (line 18)
SuperWord::transform_loop failed: SuperWord::SLP_extract did not vectorize
4153 101 % b 3 Reduction::test2 @ 2 (25 bytes)
4154 102 b 3 Reduction::test2 (25 bytes)
4155 103 % b 4 Reduction::test2 @ 2 (25 bytes)
SuperWord::transform_loop:
Loop: N591/N152 limit_check counted [int,int),+4 (10243 iters) main multiversion_fast has_sfpt strip_mined
591 CountedLoop === 591 295 152 [[ 571 574 585 590 591 291 594 595 489 503 255 242 ]] inner stride: 4 main of N591 strip mined multiversion_fast !orig=[500],[296],[263],[239] !jvms: Reduction::test2 @ bci:12 (line 25)
TraceNewVectors [AutoVectorization]: 639 Replicate === _ 23 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 640 LoadVector === 421 595 577 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectory<J,4>
TraceNewVectors [AutoVectorization]: 641 MaxV === _ 640 639 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 642 StoreVector === 591 595 577 641 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6;
SuperWord::transform_loop: success
4173 104 b 4 Reduction::test2 (25 bytes)
SuperWord::transform_loop:
Loop: N505/N188 limit_check counted [int,int),+4 (10243 iters) main has_sfpt strip_mined
505 CountedLoop === 505 213 188 [[ 492 495 505 508 509 426 209 175 ]] inner stride: 4 main of N505 strip mined !orig=[432],[214],[205],[113] !jvms: Reduction::test2 @ bci:8 (line 25)
TraceNewVectors [AutoVectorization]: 574 Replicate === _ 143 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 575 LoadVector === 383 509 498 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectory<J,4> (does not depend only on test, unknown control)
TraceNewVectors [AutoVectorization]: 576 MaxV === _ 575 574 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 577 StoreVector === 505 509 498 576 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched Memory: @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5;
SuperWord::transform_loop: success
4234 105 % b 3 Reduction::test3 @ 4 (24 bytes)
4235 106 b 3 Reduction::test3 (24 bytes)
4237 107 % b 4 Reduction::test3 @ 4 (24 bytes)
SuperWord::transform_loop:
Loop: N551/N162 limit_check counted [int,int),+4 (10243 iters) main multiversion_fast has_sfpt strip_mined
551 CountedLoop === 551 264 162 [[ 546 550 551 260 554 555 465 225 ]] inner stride: 4 main of N551 strip mined multiversion_fast !orig=[462],[265],[234],[212] !jvms: Reduction::test3 @ bci:13 (line 32)
TraceNewVectors [AutoVectorization]: 599 LoadVector === 383 7 538 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=6; mismatched #vectory<J,4>
TraceNewVectors [AutoVectorization]: 600 Replicate === _ 387 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 601 AddVL === _ 554 599 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 602 AddReductionVL === _ 345 601 [[ ]] no_strict_order
SuperWord::transform_loop: success
4260 108 b 4 Reduction::test3 (24 bytes)
SuperWord::transform_loop:
Loop: N461/N160 limit_check counted [int,int),+4 (10243 iters) main has_sfpt strip_mined
461 CountedLoop === 461 176 160 [[ 461 172 464 467 ]] inner stride: 4 main of N461 strip mined !orig=[390],[177],[168],[116] !jvms: Reduction::test3 @ bci:10 (line 32)
TraceNewVectors [AutoVectorization]: 531 LoadVector === 342 7 453 [[ ]] @long[int:>=0] (java/lang/Cloneable,java/io/Serializable):NotNull:exact+any *, idx=5; mismatched #vectory<J,4> (does not depend only on test, unknown control)
TraceNewVectors [AutoVectorization]: 532 Replicate === _ 22 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 533 AddVL === _ 464 531 [[ ]] #vectory<J,4>
TraceNewVectors [AutoVectorization]: 534 AddReductionVL === _ 301 533 [[ ]] no_strict_order
SuperWord::transform_loop: success
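To illustrate why the existing pieces should be enough, the following scalar model mimics what a 4-lane long max reduction would conceptually do: a lane-wise max in the loop (the part MaxV already covers) followed by a horizontal combine of the lanes (the part that needs shuffling, as the add reduction already does). This is only a sketch of the idea; the class and method names are hypothetical, and it does not describe the code C2 actually emits.

```java
// Scalar model of a 4-lane max reduction; names are hypothetical.
final class MaxReductionSketch {
    static final int LANES = 4; // matches vectory<J,4> in the log

    static long maxReduce4(long[] a) {
        // Vector accumulator, initialized to the identity of max.
        long[] acc = {Long.MIN_VALUE, Long.MIN_VALUE, Long.MIN_VALUE, Long.MIN_VALUE};

        // Main loop: lane-wise max, the scalar equivalent of MaxV.
        int i = 0;
        for (; i + LANES <= a.length; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                acc[lane] = Long.max(acc[lane], a[i + lane]);
            }
        }

        // Horizontal step: combine upper and lower halves, then the last pair.
        long lo = Long.max(acc[0], acc[2]);
        long hi = Long.max(acc[1], acc[3]);
        long result = Long.max(lo, hi);

        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            result = Long.max(result, a[i]);
        }
        return result;
    }
}
```

Because max is associative and commutative, reordering the comparisons this way gives the same result as the sequential loop, which is the same property the log reports for the add reduction with no_strict_order.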
- relates to
 - JDK-8370673 C2 SuperWord [x86]: implement long mul reduction (Open)
 - JDK-8370677 AArch64: C2 SuperWord: implement sequential reduction for add/mul D/F (Open)
 - JDK-8340093 C2 SuperWord: implement cost model (Open)