JDK-8314612

TestUnorderedReduction.java fails with -XX:MaxVectorSize=32 and -XX:+AlignVector


    • b15
    • aarch64
    • generic

        1. How to reproduce the bug

        Change `-XX:MaxVectorSize` to 32 and add `-XX:+AvoidUnalignedAccesses` in `test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java`, as in the diff below. Running the test with the jtreg command that follows then fails with a wrong result.

        ```
        zifeihan@d915263bc793:~/jdk$ git diff test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
        diff --git a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
        index 18f3b6930ea..952a56dd842 100644
        --- a/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
        +++ b/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
        @@ -40,7 +40,8 @@ public class TestUnorderedReduction {
             public static void main(String[] args) {
                 TestFramework.runWithFlags("-Xbatch",
                                            "-XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*",
        -                                   "-XX:MaxVectorSize=16");
        +                                   "-XX:MaxVectorSize=32",
        +                                   "-XX:+AvoidUnalignedAccesses");
             }
         
             @Run(test = {"test1", "test2", "test3"})
        ```

        Run the test with the following command (the JDK used here runs under qemu-user with SVE emulation):

        ```
        /home/zifeihan/jtreg/bin/jtreg \
        -J-Djavatest.maxOutputSize=500000 \
        -Djdk.lang.Process.launchMechanism=vfork \
        -v:default \
        -concurrency:32 \
        -timeout:50 \
        -javaoption:-XX:UseSVE=2 \
        -jdk:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk \
        /home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword/TestUnorderedReduction.java
        ```

        The test fails with the following output:

        ```
        ----------System.out:(19/3921)----------
        Run Flag VM:
        Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dtest.jdk=/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk -Djava.library.path=. -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0 -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses compiler.lib.ir_framework.flag.FlagVM compiler.loopopts.superword.TestUnorderedReduction ]
        [2023-08-19T02:00:01.969752090Z] Gathering output for process 91032
        [2023-08-19T02:00:08.177648302Z] Waiting for completion for process 91032
        [2023-08-19T02:00:08.180735552Z] Waiting for completion finished for process 91032
        Output and diagnostic info for process 91032 was saved into 'pid-91032-output.log'
        [2023-08-19T02:00:08.206587843Z] Waiting for completion for process 91032
        [2023-08-19T02:00:08.208777677Z] Waiting for completion finished for process 91032
        Run Test VM - [-Xbatch, -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test*, -XX:MaxVectorSize=32, -XX:+AvoidUnalignedAccesses]:
        Command line: [/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction ]
        [2023-08-19T02:00:08.628667469Z] Gathering output for process 91057
        [2023-08-19T02:00:16.553228083Z] Waiting for completion for process 91057
        [2023-08-19T02:00:16.554576500Z] Waiting for completion finished for process 91057
        Output and diagnostic info for process 91057 was saved into 'pid-91057-output.log'
        [2023-08-19T02:00:16.712613250Z] Waiting for completion for process 91057
        [2023-08-19T02:00:16.713073750Z] Waiting for completion finished for process 91057
        [2023-08-19T02:00:16.719087791Z] Waiting for completion for process 91057
        [2023-08-19T02:00:16.723273500Z] Waiting for completion finished for process 91057

        ----------System.err:(65/4837)----------

        TestFramework test VM exited with code 1

        Command Line:
        /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/java -DReproduce=true -cp /home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/compiler/loopopts/superword/TestUnorderedReduction.d:/home/zifeihan/jdk/test/hotspot/jtreg/compiler/loopopts/superword:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0/test/lib:/home/zifeihan/jdk/test/lib:/home/zifeihan/jdk/build/linux-aarch64-server-fastdebug/jdk/bin/JTwork/classes/0:/home/zifeihan/jdk/test/hotspot/jtreg:/home/zifeihan/jtreg/lib/javatest.jar:/home/zifeihan/jtreg/lib/jtreg.jar -Djava.library.path=. -Xbootclasspath/a:. -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Djdk.lang.Process.launchMechanism=vfork -XX:UseSVE=2 -Dir.framework.server.port=34611 -Xbatch -XX:CompileCommand=compileonly,compiler.loopopts.superword.TestUnorderedReduction::test* -XX:MaxVectorSize=32 -XX:+AvoidUnalignedAccesses -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:CompilerDirectivesFile=test-vm-compile-commands-pid-91034.log -XX:CompilerDirectivesLimit=31 -XX:-OmitStackTraceInFastThrow -DShouldDoIRVerification=true -XX:-BackgroundCompilation -XX:CompileCommand=quiet compiler.lib.ir_framework.test.TestVM compiler.loopopts.superword.TestUnorderedReduction


        Error Output
        ------------
        Exception in thread "main" compiler.lib.ir_framework.shared.TestRunException:

        Test Failures (1)
        -----------------
        Custom Run Test: @Run: runTests - @Tests: {test1,test2,test3}:
        compiler.lib.ir_framework.shared.TestRunException: There was an error while invoking @Run method public void compiler.loopopts.superword.TestUnorderedReduction.runTests() throws java.lang.Exception
        at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:162)
        at compiler.lib.ir_framework.test.AbstractTest.run(AbstractTest.java:104)
        at compiler.lib.ir_framework.test.CustomRunTest.run(CustomRunTest.java:89)
        at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:822)
        at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
        at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)
        Caused by: java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:118)
        at java.base/java.lang.reflect.Method.invoke(Method.java:580)
        at compiler.lib.ir_framework.test.CustomRunTest.invokeTest(CustomRunTest.java:159)
        ... 5 more
        Caused by: java.lang.RuntimeException: Wrong result test2: 3469730 != 5772800
        at compiler.loopopts.superword.TestUnorderedReduction.runTests(TestUnorderedReduction.java:65)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
        ... 7 more



        at compiler.lib.ir_framework.test.TestVM.runTests(TestVM.java:857)
        at compiler.lib.ir_framework.test.TestVM.start(TestVM.java:249)
        at compiler.lib.ir_framework.test.TestVM.main(TestVM.java:164)


          #############################################################
           - To only run the failed tests use -DTest, -DExclude,
             and/or -DScenarios.
           - To also get the standard output of the test VM run with
             -DReportStdout=true or for even more fine-grained logging
             use -DVerbose=true.
          #############################################################
        ```



        2. Reduced test case

        ```
        public class TestUnorderedReduction {
            static final int RANGE = 512;
            static final int ITER = 10;

            public static void main(String[] args) throws Exception {
                final TestUnorderedReduction testUnorderedReduction = new TestUnorderedReduction();
                for (int i = 0; i < 500; i++) {
                    testUnorderedReduction.runTests();
                }
            }

            public void runTests() throws Exception {
                int[] data = new int[RANGE];

                init(data);
                for (int i = 0; i < ITER; i++) {
                    int r1 = test2(data, i);
                    int r2 = ref2(data, i);
                    if (r1 != r2) {
                        throw new RuntimeException("Wrong result test2: " + r1 + " != " + r2);
                    }
                }
            }

            static int test2(int[] data, int sum) {
                for (int i = 0; i < RANGE; i+=8) {
                    sum += 3 * data[i+0];
                    sum += 3 * data[i+1];
                    sum += 3 * data[i+2];
                    sum += 3 * data[i+3];
                    sum += 3 * data[i+4];
                    sum += 3 * data[i+5];
                    sum += 3 * data[i+6];
                    sum += 3 * data[i+7];
                }
                return sum;
            }

            static int ref2(int[] data, int sum) {
                for (int i = 0; i < RANGE; i+=8) {
                    sum += 3 * data[i+0];
                    sum += 3 * data[i+1];
                    sum += 3 * data[i+2];
                    sum += 3 * data[i+3];
                    sum += 3 * data[i+4];
                    sum += 3 * data[i+5];
                    sum += 3 * data[i+6];
                    sum += 3 * data[i+7];
                }
                return sum;
            }

            static void init(int[] data) {
                for (int i = 0; i < RANGE; i++) {
                    data[i] = 1;
                }
            }
        }
        ```

        Running this reduced test case with `./java -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction` passes. It fails when `-XX:+AvoidUnalignedAccesses` is added: `./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction`.
        ```
        zifeihan@d915263bc793:~/jdk/build/linux-aarch64-server-fastdebug/jdk/bin$ /home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java-bak -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction


        CompileCommand: compileonly TestUnorderedReduction.test* bool compileonly = true
        Exception in thread "main" java.lang.RuntimeException: Wrong result test2: 1034 != 1538
                at TestUnorderedReduction.runTests(TestUnorderedReduction.java:20)
                at TestUnorderedReduction.main(TestUnorderedReduction.java:8)
        ```
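
        As a sanity check on the numbers in this failure message, the sketch below recomputes the expected value in closed form. It assumes that the failing call used the starting value `sum = 2` (inferred from 1538 = 2 + 3 * 512, not stated in the output) and that `data` holds only ones, as set by `init`. The class and method names are hypothetical helpers, not part of the test.

        ```java
        class ExpectedSumCheck {
            static final int RANGE = 512;   // same value as in the reduced test case

            // Closed form of ref2 when every element of data is 1:
            // each of the RANGE elements adds 3 to the incoming sum.
            static int expected(int sum) {
                return sum + 3 * RANGE;
            }

            // Number of elements that contributed to a given result.
            static int elementsSummed(int result, int sum) {
                return (result - sum) / 3;
            }

            public static void main(String[] args) {
                int sum = 2;  // assumption: inferred from the reported 1538
                int summed = elementsSummed(1034, sum);
                System.out.println("expected result  = " + expected(sum));
                System.out.println("elements summed  = " + summed);
                System.out.println("elements missing = " + (RANGE - summed));
            }
        }
        ```

        Under these assumptions the wrong result 1034 corresponds to 344 of the 512 elements, so 168 elements did not reach the final sum.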

        SVE is emulated with qemu-user, with the SVE vector length set to 256 bits.

        ```
        /home/zifeihan/qemu-7.1.0-rc1-aarch64/bin/qemu-aarch64 -cpu max,sve256=on ./java -XX:+AvoidUnalignedAccesses -Xbatch -XX:CompileCommand=compileonly,TestUnorderedReduction::test* -XX:UseSVE=2 TestUnorderedReduction
        ```

        3. C2 JIT code

        3.1 C2 JIT code for TestUnorderedReduction::test2 when test case passes.

        ```
        160 B15: # out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
        160 spill R13 -> R10 # spill size = 32

        164 B16: # out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
        164 add R12, R1, R10, I2L #2 # ptr
        168 add R13, R12, #16 # ptr
        16c loadV V17, [R13] # vector (sve)
        170 add R13, R12, #48 # ptr
        174 loadV V18, [R13] # vector (sve)
        178 vlsl_imm V19, V17, #1
        17c add R13, R12, #80 # ptr
        180 vaddI V17, V19, V17
        184 loadV V19, [R13] # vector (sve)
        188 vlsl_imm V20, V18, #1
        18c vaddI V16, V16, V17
        190 vaddI V17, V20, V18
        194 add R13, R12, #112 # ptr
        198 loadV V18, [R13] # vector (sve)
        19c vlsl_imm V20, V19, #1
        1a0 vaddI V16, V16, V17
        1a4 vaddI V17, V20, V19
        1a8 add R13, R12, #144 # ptr
        1ac loadV V19, [R13] # vector (sve)
        1b0 vlsl_imm V20, V18, #1
        1b4 add R13, R12, #176 # ptr
        1b8 vaddI V18, V20, V18
        1bc loadV V20, [R13] # vector (sve)
        1c0 vaddI V16, V16, V17
        1c4 vlsl_imm V17, V19, #1
        1c8 vaddI V16, V16, V18
        1cc vaddI V17, V17, V19
        1d0 add R13, R12, #208 # ptr
        1d4 loadV V18, [R13] # vector (sve)
        1d8 vlsl_imm V19, V20, #1
        1dc add R12, R12, #240 # ptr
        1e0 vaddI V19, V19, V20
        1e4 loadV V20, [R12] # vector (sve)
        1e8 vaddI V16, V16, V17
        1ec vlsl_imm V17, V18, #1
        1f0 vaddI V16, V16, V19
        1f4 vaddI V17, V17, V18
        1f8 vlsl_imm V18, V20, #1
        1fc vaddI V16, V16, V17
        200 vaddI V17, V18, V20
        204 vaddI V16, V16, V17
        208 addw R13, R10, #64
        20c cmpw R13, #456
        210 blt B15 // counted loop end P=0.984636 C=48127.000000

        214 B17: # out( B22 B18 ) <- in( B16 ) Freq: 0.999989
        214 reduce_addI_sve R0, R2, V16 # KILL V17
        220 cmpw R13, #512
        224 bge B22 P=0.500000 C=-1.000000
        ```

        3.2 C2 JIT code for TestUnorderedReduction::test2 when test case fails.

        ```
        170 B15: # out( B16 ) <- in( B16 ) top-of-loop Freq: 64.0845
        170 spill R13 -> R11 # spill size = 32

        174 B16: # out( B15 B17 ) <- in( B13 B15 ) Loop( B16-B15 inner main of N53) Freq: 65.0845
        174 add R13, R10, R11, I2L #2 # ptr
        178 loadV16 V17, [R13, #16] # vector (128 bits)
        17c loadV V18, [R13, #32] # vector (sve)
        180 vlsl_imm V19, V17, #1
        184 loadV V20, [R13, #64] # vector (sve)
        188 vaddI V17, V19, V17
        18c vlsl_imm V19, V18, #1
        190 vaddI V16, V16, V17
        194 vaddI V17, V19, V18
        198 loadV V18, [R13, #96] # vector (sve)
        19c vlsl_imm V19, V20, #1
        1a0 vaddI V16, V16, V17
        1a4 vaddI V17, V19, V20
        1a8 loadV16 V19, [R13, #128] # vector (128 bits)
        1ac vlsl_imm V20, V18, #1
        1b0 vaddI V16, V16, V17
        1b4 vaddI V17, V20, V18
        1b8 loadV16 V18, [R13, #144] # vector (128 bits)
        1bc vlsl_imm V20, V19, #1
        1c0 loadV V21, [R13, #160] # vector (sve)
        1c4 vaddI V19, V20, V19
        1c8 vaddI V16, V16, V17
        1cc vlsl_imm V17, V18, #1
        1d0 vaddI V16, V16, V19
        1d4 vaddI V17, V17, V18
        1d8 loadV V18, [R13, #192] # vector (sve)
        1dc vlsl_imm V19, V21, #1
        1e0 loadV V20, [R13, #224] # vector (sve)
        1e4 vaddI V19, V19, V21
        1e8 vaddI V16, V16, V17
        1ec vlsl_imm V17, V18, #1
        1f0 loadV16 V21, [R13, #256] # vector (128 bits)
        1f4 vaddI V17, V17, V18
        1f8 vaddI V16, V16, V19
        1fc vlsl_imm V18, V20, #1
        200 vaddI V16, V16, V17
        204 vaddI V17, V18, V20
        208 vlsl_imm V18, V21, #1
        20c vaddI V16, V16, V17
        210 vaddI V17, V18, V21
        214 vaddI V16, V16, V17
        218 addw R13, R11, #64
        21c cmpw R13, #456
        220 blt B15 // counted loop end P=0.984636 C=48127.000000

        224 B17: # out( B22 B18 ) <- in( B16 ) Freq: 0.999989
        224 reduce_addI_neon R0, R2, V16 # KILL V17
        230 cmpw R13, #512
        234 bge B22 P=0.500000 C=-1.000000
        ```
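
        The load offsets in the failing listing can be checked with a small sketch. It assumes that each `loadV` reads 32 bytes (the 256-bit SVE length configured in qemu) and each `loadV16` reads 16 bytes; the class name and the table layout are hypothetical.

        ```java
        class LoadCoverageCheck {
            // {byte offset from R13, width in bytes} for each load in the failing loop body,
            // in address order. Widths are assumptions: loadV16 = 16 bytes, loadV = 32 bytes.
            static final int[][] LOADS = {
                {16, 16}, {32, 32}, {64, 32}, {96, 32}, {128, 16},
                {144, 16}, {160, 32}, {192, 32}, {224, 32}, {256, 16}
            };

            // Returns the number of bytes covered, or throws if the loads
            // leave a gap or overlap each other.
            static int contiguousBytes(int[][] loads) {
                int next = loads[0][0];
                for (int[] load : loads) {
                    if (load[0] != next) {
                        throw new IllegalStateException("gap or overlap at offset " + load[0]);
                    }
                    next += load[1];
                }
                return next - loads[0][0];
            }

            public static void main(String[] args) {
                int bytes = contiguousBytes(LOADS);
                System.out.println(bytes + " bytes = " + (bytes / Integer.BYTES) + " ints per iteration");
            }
        }
        ```

        Under these assumptions the ten loads cover 256 contiguous bytes, that is 64 ints, which matches the loop increment `addw R13, R11, #64`. The loads alone therefore do not appear to skip any array elements.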

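        In the failing case, the loop mixes `loadV16` (128-bit) loads with `loadV` (SVE) loads, whereas the passing case uses only `loadV` (SVE) loads and ends with `reduce_addI_sve`. The final `reduce_addI_neon R0, R2, V16 # KILL V17` in the failing case reduces V16, which is produced by the `vaddI V16, V16, V17` instructions in the loop. V16 and V17 do not have the same vector length there, so some data may be omitted or over-processed, which would explain the wrong sum.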
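
        The kind of data loss that a vector length mismatch can cause is illustrated by the toy model below. It is only a sketch in plain Java, not how HotSpot executes the loop, and it does not try to reproduce the exact wrong value from the report. The lane counts assume 32-bit ints in a 256-bit SVE vector and a 128-bit NEON vector; all names are hypothetical.

        ```java
        class MixedWidthReductionModel {
            static final int SVE_LANES = 8;   // 256-bit vector of 32-bit ints
            static final int NEON_LANES = 4;  // 128-bit vector of 32-bit ints

            // Models a full-width vector add of 3 * data[offset .. offset + 7] into acc.
            static void accumulate(int[] acc, int[] data, int offset) {
                for (int lane = 0; lane < SVE_LANES; lane++) {
                    acc[lane] += 3 * data[offset + lane];
                }
            }

            // Models a horizontal add over the first `lanes` lanes of acc.
            static int reduce(int[] acc, int lanes) {
                int sum = 0;
                for (int lane = 0; lane < lanes; lane++) {
                    sum += acc[lane];
                }
                return sum;
            }

            public static void main(String[] args) {
                int[] data = new int[16];
                java.util.Arrays.fill(data, 1);
                int[] acc = new int[SVE_LANES];
                accumulate(acc, data, 0);
                accumulate(acc, data, 8);
                System.out.println("full-width reduction: " + reduce(acc, SVE_LANES));
                System.out.println("128-bit reduction:    " + reduce(acc, NEON_LANES));
            }
        }
        ```

        In this model the full-width reduction yields 48, while a reduction over only the low 128 bits yields 24, because the upper four lanes of the accumulator are never added to the result.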

              epeter Emanuel Peter
              gcao Gui Cao