C2 VectorAPI AVX512: wrong result with 32B and 64B MUL reduction

XMLWordPrintable

      Found while working on JDK-8369699.

      I could only reproduce it with AVX512, and not even with AVX2.
      It also reproduced all the way back to JDK16, to the build after JDK-8223347, so this seems to be a bug since the beginning of the Vector API.

      Reproduce:
      java -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=printcompilation,Test::test -Xbatch Test.java
      ...
      gold = -3
      val = 1
      Exception in thread "main" java.lang.RuntimeException: bad value
      at Test.main(Test.java:17)

      For more info:
      java -XX:CompileCommand=compileonly,Test::test -XX:CompileCommand=printcompilation,Test::test -XX:CompileCommand=printassembly,Test::test -Xbatch -XX:+TraceNewVectors -XX:+PrintIdeal Test.java


      ----------------------------------


      I suspect the bug is somewhere around the 32B mul reduction, something seems to go wrong with the registers.
      The code below is generated in C2_MacroAssembler::mulreduce32B, by the AVX512 part.

        0x00007ffab8bb7734: vmovdqu 0x10(%rsi),%ymm0

      Load 32B values from array:
      ymm0 = 0x...101010101010101010101010101fd01

        0x00007ffab8bb7739: movabs $0x623aa7050,%r10 ; {oop(a 'java/lang/Byte'[256] {0x0000000623aa7050})}
        0x00007ffab8bb7743: mov $0x1,%r11d
        0x00007ffab8bb7749: vpmovsxbw %ymm0,%zmm2

      Cast the ymm0 with 32 byte values to zmm2 with 32 shorts, so we can do short reduction.
      zmm2 = 0x...100010001000100010001fffd0001

        0x00007ffab8bb774f: vextracti64x4 $0x1,%zmm2,%ymm2

      Take the upper half of zmm2 (all 0x0001 values), and store it over the lower half (ymm2).
      The problem: the lower half is now lost. The 0xfffd value is now only to be found in ymm0,
      but we never use that value again, and so we lose the -3 (0xfffd) value.

      zmm2 = v4_int128 = {0x10001000100010001000100010001, 0x10001000100010001000100010001, 0x0, 0x0}

        0x00007ffab8bb7756: vpmullw %ymm2,%ymm2,%ymm2
        0x00007ffab8bb775a: vextracti128 $0x1,%ymm2,%xmm1
        0x00007ffab8bb7760: vpmullw %xmm2,%xmm1,%xmm1
        0x00007ffab8bb7764: vpshufd $0xe,%xmm1,%xmm2
        0x00007ffab8bb7769: vpmullw %xmm1,%xmm2,%xmm2
        0x00007ffab8bb776d: vpshufd $0x1,%xmm2,%xmm1
        0x00007ffab8bb7772: vpmullw %xmm2,%xmm1,%xmm1
        0x00007ffab8bb7776: vmovdqu %xmm1,%xmm2
        0x00007ffab8bb777a: vpsrldq $0x2,%xmm2,%xmm2
        0x00007ffab8bb777f: vpmullw %xmm1,%xmm2,%xmm2
        0x00007ffab8bb7783: vmovd %r11d,%xmm1
        0x00007ffab8bb7788: vpmovsxwd %xmm2,%xmm2
        0x00007ffab8bb778d: vpmulld %xmm1,%xmm2,%xmm2
        0x00007ffab8bb7792: vpextrw $0x0,%xmm2,%r9d
        0x00007ffab8bb7797: movswl %r9w,%r9d
        0x00007ffab8bb779b: movsbl %r9b,%r11d
        0x00007ffab8bb779f: movslq %r11d,%r11
        0x00007ffab8bb77a2: mov 0x210(%r10,%r11,4),%r11d
        0x00007ffab8bb77aa: lea (%r12,%r11,8),%rax
        0x00007ffab8bb77ae: vzeroupper
        0x00007ffab8bb77b1: add $0x20,%rsp
        0x00007ffab8bb77b5: pop %rbp
        0x00007ffab8bb77b6: cmp 0x28(%r15),%rsp ; {poll_return}
        0x00007ffab8bb77ba: ja 0x00007ffab8bb7854
        0x00007ffab8bb77c0: retq


      ----------------------------------------


      I'm a bit confused why such a bug for such a simple (trivial) code shape has not been found before. For example, this test was supposed to catch it:
      test/jdk/jdk/incubator/vector/Byte256VectorTests.java


      Aaaah, I see why! Our generators always lead to a zero reduction:
        1104 static final List<IntFunction<byte[]>> BYTE_GENERATORS = List.of(
        1105 withToString("byte[-i * 5]", (int s) -> {
        1106 return fill(s * BUFFER_REPS,
        1107 i -> (byte)(-i * 5));
        1108 }),
        1109 withToString("byte[i * 5]", (int s) -> {
        1110 return fill(s * BUFFER_REPS,
        1111 i -> (byte)(i * 5));
        1112 }),
        1113 withToString("byte[i + 1]", (int s) -> {
        1114 return fill(s * BUFFER_REPS,
        1115 i -> (((byte)(i + 1) == 0) ? 1 : (byte)(i + 1)));
        1116 }),
        1117 withToString("byte[cornerCaseValue(i)]", (int s) -> {
        1118 return fill(s * BUFFER_REPS,
        1119 i -> cornerCaseValue(i));
        1120 })
        1121 );

      You only need 8 even numbers to overflow all bits, or a single zero. Here, we have 32B values in a vector. Even the upper half of those values (the part we extract down) now has 16B values. If half of them are even, we get 8 bits shifted up, and so we have no non-zero bits left. If we only contain a single zero in the generator numbers, we also get a zero result.

      So now I added an additional generator to the test, and the test indeed fails!

      diff --git a/test/jdk/jdk/incubator/vector/Byte256VectorTests.java b/test/jdk/jdk/incubator/vector/Byte256VectorTests.java
      index ec8958e0605..d7529902afc 100644
      --- a/test/jdk/jdk/incubator/vector/Byte256VectorTests.java
      +++ b/test/jdk/jdk/incubator/vector/Byte256VectorTests.java
      @@ -1114,6 +1114,10 @@ static byte bits(byte e) {
                       return fill(s * BUFFER_REPS,
                                   i -> (((byte)(i + 1) == 0) ? 1 : (byte)(i + 1)));
                   }),
      + withToString("special", (int s) -> {
      + return fill(s * BUFFER_REPS,
      + i -> (i == 1) ? (byte)3 : (byte)1);
      + }),
                   withToString("byte[cornerCaseValue(i)]", (int s) -> {
                       return fill(s * BUFFER_REPS,
                                   i -> cornerCaseValue(i));

      My conclusion: the test generators here are very weak, and NOT SUFFICIENT to give us sufficient coverage.
      That is a bit frustrating, but luckily we now are working on JDK-8369699 which will give us much better coverage,
      especially because we also test with random inputs.

        1. Test.java
          0.8 kB
          Emanuel Peter

            Assignee:
            Unassigned
            Reporter:
            Emanuel Peter
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated: