Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8304720

SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph

XMLWordPrintable

    • b22

      During work of JDK-8304042 (and ultimately of JDK-8298935), I found an example where we have no cyclic dependency, but the C2 graph after SuperWord is not correct. It leads to WRONG RESULTS.

      The issue is that we do not re-order the graph for non-vectorized memory operations. But the vectorization of some memory operations may lead to a reordering of memory ops, not just for the vectorized ones but also the unvectorized ones.

      In SuperWord::co_locate_pack we have some complicated logic for Stores, where we try to find out what needs to be scheduled before (and presumably after). But we have nothing equivalent for Load.

      We could try to implement something equivalent to the logic for Stores. But I am not convinced that this works, because we might have to also re-order stores that are not packed.

      My suggestion:
      We schedule all packs and left over scalars into a linear order (I already do that in JDK-8304042). From that order, we can re-construct all memory-state edges in the C2 memory slices.

      Reproduce like this:

      ./java -XX:-TieredCompilation -Xbatch -XX:CompileCommand=compileonly,Test4::test -XX:CompileCommand=printcompilation,Test4::* -XX:LoopUnrollLimit=250 -XX:+TraceSuperWord Test4.java

      ./java -XX:-TieredCompilation -Xbatch --add-modules java.base --add-exports java.base/jdk.internal.misc=ALL-UNNAMED -XX:CompileCommand=compileonly,Test5::test -XX:CompileCommand=printcompilation,Test5::* -XX:LoopUnrollLimit=250 -XX:+TraceSuperWord -XX:LoopMaxUnroll=5 Test5.java

      I will describe Test4.java. Test5.java is very similar, just uses Unsafe so that it reproduces before we implemented conversions in SuperWord.

      We have:

              for (int i = 0; i < RANGE; i+=2) {
                  dataFa[i + 0] = dataIa[i + 0] * 1.3f;
                  dataIb[i + 0] = (int)dataFb[i + 0] * 11;
                  dataIb[i + 1] = (int)dataFb[i + 1] * 11;
                  dataFa[i + 1] = dataIa[i + 1] * 1.2f;
              }

      This gets unrolled, for simplicity let's say only 2x.

              for (int i = 0; i < RANGE; i+=4) {
                  dataFa[i + 0] = dataIa[i + 0] * 1.3f; -> scalar, before
                  dataIb[i + 0] = (int)dataFb[i + 0] * 11; -> vectorize
                  dataIb[i + 1] = (int)dataFb[i + 1] * 11; -> vectorize
                  dataFa[i + 1] = dataIa[i + 1] * 1.2f; -> scalar, after
                  dataFa[i + 2] = dataIa[i + 2] * 1.3f; -> scalar, before
                  dataIb[i + 2] = (int)dataFb[i + 2] * 11; -> vectorize
                  dataIb[i + 3] = (int)dataFb[i + 3] * 11; -> vectorize
                  dataFa[i + 3] = dataIa[i + 3] * 1.2f; -> scalar, after
              }

      We can only vectorize half of the statements, the other half does not work because the constants are not matching.

      To enable this vectorization, the statements have to be re-ordered.
      We have to pay attention to the Store-Load dependencies.
      For (i+0) and (i+2), we have a StoreF-LoadF dependency ("before"), for (i+1) and (i+3) we have a StoreI-LoadI dependency ("after").

              for (int i = 0; i < RANGE; i+=4) {
                  dataFa[i + 0] = dataIa[i + 0] * 1.3f; -> scalar, before
                  dataFa[i + 2] = dataIa[i + 2] * 1.3f; -> scalar, before
                  dataIb[i + 0] = (int)dataFb[i + 0] * 11; -> vectorize
                  dataIb[i + 1] = (int)dataFb[i + 1] * 11; -> vectorize
                  dataIb[i + 2] = (int)dataFb[i + 2] * 11; -> vectorize
                  dataIb[i + 3] = (int)dataFb[i + 3] * 11; -> vectorize
                  dataFa[i + 1] = dataIa[i + 1] * 1.2f; -> scalar, after
                  dataFa[i + 3] = dataIa[i + 3] * 1.2f; -> scalar, after
              }

      We see that we had to re-order the memory ops. For example we had to swap the order of the StoreF for (i+1) and (i+2). This must be reflected in the C2 graph. Some of the scalar memory ops must be moved before the vector load/stores, some must be pushed to later. This should be determined by the dependency-graph of SuperWord.

      I think my Test5.java fails for JDK17.

      Task:
      Un-comment "test5" from JDK-8304042.
      And lift the platform restrictions at the test definitions.

            epeter Emanuel Peter
            epeter Emanuel Peter
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: