Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8308105

Multi-array allocations are very slow

    XMLWordPrintable

Details

    Description

      In one of the production workloads, we have seen a hot path through the `OptoRuntime::multianewarray2_C`. The focused benchmark shows that allocating the array piece-wise is significantly faster, because it does not call into runtime.

      ```
      public class MultiArrayAlloc {
          @Param({"1", "2", "4", "8", "16", "32",
                  "64", "128", "256", "512", "1024",
                  "2048", "4096", "8192"})
          int size;

          @Benchmark
          public Object full() {
              return new int[size][size];
          }

          @Benchmark
          public Object piecemeal() {
              int[][] ints = new int[size][];
              for (int c = 0; c < size; c++) {
                  ints[c] = new int[size];
              }
              return ints;
          }
      }
      ```

      For example, for size=4, "full" alloc is 10x slower than "piecemeal" on Mac M1:

      ```
      Benchmark (size) Mode Cnt Score Error Units
      MultiArrayAlloc.full 4 avgt 5 129,543 ± 3,555 ns/op
      MultiArrayAlloc.piecemeal 4 avgt 5 12,665 ± 0,297 ns/op
      ```

      The mitigation for this trouble would be rewriting the Java code paths to use piecewise initialization. We should and could probably improve the runtime paths to alleviate the need for such rewrite.

      This is a multi-faceted issue. This reproduces on both C1 and C2, and the bottleneck is in `ObjArrayKlass::multi_allocate`. In interpreter, "piecemeal" is slower, because it enters runtime for allocation too. The overhead of the actual runtime call is high, especially on W^X systems like MacOS AArch64. The overhead of VM code is high, because it deals with allocation notifications (JDK-8308231), has several non-inlineable virtual calls to construct the multi-array, deals with less efficient zeroing routines, has to check runtime flags (notably in GC barriers). It is very hard to compete with JIT-generated 1-D array allocation code that "piecemeal" does.

      In C2, there is the optimization to use 1-D allocator for smaller multi-arrays (gated by MultiArrayExpandLimit and even then capped at 100), but it only works for small arrays of known length, which is insufficient for a generic case.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              shade Aleksey Shipilev
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated: