Type: Enhancement
Resolution: Unresolved
Priority: P4
Affects Version/s: 17, 20, 21
In one of the production workloads, we have seen a hot path through `OptoRuntime::multianewarray2_C`. The focused benchmark below shows that allocating the array piece-wise is significantly faster, because it does not call into the runtime.
```
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class MultiArrayAlloc {

    @Param({"1", "2", "4", "8", "16", "32",
            "64", "128", "256", "512", "1024",
            "2048", "4096", "8192"})
    int size;

    @Benchmark
    public Object full() {
        // Single multianewarray bytecode; goes through the runtime call.
        return new int[size][size];
    }

    @Benchmark
    public Object piecemeal() {
        // Same resulting shape, built from plain 1-D allocations.
        int[][] ints = new int[size][];
        for (int c = 0; c < size; c++) {
            ints[c] = new int[size];
        }
        return ints;
    }
}
```
For example, for size=4, "full" alloc is 10x slower than "piecemeal" on Mac M1:
```
Benchmark                  (size)  Mode  Cnt    Score   Error  Units
MultiArrayAlloc.full            4  avgt    5  129,543 ± 3,555  ns/op
MultiArrayAlloc.piecemeal       4  avgt    5   12,665 ± 0,297  ns/op
```
The immediate mitigation would be rewriting the affected Java code paths to use piecewise initialization. Ideally, though, we should improve the runtime paths to remove the need for such rewrites.
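As a sketch of what such a piecewise rewrite looks like at a call site (the helper name here is illustrative, not from any JDK code), a 2-D allocation is replaced with a spine allocation plus a loop of row allocations:

```
public class PiecewiseAlloc {
    // Illustrative helper: replaces `new int[rows][cols]` with plain 1-D
    // allocations, which the JIT compiles inline instead of calling into
    // the runtime (OptoRuntime::multianewarray2_C on C2).
    static int[][] allocate2D(int rows, int cols) {
        int[][] result = new int[rows][]; // 1-D allocation of the spine
        for (int r = 0; r < rows; r++) {
            result[r] = new int[cols];    // 1-D allocation per row
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] a = allocate2D(4, 4);
        System.out.println(a.length + "x" + a[0].length); // prints "4x4"
    }
}
```

The result is indistinguishable from `new int[rows][cols]`: every row is a distinct, zero-filled `int[cols]`.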
This is a multi-faceted issue. It reproduces with both C1 and C2, and the bottleneck is in `ObjArrayKlass::multi_allocate`. In the interpreter, "piecemeal" is actually slower, because it enters the runtime for every allocation too. The overhead of the actual runtime call is high, especially on W^X systems like macOS AArch64. The overhead of the VM code is high as well: it deals with allocation notifications (JDK-8308231), makes several non-inlineable virtual calls to construct the multi-array, uses less efficient zeroing routines, and has to check runtime flags (notably in GC barriers). It is very hard to compete with the JIT-generated 1-D array allocation code that "piecemeal" uses.
In C2, there is an optimization that uses the 1-D allocator for smaller multi-arrays (gated by MultiArrayExpandLimit, and even then capped at 100 elements), but it only works for small arrays of statically known length, which is insufficient for the generic case.
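A minimal illustration of that limitation (the comments describe what C2 is expected to do per the description above; they are assumptions, not verified compiler output):

```
public class ExpandLimitDemo {
    static int[][] constantDims() {
        // Dimensions are small compile-time constants, so C2 may expand this
        // into inlined 1-D allocations (subject to MultiArrayExpandLimit).
        return new int[4][4];
    }

    static int[][] variableDims(int n) {
        // Length is unknown at compile time: C2 keeps the runtime call,
        // hitting the slow path discussed above.
        return new int[n][n];
    }

    public static void main(String[] args) {
        System.out.println(constantDims().length + " " + variableDims(8).length);
    }
}
```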
Relates to:
- JDK-8308118: Avoid multiarray allocations in AESCrypt.makeSessionKey (Resolved)
- JDK-8308231: Faster MemAllocator::Allocation checks for verify/notification (Closed)