Type: Enhancement
Resolution: Unresolved
Priority: P4
Affects Version/s: 17, 20, 21
In one of the production workloads, we have seen a hot path through `OptoRuntime::multianewarray2_C`. The focused benchmark below shows that allocating the array piece-wise is significantly faster, because it does not call into the runtime.
```
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class MultiArrayAlloc {

    @Param({"1", "2", "4", "8", "16", "32",
            "64", "128", "256", "512", "1024",
            "2048", "4096", "8192"})
    int size;

    @Benchmark
    public Object full() {
        // Single multianewarray bytecode; goes through the runtime call.
        return new int[size][size];
    }

    @Benchmark
    public Object piecemeal() {
        // Same resulting shape, built from plain 1-D allocations.
        int[][] ints = new int[size][];
        for (int c = 0; c < size; c++) {
            ints[c] = new int[size];
        }
        return ints;
    }
}
```
For example, for size=4, "full" alloc is 10x slower than "piecemeal" on Mac M1:
```
Benchmark                  (size)  Mode  Cnt    Score   Error  Units
MultiArrayAlloc.full            4  avgt    5  129,543 ± 3,555  ns/op
MultiArrayAlloc.piecemeal       4  avgt    5   12,665 ± 0,297  ns/op
```
The immediate mitigation would be rewriting the affected Java code paths to use piecewise initialization. Ideally, though, we should improve the runtime paths to remove the need for such rewrites.
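As a sketch of what such a piecewise rewrite looks like at a call site (the helper name here is illustrative, not from any JDK code), a 2-D allocation is replaced with a spine allocation plus a loop of row allocations:

```
public class PiecewiseAlloc {
    // Illustrative helper: replaces `new int[rows][cols]` with plain 1-D
    // allocations, which the JIT compiles inline instead of calling into
    // the runtime (OptoRuntime::multianewarray2_C on C2).
    static int[][] allocate2D(int rows, int cols) {
        int[][] result = new int[rows][]; // 1-D allocation of the spine
        for (int r = 0; r < rows; r++) {
            result[r] = new int[cols];    // 1-D allocation per row
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] a = allocate2D(4, 4);
        System.out.println(a.length + "x" + a[0].length); // prints "4x4"
    }
}
```

The result is indistinguishable from `new int[rows][cols]`: every row is a distinct, zero-filled `int[cols]`.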
This is a multi-faceted issue. It reproduces with both C1 and C2, and the bottleneck is in `ObjArrayKlass::multi_allocate`. In the interpreter, "piecemeal" is actually slower, because it enters the runtime for every allocation too. The overhead of the actual runtime call is high, especially on W^X systems like macOS AArch64. The overhead of the VM code is high as well: it deals with allocation notifications (JDK-8308231), makes several non-inlineable virtual calls to construct the multi-array, uses less efficient zeroing routines, and has to check runtime flags (notably in GC barriers). It is very hard to compete with the JIT-generated 1-D array allocation code that "piecemeal" uses.
In C2, there is an optimization that uses the 1-D allocator for smaller multi-arrays (gated by MultiArrayExpandLimit, and even then capped at 100 elements), but it only works for small arrays of statically known length, which is insufficient for the generic case.
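A minimal illustration of that limitation (the comments describe what C2 is expected to do per the description above; they are assumptions, not verified compiler output):

```
public class ExpandLimitDemo {
    static int[][] constantDims() {
        // Dimensions are small compile-time constants, so C2 may expand this
        // into inlined 1-D allocations (subject to MultiArrayExpandLimit).
        return new int[4][4];
    }

    static int[][] variableDims(int n) {
        // Length is unknown at compile time: C2 keeps the runtime call,
        // hitting the slow path discussed above.
        return new int[n][n];
    }

    public static void main(String[] args) {
        System.out.println(constantDims().length + " " + variableDims(8).length);
    }
}
```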
Relates to:
- JDK-8308118: Avoid multiarray allocations in AESCrypt.makeSessionKey (Resolved)
- JDK-8308231: Faster MemAllocator::Allocation checks for verify/notification (Closed)