- Enhancement
- Resolution: Unresolved
- P4
- 24
(Apologies for the overly generic synopsis; we should sharpen it if we find a solution/mitigation.)
René Schwietzke noticed an interesting behavior in manual arraycopy benchmarks. The simplest reproducer is:
```
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(1)
public class LoopCounterBench {
    int increment;
    long[] src, dest;

    @Setup
    public void setup() {
        final int SIZE = 1000;
        src = new long[SIZE];
        dest = new long[SIZE];
        increment = 1;
    }

    // Loop update reads the increment from the field.
    @Benchmark
    public long[] field_ret() {
        for (int i = 0; i < src.length; i = i + increment) {
            dest[i] = src[i];
        }
        return dest;
    }

    // Loop update uses the increment cached in a local variable.
    @Benchmark
    public long[] localVar_ret() {
        final int inc = increment;
        for (int i = 0; i < src.length; i = i + inc) {
            dest[i] = src[i];
        }
        return dest;
    }
}
```
...it yields:
```
Benchmark                      Mode  Cnt     Score   Error  Units
LoopCounterBench.field_ret     avgt    5   604.758 ± 0.404  ns/op
LoopCounterBench.localVar_ret  avgt    5  1625.441 ± 0.503  ns/op
```
This result is counter-intuitive: caching the field value in a local variable is significantly slower (about 2.7x here) than reading the field directly. `perfasm` shows that the difference between the fast and slow versions is that the slow version has spills in the loop:
```
Fast:
↗ 0x...6d00: cmp %edx,%r8d
│ 0x...6d03: jae 0x...6d27
│ 0x...6d05: mov 0x10(%r13,%r8,8),%rax
│ 0x...6d0a: cmp %esi,%r8d
│ 0x...6d0d: jae 0x...6d60
│ 0x...6d0f: mov %rax,0x10(%r14,%r8,8)
│ 0x...6d14: add %ecx,%r8d
│ 0x...6d17: mov 0x450(%r15),%rax
│ 0x...6d1e: test %eax,(%rax)
│ 0x...6d20: cmp %edx,%r8d
╰ 0x...6d23: jl 0x...6d00
Slow:
↗ 0x...7390: vmovq %xmm0,%rbp ; <--- UNSPILL
│ 0x...7395: cmp %r10d,%edi
│ 0x...7398: jae 0x...7412
│ 0x...739a: vmovq %rbp,%xmm0 ; <--- SPILL
│ 0x...739f: mov 0x10(%rax,%rdi,8),%rbp
│ 0x...73a4: cmp %esi,%edi
│ 0x...73a6: jae 0x...7450
│ 0x...73ac: mov %rbp,0x10(%r13,%rdi,8)
│ 0x...73b1: add %r9d,%edi
│ 0x...73b4: mov 0x450(%r15),%rbp ; <--- %rbp is used for thread-local poll
│ 0x...73bb: test %eax,0x0(%rbp)
│ 0x...73be: cmp %r10d,%edi
╰ 0x...73c1: jl 0x...7390
```
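For a quick sanity check outside JMH, the two loop shapes can be exercised from a plain `main` method. This is only a rough sketch: the class name `LoopCounterCheck` and its method names are hypothetical, the timing loop is crude, and the JIT may inline and compile these loops differently here than under JMH, so the gap reported above may not reproduce. It mainly confirms that both variants perform the same copy.

```java
import java.util.Arrays;

// Hypothetical plain-Java harness; names are illustrative, not from the issue.
class LoopCounterCheck {
    int increment = 1;
    long[] src = new long[1000];
    long[] dest = new long[1000];

    // Same loop shape as field_ret: the increment is read from the field.
    long[] fieldRet() {
        for (int i = 0; i < src.length; i = i + increment) {
            dest[i] = src[i];
        }
        return dest;
    }

    // Same loop shape as localVar_ret: the increment is cached in a local.
    long[] localVarRet() {
        final int inc = increment;
        for (int i = 0; i < src.length; i = i + inc) {
            dest[i] = src[i];
        }
        return dest;
    }

    public static void main(String[] args) {
        LoopCounterCheck c = new LoopCounterCheck();
        for (int i = 0; i < c.src.length; i++) {
            c.src[i] = i + 1; // non-zero data so a missed copy is detectable
        }

        // Both variants must produce the same copy of src.
        boolean fieldOk = Arrays.equals(c.fieldRet(), c.src);
        Arrays.fill(c.dest, 0L);
        boolean localOk = Arrays.equals(c.localVarRet(), c.src);
        System.out.println("field copy ok: " + fieldOk + ", local copy ok: " + localOk);

        // Crude timing over several rounds so the JIT has time to compile both loops.
        final int reps = 100_000;
        long sink = 0;
        for (int round = 0; round < 5; round++) {
            long t0 = System.nanoTime();
            for (int r = 0; r < reps; r++) sink += c.fieldRet()[r % 1000];
            long t1 = System.nanoTime();
            for (int r = 0; r < reps; r++) sink += c.localVarRet()[r % 1000];
            long t2 = System.nanoTime();
            System.out.printf("round %d: field %.1f ns/op, local %.1f ns/op%n",
                    round, (t1 - t0) / (double) reps, (t2 - t1) / (double) reps);
        }
        System.out.println("sink: " + sink); // keeps the loop results observable
    }
}
```

The JMH numbers and the `perfasm` output above remain the reference measurements; this harness is only a convenience for poking at the loops with other JVM flags.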
- links to
- Review(master) openjdk/jdk/21472