JDK-8341697

C2: Register allocation inefficiency in tight loop

      (Apologies for the overly generic synopsis; we should sharpen it once we find a solution or mitigation.)

      René Schwietzke noticed an interesting behavior in manual arraycopy benchmarks. The simplest reproducer is:

      ```
      import java.util.concurrent.TimeUnit;
      import org.openjdk.jmh.annotations.*;

      @State(Scope.Thread)
      @BenchmarkMode(Mode.AverageTime)
      @OutputTimeUnit(TimeUnit.NANOSECONDS)
      @Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
      @Measurement(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS)
      @Fork(1)
      public class LoopCounterBench {
          int increment;
          long[] src, dest;

          @Setup
          public void setup() {
              final int SIZE = 1000;
              src = new long[SIZE];
              dest = new long[SIZE];
              increment = 1;
          }

          @Benchmark
          public long[] field_ret() {
              for (int i = 0; i < src.length; i = i + increment) {
                  dest[i] = src[i];
              }
              return dest;
          }

          @Benchmark
          public long[] localVar_ret() {
              final int inc = increment;
              for (int i = 0; i < src.length; i = i + inc) {
                  dest[i] = src[i];
              }
              return dest;
          }
      }
      ```

      ...it yields:

      ```
      Benchmark                      Mode  Cnt     Score   Error  Units
      LoopCounterBench.field_ret     avgt    5   604.758 ± 0.404  ns/op
      LoopCounterBench.localVar_ret  avgt    5  1625.441 ± 0.503  ns/op
      ```

      This result is counter-intuitive: caching the field value in a local variable is about 2.7x slower (1625 vs. 605 ns/op) than reading the field directly. `perfasm` shows that the difference between the fast and the slow version is that the slow version spills a value to `%xmm0` and restores it on every loop iteration:

      ```
      Fast:
       ↗ 0x...6d00: cmp %edx,%r8d
       │ 0x...6d03: jae 0x...6d27
       │ 0x...6d05: mov 0x10(%r13,%r8,8),%rax
       │ 0x...6d0a: cmp %esi,%r8d
       │ 0x...6d0d: jae 0x...6d60
       │ 0x...6d0f: mov %rax,0x10(%r14,%r8,8)
       │ 0x...6d14: add %ecx,%r8d
       │ 0x...6d17: mov 0x450(%r15),%rax
       │ 0x...6d1e: test %eax,(%rax)
       │ 0x...6d20: cmp %edx,%r8d
       ╰ 0x...6d23: jl 0x...6d00


      Slow:
       ↗ 0x...7390: vmovq %xmm0,%rbp ; <--- UNSPILL
       │ 0x...7395: cmp %r10d,%edi
       │ 0x...7398: jae 0x...7412
       │ 0x...739a: vmovq %rbp,%xmm0 ; <--- SPILL
       │ 0x...739f: mov 0x10(%rax,%rdi,8),%rbp
       │ 0x...73a4: cmp %esi,%edi
       │ 0x...73a6: jae 0x...7450
       │ 0x...73ac: mov %rbp,0x10(%r13,%rdi,8)
       │ 0x...73b1: add %r9d,%edi
       │ 0x...73b4: mov 0x450(%r15),%rbp ; <--- %rbp is used for thread-local poll
       │ 0x...73bb: test %eax,0x0(%rbp)
       │ 0x...73be: cmp %r10d,%edi
       ╰ 0x...73c1: jl 0x...7390
      ```
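
As a sanity check that the two loop shapes are functionally identical, so that the gap can only come from the generated code, here is a minimal JMH-free sketch. The class name `LoopShapes` and the helper `sameResult` are hypothetical and not part of the original report; the sketch only compares results and does not measure anything.

```java
import java.util.Arrays;

// Hypothetical standalone mirror of the two benchmark bodies (no JMH required).
public class LoopShapes {
    int increment = 1;
    long[] src, dest;

    LoopShapes(int size) {
        src = new long[size];
        dest = new long[size];
        // Fill the source with non-zero data so that the copy is observable.
        for (int i = 0; i < size; i++) {
            src[i] = i * 31L + 7;
        }
    }

    // Same shape as field_ret: the increment is re-read from the field.
    long[] fieldRet() {
        for (int i = 0; i < src.length; i = i + increment) {
            dest[i] = src[i];
        }
        return dest;
    }

    // Same shape as localVar_ret: the increment is cached in a local.
    long[] localVarRet() {
        final int inc = increment;
        for (int i = 0; i < src.length; i = i + inc) {
            dest[i] = src[i];
        }
        return dest;
    }

    // True if both variants produce the same destination contents,
    // and the copy is complete for the increment of 1 used here.
    static boolean sameResult(int size) {
        LoopShapes a = new LoopShapes(size);
        LoopShapes b = new LoopShapes(size);
        return Arrays.equals(a.fieldRet(), b.localVarRet())
                && Arrays.equals(a.dest, a.src);
    }

    public static void main(String[] args) {
        System.out.println(sameResult(1000));
    }
}
```

Both methods copy the same elements, so any timing difference between them in the benchmark should be attributable to how C2 compiles each loop rather than to the work performed.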

            qamai Quan Anh Mai
            shade Aleksey Shipilev