Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8134940

TieredCompilation profiling can exhibit poor scalability

XMLWordPrintable

    • generic
    • generic

      While stresstesting SPECjvm2008s sor.large benchmark with -XX:TieredStopAtLevel=3, increases in run times from the configured 4 minutes to more than 20 minutes could be observed.

      Using a JMH port of the SOR[1] benchmark to narrow in on this, we observe excessive L1 cache misses, excessive cache coherency traffic and (using -prof perfasm) that we're spending most of the CPU time updating two method invocation counters:

        0.51% 2.73% 0x00007f8f8cc1a60f: vmovsd %xmm0,0x10(%r8,%r14,8) ;*dastore
      ; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@128 (line 47)
        0.31% 3.01% 0x00007f8f8cc1a616: mov $0x7f8f88bcc900,%rax ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
        0.68% 0.30% 0x00007f8f8cc1a620: mov 0xe0(%rax),%edx
       17.21% 17.03% 0x00007f8f8cc1a626: add $0x8,%edx
        1.24% 0.28% 0x00007f8f8cc1a629: mov %edx,0xe0(%rax)
        1.03% 0.97% 0x00007f8f8cc1a62f: mov $0x7f8f88bcc718,%rax ; {metadata({method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
        0.50% 0.04% 0x00007f8f8cc1a639: and $0xfff8,%edx
        0.00% 0.07% 0x00007f8f8cc1a63f: cmp $0x0,%edx
                          0x00007f8f8cc1a642: je 0x00007f8f8cc1a95f ; OopMap{r8=Oop r9=Oop r13=Oop [88]=Oop off=456}
      ;*goto
      ; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@132 (line 46)
        0.02% 0.09% 0x00007f8f8cc1a648: test %eax,0x15568ab2(%rip) # 0x00007f8fa2183100
      ; {poll}
        0.11% 0.90% 0x00007f8f8cc1a64e: mov $0x7f8f88bcc900,%rax ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
        2.26% 0.62% 0x00007f8f8cc1a658: incl 0x168(%rax)
       38.96% 11.75% 0x00007f8f8cc1a65e: mov %rsi,%r14 ;*goto
      ; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@132 (line 46)
        0.24% 0.04% 0x00007f8f8cc1a661: cmp %ebx,%r14d
        0.00% 0.06% 0x00007f8f8cc1a664: mov $0x7f8f88bcc900,%rdx ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}


      Testing using JDK8u60 on a dual-socket Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2x8x2) system shows these approximate numbers when running in the mode SPECjvm2008 does (twice the number of Java threads compared to hardware threads, i.e., java -jar bencmarks.jar .*SOR.* -t 64):

      -XX:TieredStopAtLevel=0: 12500 ms/op
      -XX:TieredStopAtLevel=1: 52 ms/op
      -XX:TieredStopAtLevel=2: 1000 ms/op
      -XX:TieredStopAtLevel=3: 8400 ms/op # 160x compared to Level=1
      -XX:TieredStopAtLevel=4: 28 ms/op
      -Xint: 655 ms/op # like StopAtLevel=0 but without profiling, 20x faster

      On a single-core Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz system (1x2x2) running with similar saturation (8 threads):

      -XX:TieredStopAtLevel=0: 645 ms/op
      -XX:TieredStopAtLevel=1: 24 ms/op
      -XX:TieredStopAtLevel=2: 69 ms/op
      -XX:TieredStopAtLevel=3: 185 ms/op
      -XX:TieredStopAtLevel=4: 17 ms/op
      -Xint: 305 ms/op

      This might be a pretty extreme case, but there's indication this does contribute to increased run-to-run variance on a variety of SPECjvm2008 benchmarks, and could be contributing negatively to time-to-performance and variance on a large variety of workloads.

      In the absence of a "branch-on-random" hardware instruction[2], we should consider solutions to try and avoid direct updates to shared state, such as striping (the hottest) invocation counters across cores/sockets or queuing up writes to a thread local buffer. Footprint impact needs to be considered, and perhaps there might be a need for some users to be able to disable whatever mechanism is implemented.

      ILW = Worst case may cause long delays in program warmup and execution(M), possibly affects time-to-performance and run-time variance on a large set of benchmarks (M), disabling TieredCompilation may or may not help (M) = MMM = P3

      [1] http://cr.openjdk.java.net/~redestad/scratch/sor-jmh/
      [2] http://dl.acm.org/citation.cfm?id=1356070

            iveresov Igor Veresov
            redestad Claes Redestad
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated: