Type: Enhancement
Resolution: Unresolved
Priority: P3
Affects Version/s: 8u60, 9, 10
OS: generic
CPU: generic
While stress-testing SPECjvm2008's sor.large benchmark with -XX:TieredStopAtLevel=3, run times increased from the configured 4 minutes to more than 20 minutes.
Using a JMH port of the SOR benchmark[1] to home in on this, we observe excessive L1 cache misses, excessive cache coherency traffic, and (using -prof perfasm) that most of the CPU time is spent updating two method invocation counters:
0.51% 2.73% 0x00007f8f8cc1a60f: vmovsd %xmm0,0x10(%r8,%r14,8) ;*dastore
; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@128 (line 47)
0.31% 3.01% 0x00007f8f8cc1a616: mov $0x7f8f88bcc900,%rax ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
0.68% 0.30% 0x00007f8f8cc1a620: mov 0xe0(%rax),%edx
17.21% 17.03% 0x00007f8f8cc1a626: add $0x8,%edx
1.24% 0.28% 0x00007f8f8cc1a629: mov %edx,0xe0(%rax)
1.03% 0.97% 0x00007f8f8cc1a62f: mov $0x7f8f88bcc718,%rax ; {metadata({method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
0.50% 0.04% 0x00007f8f8cc1a639: and $0xfff8,%edx
0.00% 0.07% 0x00007f8f8cc1a63f: cmp $0x0,%edx
0x00007f8f8cc1a642: je 0x00007f8f8cc1a95f ; OopMap{r8=Oop r9=Oop r13=Oop [88]=Oop off=456}
;*goto
; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@132 (line 46)
0.02% 0.09% 0x00007f8f8cc1a648: test %eax,0x15568ab2(%rip) # 0x00007f8fa2183100
; {poll}
0.11% 0.90% 0x00007f8f8cc1a64e: mov $0x7f8f88bcc900,%rax ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
2.26% 0.62% 0x00007f8f8cc1a658: incl 0x168(%rax)
38.96% 11.75% 0x00007f8f8cc1a65e: mov %rsi,%r14 ;*goto
; - org.openjdk.bench.closed.app.javagrande.section2.benchclasses.SOR::SORrun@132 (line 46)
0.24% 0.04% 0x00007f8f8cc1a661: cmp %ebx,%r14d
0.00% 0.06% 0x00007f8f8cc1a664: mov $0x7f8f88bcc900,%rdx ; {metadata(method data for {method} {0x00007f8f88bcc718} 'SORrun' '(D[[DI)V' in 'org/openjdk/bench/closed/app/javagrande/section2/benchclasses/SOR')}
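The contention pattern in the disassembly above can be mimicked in plain Java. The following is a hypothetical micro-sketch (not the JMH benchmark from [1], and the class name is invented): many threads hammering one shared counter word, which is roughly what the level-3 code does to the per-method MethodData counters. Note the real profile counters are updated with plain, racy load/add/store sequences; the AtomicLong here merely makes the cache-line sharing explicit:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SharedCounterSketch {

    // All threads increment one shared word, so the cache line holding it
    // bounces between cores on every update - analogous to the shared
    // invocation/backedge counters in the disassembly above.
    static long run(int threads, int perThread) {
        final AtomicLong sharedCounter = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    sharedCounter.incrementAndGet(); // forces exclusive cache-line ownership
                }
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sharedCounter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 100_000)); // 800000
    }
}
```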
Testing with JDK 8u60 on a dual-socket Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (2x8x2) system shows these approximate numbers when running the way SPECjvm2008 does (twice as many Java threads as hardware threads, i.e., java -jar benchmarks.jar .*SOR.* -t 64):
-XX:TieredStopAtLevel=0: 12500 ms/op
-XX:TieredStopAtLevel=1: 52 ms/op
-XX:TieredStopAtLevel=2: 1000 ms/op
-XX:TieredStopAtLevel=3: 8400 ms/op # 160x compared to Level=1
-XX:TieredStopAtLevel=4: 28 ms/op
-Xint: 655 ms/op # like StopAtLevel=0 but without profiling, 20x faster
On a single-socket Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz system (1x2x2) running with similar oversubscription (8 threads):
-XX:TieredStopAtLevel=0: 645 ms/op
-XX:TieredStopAtLevel=1: 24 ms/op
-XX:TieredStopAtLevel=2: 69 ms/op
-XX:TieredStopAtLevel=3: 185 ms/op
-XX:TieredStopAtLevel=4: 17 ms/op
-Xint: 305 ms/op
This may be an extreme case, but there are indications that it contributes to increased run-to-run variance on a variety of SPECjvm2008 benchmarks, and that it could be hurting time-to-performance and variance on a wide range of workloads.
In the absence of a "branch-on-random" hardware instruction[2], we should consider solutions that avoid direct updates to shared state, such as striping the (hottest) invocation counters across cores/sockets or queuing up writes in a thread-local buffer. The footprint impact needs to be considered, and some users may need a way to disable whatever mechanism is implemented.
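As a hedged illustration of the striping idea (not a proposed HotSpot change; the class name is invented), java.util.concurrent.atomic.LongAdder already applies it at the library level: each contending thread increments its own cell, and the total is summed on read, trading footprint for reduced coherency traffic:

```java
import java.util.concurrent.atomic.LongAdder;

public class StripedCounterSketch {

    // Same workload as a single shared counter, but LongAdder spreads the
    // increments across per-thread cells, so hot updates rarely contend.
    static long run(int threads, int perThread) {
        final LongAdder counter = new LongAdder();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment(); // hits a thread-local cell, not one shared word
                }
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        // sum() aggregates the cells; exact once all writers have joined
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(run(8, 100_000)); // 800000
    }
}
```

The footprint trade-off mentioned above shows up directly here: LongAdder allocates extra cells under contention, which is why any in-VM analogue would likely need to be limited to the hottest counters or be switchable off.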
ILW = Worst case may cause long delays in program warmup and execution (M), possibly affects time-to-performance and run-time variance on a large set of benchmarks (M), disabling TieredCompilation may or may not help (M) = MMM = P3
[1] http://cr.openjdk.java.net/~redestad/scratch/sor-jmh/
[2] http://dl.acm.org/citation.cfm?id=1356070