-
Bug
-
Resolution: Fixed
-
P4
-
9, 10, 11, 12, 13
-
b20
-
generic
-
linux
## Symptom
A ~15% performance degradation (from 700 ops/m to 600 ops/m) was observed randomly on x86 while running SPECjvm2008's scimark.monte_carlo with -XX:-TieredCompilation.
## Reproduce
It can always be reproduced with the script[1] in less than 5 minutes.
## Reason
The drop was caused by a decision not to inline spec.benchmarks.scimark.utils.Random::<init> into spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate.
Inlining output when the performance drop occurs:
-----------------------------------------------------
336 71 spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate (68 bytes)
    @ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) call site not reached
  s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
  s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------
Inlining output when there is no performance drop:
-----------------------------------------------------
368 71 spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate (68 bytes)
    @ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) inline (hot)
      @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
      @ 49 spec.benchmarks.scimark.utils.Random::initialize (125 bytes) inline (hot)
        @ 14 java.lang.Math::abs (11 bytes) executed < MinInliningThreshold times
        @ 19 java.lang.Math::min (11 bytes) (intrinsic)
  s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
  s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------
The not-inline decision was made by the heuristic here[2].
It was designed to skip inlining of unreached callsites, based on profile.count=0 only.
For callers with loops, however, profile.count=0 for a callsite may be incorrect and misleading.
Inline decisions based only on misleading profile info may lead to poorly optimized compiled code.
The performance drop of scimark.monte_carlo was exactly such a case.
The code of spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate:
-----------------------------------------------------
public final double integrate(int numSamples) {
    Random R = new Random(SEED);
    int underCurve = 0;
    for (int count = 0; count < numSamples; count++) {
        double x = R.nextDouble();
        double y = R.nextDouble();
        if (x*x + y*y <= 1.0) {
            underCurve++;
        }
    }
    return ((double) underCurve / numSamples) * 4.0;
}
-----------------------------------------------------
The profile info when the performance drop happened:
-----------------------------------------------------
0 new 7 <spec/benchmarks/scimark/utils/Random>
3 dup
4 bipush 113
6 invokespecial 8 <spec/benchmarks/scimark/utils/Random.<init>(I)V>
0 bci: 6 CounterData count(0)
9 astore_2
10 iconst_0
11 istore_3
12 iconst_0
13 istore #4
15 fast_iload #4
17 iload_1
18 if_icmpge 58
16 bci: 18 BranchData trap(intrinsic_or_type_checked_inlining recompiled) taken(1) displacement(200)
not taken(57586)
21 aload_2
22 invokevirtual 9 <spec/benchmarks/scimark/utils/Random.nextDouble()D>
48 bci: 22 VirtualCallData count(60212) nonprofiled_count(0) entries(0)
method_entries(0)
25 dstore #5
27 aload_2
28 invokevirtual 9 <spec/benchmarks/scimark/utils/Random.nextDouble()D>
104 bci: 28 VirtualCallData count(54941) nonprofiled_count(0) entries(0)
method_entries(0)
31 dstore #7
33 dload #5
35 dload #5
37 dmul
38 dload #7
40 dload #7
42 dmul
43 dadd
44 dconst_1
45 dcmpg
46 ifgt 52
160 bci: 46 BranchData taken(16747) displacement(32)
not taken(46866)
49 iinc #3 1
52 iinc #4 1
55 goto 15
192 bci: 55 JumpData taken(58368) displacement(-176)
58 iload_3
59 i2d
60 iload_1
61 i2d
62 ddiv
63 ldc2_w 4.000000
66 dmul
67 dreturn
method data for {method} {0x00007f3881132558} 'integrate' '(I)D' in 'spec/benchmarks/scimark/monte_carlo/MonteCarlo'
0 bci: 6 CounterData count(0)
16 bci: 18 BranchData trap(intrinsic_or_type_checked_inlining recompiled) taken(1) displacement(200)
not taken(57586)
48 bci: 22 VirtualCallData count(60212) nonprofiled_count(0) entries(0)
method_entries(0)
104 bci: 28 VirtualCallData count(54941) nonprofiled_count(0) entries(0)
method_entries(0)
160 bci: 46 BranchData taken(16747) displacement(32)
not taken(46866)
192 bci: 55 JumpData taken(58368) displacement(-176)
--- Extra data:
264 bci: 0 ArgInfoData 0x0 0x0
@ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) call site not reached
s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------
The profile.count=0 at bci:6 was clearly incorrect, since the callsite is always reached in the caller.
Profiling started while the caller was already executing its loop, so the callsite at bci:6, which is outside the loop, had no chance to be profiled before the compilation was triggered.
The callsite simply kept its initial state of profile.count=0, which should not be interpreted as unreached.
So for callers with loops, it may be misleading to make inline decisions based on profile.count=0 only.
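The effect can be illustrated with a toy model. The sketch below is not HotSpot code: the class LateProfilingSketch, the method simulate and the parameter profilingStartsAtIteration are hypothetical names, and the counters are a plain map keyed by bci rather than real method data. It only assumes, as described above, that counting begins at some point after the caller has already entered its loop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LateProfilingSketch {

    // Hypothetical stand-in for per-callsite profile counters, keyed by bci.
    // A negative profilingStartsAtIteration means profiling is on from method entry.
    static Map<Integer, Long> simulate(int numSamples, int profilingStartsAtIteration) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        counts.put(6, 0L);   // Random::<init>, before the loop
        counts.put(22, 0L);  // first Random::nextDouble, inside the loop
        counts.put(28, 0L);  // second Random::nextDouble, inside the loop

        boolean profiling = profilingStartsAtIteration < 0;

        // Callsite at bci 6 executes exactly once, before the loop.
        if (profiling) {
            counts.merge(6, 1L, Long::sum);
        }

        for (int count = 0; count < numSamples; count++) {
            // Assumption: profiling switches on in the middle of the loop.
            if (count == profilingStartsAtIteration) {
                profiling = true;
            }
            if (profiling) {
                counts.merge(22, 1L, Long::sum);
                counts.merge(28, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Profiling starts late: bci 6 keeps its initial count of 0.
        System.out.println(simulate(100_000, 40_000)); // {6=0, 22=60000, 28=60000}
        // Profiling from method entry: bci 6 is counted once.
        System.out.println(simulate(100_000, -1));     // {6=1, 22=100000, 28=100000}
    }
}
```

In the first run the callsite at bci 6 was executed, yet its counter is still 0, which is the same shape as the CounterData count(0) entry in the profile dump.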
## Fix
A small change to the inline heuristic[2] is suggested.
For callers without loops, the original heuristic works fine.
But for callers with loops, a not-inline decision should be made more conservatively.
To fix this issue, a patch has been proposed:
http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.00/
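The idea of a more conservative decision can be sketched as follows. This is only an illustration of the reasoning above, not the content of the patch and not the C++ code in bytecodeInfo.cpp; InlineHeuristicSketch, Decision, countOnly and loopAware are hypothetical names, and the callerHasLoops flag is assumed to be available to the decision.

```java
public class InlineHeuristicSketch {

    enum Decision { INLINE_CANDIDATE, NOT_REACHED }

    // Count-only rule as described in the Reason section:
    // a zero profile count alone marks the callsite as not reached.
    static Decision countOnly(long callSiteCount) {
        return callSiteCount == 0 ? Decision.NOT_REACHED : Decision.INLINE_CANDIDATE;
    }

    // Hypothetical loop-aware variant: when the caller has loops, a zero
    // count may only mean "never profiled", so it is not treated as
    // evidence that the callsite is unreached.
    static Decision loopAware(long callSiteCount, boolean callerHasLoops) {
        if (callSiteCount > 0) {
            return Decision.INLINE_CANDIDATE;
        }
        if (callerHasLoops) {
            return Decision.INLINE_CANDIDATE;
        }
        return Decision.NOT_REACHED;
    }

    public static void main(String[] args) {
        // Random::<init> at bci 6 of integrate: count 0, caller has a loop.
        System.out.println(countOnly(0));        // NOT_REACHED
        System.out.println(loopAware(0, true));  // INLINE_CANDIDATE
        System.out.println(loopAware(0, false)); // NOT_REACHED
    }
}
```

Under this sketch, callers without loops behave exactly as with the count-only rule, while a callsite such as bci 6 in integrate remains eligible for the usual inlining checks.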
## Testing
- Ran scimark.monte_carlo on jdk/x64 with -XX:-TieredCompilation about 5000 times, no performance drop
  Also on jdk8u/mips64 with -XX:-TieredCompilation, no performance drop
- Ran make test TEST="micro" on jdk/x64, no performance regression
- Ran SPECjvm2008 on jdk8u/x64 with -XX:-TieredCompilation, no performance regression
[1] http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/reproduce.sh
[2] http://hg.openjdk.java.net/jdk/jdk/file/0a2d73e02076/src/hotspot/share/opto/bytecodeInfo.cpp#l375
- relates to
-
JDK-8224162 assert(profile.count() == 0) failed: sanity in InlineTree::is_not_reached
-
- Resolved
-