Type: Bug
Resolution: Fixed
Priority: P4
Fix Version/s: 13
Affects Version/s: 9, 10, 11, 12, 13
Component/s: hotspot
Labels:

Subcomponent: compiler
Resolved In Build: b20
CPU: generic
OS: linux

## Symptom
A performance degradation of about 15% (from 700 ops/m to 600 ops/m) was observed intermittently on x86 while running SPECjvm2008's scimark.monte_carlo with -XX:-TieredCompilation.

## Reproduce
It can always be reproduced with the script [1] in less than 5 minutes.

## Reason
The drop was caused by a not-inline decision on spec.benchmarks.scimark.utils.Random::<init> in spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate.

If performance drop occurs:
-----------------------------------------------------
336 71 spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate (68 bytes)
                        @ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) call site not reached
          s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
          s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------

If no performance drop:
-----------------------------------------------------
368 71 spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate (68 bytes)
                        @ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) inline (hot)
                          @ 1 java.lang.Object::<init> (1 bytes) inline (hot)
                          @ 49 spec.benchmarks.scimark.utils.Random::initialize (125 bytes) inline (hot)
                            @ 14 java.lang.Math::abs (11 bytes) executed < MinInliningThreshold times
                            @ 19 java.lang.Math::min (11 bytes) (intrinsic)
          s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
          s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------

The not-inline decision was made by a heuristic here [2].
It was designed not to inline unreached callsites, and it judges a callsite unreached based on profile.count=0 only.

For callers with loops, profile.count=0 for a callsite may be incorrect and misleading.
Inline decisions based only on misleading profile info may lead to poorly optimized compiled code.
The performance drop of scimark.monte_carlo was exactly this case.

The code of spec.benchmarks.scimark.monte_carlo.MonteCarlo::integrate
-----------------------------------------------------
    public final double integrate(int numSamples) {

        Random R = new Random(SEED);

        int underCurve = 0;
        for (int count = 0; count < numSamples; count++) {

            double x = R.nextDouble();
            double y = R.nextDouble();

            if ( x*x + y*y <= 1.0) {
                underCurve ++;
            }
        }
        return ((double) underCurve / numSamples) * 4.0;
    }
-----------------------------------------------------

The profile info when the performance drop happened
-----------------------------------------------------
0 new 7 <spec/benchmarks/scimark/utils/Random>
3 dup
4 bipush 113
6 invokespecial 8 <spec/benchmarks/scimark/utils/Random.<init>(I)V>
  0 bci: 6 CounterData count(0)
9 astore_2
10 iconst_0
11 istore_3
12 iconst_0
13 istore #4
15 fast_iload #4
17 iload_1
18 if_icmpge 58
  16 bci: 18 BranchData trap(intrinsic_or_type_checked_inlining recompiled) taken(1) displacement(200)
                                    not taken(57586)
21 aload_2
22 invokevirtual 9 <spec/benchmarks/scimark/utils/Random.nextDouble()D>
  48 bci: 22 VirtualCallData count(60212) nonprofiled_count(0) entries(0)
                                    method_entries(0)
25 dstore #5
27 aload_2
28 invokevirtual 9 <spec/benchmarks/scimark/utils/Random.nextDouble()D>
  104 bci: 28 VirtualCallData count(54941) nonprofiled_count(0) entries(0)
                                    method_entries(0)
31 dstore #7
33 dload #5
35 dload #5
37 dmul
38 dload #7
40 dload #7
42 dmul
43 dadd
44 dconst_1
45 dcmpg
46 ifgt 52
  160 bci: 46 BranchData taken(16747) displacement(32)
                                    not taken(46866)
49 iinc #3 1
52 iinc #4 1
55 goto 15
  192 bci: 55 JumpData taken(58368) displacement(-176)
58 iload_3
59 i2d
60 iload_1
61 i2d
62 ddiv
63 ldc2_w 4.000000
66 dmul
67 dreturn
method data for {method} {0x00007f3881132558} 'integrate' '(I)D' in 'spec/benchmarks/scimark/monte_carlo/MonteCarlo'
0 bci: 6 CounterData count(0)
16 bci: 18 BranchData trap(intrinsic_or_type_checked_inlining recompiled) taken(1) displacement(200)
                                    not taken(57586)
48 bci: 22 VirtualCallData count(60212) nonprofiled_count(0) entries(0)
                                    method_entries(0)
104 bci: 28 VirtualCallData count(54941) nonprofiled_count(0) entries(0)
                                    method_entries(0)
160 bci: 46 BranchData taken(16747) displacement(32)
                                    not taken(46866)
192 bci: 55 JumpData taken(58368) displacement(-176)
--- Extra data:
264 bci: 0 ArgInfoData 0x0 0x0
                            @ 6 spec.benchmarks.scimark.utils.Random::<init> (53 bytes) call site not reached
              s @ 22 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
              s @ 28 spec.benchmarks.scimark.utils.Random::nextDouble (124 bytes) inline (hot)
-----------------------------------------------------

The profile.count=0 (at bci:6) was clearly incorrect, since the callsite is always reached in the caller.
Profiling started while the caller was already inside its loop, so the callsite at bci:6, which is outside the loop, had no chance to be profiled before the compilation was triggered.
The callsite simply kept its initial state of profile.count=0, which should not be interpreted as unreached.

So for callers with loops, it may be misleading to make inline decisions based on profile.count=0 only.
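
The effect described above can be illustrated with a toy model. This is only a sketch, not HotSpot code: the class name `LateProfilingModel`, the `WARMUP_ITERATIONS` constant, and the idea that profiling switches on after a fixed number of loop iterations are all assumptions made for illustration. The bci values 6, 22 and 28 are taken from the profile dump above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model, not HotSpot code: an "interpreter" that only begins recording
// callsite counts after a warm-up number of loop iterations.
final class LateProfilingModel {

    // Hypothetical warm-up length; stands in for the point at which
    // profiling begins in the real VM.
    static final int WARMUP_ITERATIONS = 1_000;

    // Simulates one invocation of a method shaped like integrate():
    // one call before the loop (bci 6), two calls per iteration (bci 22, 28).
    static Map<Integer, Integer> simulate(int numSamples) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        counts.put(6, 0);
        counts.put(22, 0);
        counts.put(28, 0);

        boolean profiling = false;

        // bci 6: the constructor call executes exactly once, but profiling
        // is still off, so nothing is recorded for it.
        if (profiling) {
            counts.merge(6, 1, Integer::sum);
        }

        for (int i = 0; i < numSamples; i++) {
            // Profiling switches on part-way through the loop.
            if (i == WARMUP_ITERATIONS) {
                profiling = true;
            }
            if (profiling) {
                counts.merge(22, 1, Integer::sum);
                counts.merge(28, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // bci 6 stays at 0 even though the call was executed.
        System.out.println(simulate(60_000));
    }
}
```

In this model the callsite at bci 6 ends with a count of 0 while the two in-loop callsites have large counts, which matches the shape of the profile dump: a zero count here means "never observed", not "never executed".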

## Fix
It might be better to make a little change to the inline heuristic[2].

For callers without loops, the original heuristic works fine.
But for callers with loops, it would be better to make a not-inline decision more conservatively.
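
One possible reading of "more conservatively" is sketched below. This is an illustrative model only: the class and method names are hypothetical, the real heuristic lives in C++ in bytecodeInfo.cpp [2] and checks more conditions than shown here, and the proposed patch may differ from this sketch.

```java
// Illustrative model only; names are hypothetical and this is not the patch.
final class NotReachedHeuristic {

    // Behaviour as described in this report: a zero profile count alone
    // is enough to treat the callsite as not reached.
    static boolean notReachedOriginal(int profileCount) {
        return profileCount == 0;
    }

    // More conservative variant (an assumption about the intent of the fix):
    // in a caller with loops, a zero count may only mean the callsite was
    // never profiled, so it is not rejected on that basis alone.
    static boolean notReachedConservative(int profileCount, boolean callerHasLoops) {
        if (profileCount > 0) {
            return false; // reached according to the profile
        }
        if (callerHasLoops) {
            return false; // zero count is unreliable here
        }
        return true; // no loops: keep the original decision
    }

    public static void main(String[] args) {
        // integrate(): count(0) at bci 6, and the caller contains a loop.
        System.out.println(notReachedOriginal(0));
        System.out.println(notReachedConservative(0, true));
    }
}
```

For the integrate case (count 0 at bci 6, caller with a loop) the original rule reports the callsite as not reached, while the conservative variant does not, so Random::<init> would remain a candidate for inlining.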

To fix this issue, a patch has been proposed:
  http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/webrev.00/

## Testing
- Ran scimark.monte_carlo on jdk/x64 with -XX:-TieredCompilation about 5000 times: no performance drop.
  Also ran it on jdk8u/mips64 with -XX:-TieredCompilation: no performance drop.
- Ran make test TEST="micro" on jdk/x64: no performance regression.
- Ran SPECjvm2008 on jdk8u/x64 with -XX:-TieredCompilation: no performance regression.

[1] http://cr.openjdk.java.net/~jiefu/monte_carlo-perf-drop/reproduce.sh
[2] http://hg.openjdk.java.net/jdk/jdk/file/0a2d73e02076/src/hotspot/share/opto/bytecodeInfo.cpp#l375

relates to

JDK-8224162 assert(profile.count() == 0) failed: sanity in InlineTree::is_not_reached

Resolved