Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-6876794

sp07t002 hangs very intermittently

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Fixed
    • Icon: P4 P4
    • hs17
    • hs16
    • hotspot
    • b03
    • b02
    • generic
    • generic

        During my testing of my fix for 6419370, the following VM/NSK
        test hung:

            nsk/jvmti/scenarios/sampling/SP07/sp07t002

        on my Solaris X86 machine running the following config:

        java version "1.7.0-ea"
        Java(TM) SE Runtime Environment (build 1.7.0-ea-b70)
        Java HotSpot(TM) Client VM (build 17.0-b01-internal, mixed mode)

        I created a doit.sh script from the .tlog file and used my
        doit_loop script to reproduce the hang once:

        Results: 318 runs (1 FAILed, 317 PASSed)
        Failure details: hangs: 1
        Run duration: 3 hour(s) 51 second(s)

        I've attached the thread dump in threads.log.295 and a
        30 second truss snapshot in truss.log.295.

        Only two threads were active in the 30 second snapshot:

        % sed 's/[ ].*//' truss.log.295 | sort | uniq -c
           1 /1:
           1 /10:
           1 /11:
          43 /2:
           1 /3:
           1 /4:
           1 /5:
           1 /6:
           1 /7:
           1 /8:
         499 /9:

        Thread t@2 has the following stack snippet:

        t@2 (l@2) stopped in ___nanosleep at 0xfeec8095
        0xfeec8095: ___nanosleep+0x0015: jae ___nanosleep+0x23 [ 0xfeec80a3, .+0xe ]
        current thread: t@2
          [1] ___nanosleep(0x1), at 0xfeec8095
          [2] THREAD_sleep(0x1), at 0xfe413424
          [3] nsk_jvmti_sleep(0x3e8, 0x0), at 0xfe4166cd
          [4] syncDebugeeStatus(0x8066510, 0x8064b98, 0x0), at 0xfe4167fb
          [5] Java_nsk_share_jvmti_DebugeeClass_checkStatus(0x8066510, 0xfe55ebf0, 0x0,
        0xfe55ebc0, 0x1, 0xf81c10c0, 0xfe55ebc0, 0xf81c0e10, 0xfe55ebf8, 0xf81c17a0, 0x0, 0xf81c0e10, 0x0, 0xfe55ebf8, 0xfe55ec20, 0xfc202ea9, 0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38), at 0xfe4169b3
          [6] 0xfc20a23e(0xf81c10c0, 0xfc207fd9, 0x0, 0xf44f2f38, 0xfe55ec00, 0xf81c0a94), at 0xfc20a23e


        Thread t@9 has the following stack snippet:

        t@9 (l@9) stopped in __pollsys at 0xfeec8da5
        0xfeec8da5: __pollsys+0x0015: jb __cerror [ 0xfee405a0, .-0x88805
        ]
        current thread: t@9
        =>[1] __pollsys(0x0, 0x0, 0xf3e5ad60, 0x0), at 0xfeec8da5
          [2] _poll(0x0, 0x0, 0x32, 0xfe9ce7d6), at 0xfee6f502
          [3] os_sleep(0x32, 0x0, 0x0, 0xfe9ceaa2), at 0xfe9ce922
          [4] os::sleep(0x8116400, 0x32, 0x0, 0x0), at 0xfe9ceb40
          [5] WatcherThread::run(0x8116400, 0xfebfa000, 0xf3e5afd8, 0xfe9cb38e), at 0xfe6d820e


        It is interesting to see that the VMThread is blocked:

        t@3 (l@3) stopped in ___lwp_cond_wait at 0xfeec9455
        0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ]
        current thread: t@3
        =>[1] ___lwp_cond_wait(0x80f5448, 0x80f5430, 0x0, 0x1), at 0xfeec9455
          [2] os::PlatformEvent::park(0x80f5400, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
          [3] ParkCommon(0x80f5400, 0x0, 0x0, 0x1), at 0xfe9ba4cd
          [4] Monitor::ILock(0x81495a8, 0x80f4800, 0xfec13758, 0xfe9bad24), at 0xfe9ba657
          [5] Monitor::lock_without_safepoint_check(0x81495a8, 0x80f4800, 0x1, 0xfe9bad58), at 0xfe9bad39
          [6] Monitor::lock_without_safepoint_check(0x81495a8, 0x1, 0x3, 0xfea0186a), at 0xfe9bada3
          [7] ThreadSafepointState::examine_state_of_thread(0x8111fb0, 0xfebfa000, 0xf3ffeca8, 0xfe702eab), at 0xfea01895
          [8] SafepointSynchronize::begin(0x8067648, 0x8067620), at 0xfe702f2a
          [9] VMThread::loop(0x80f4800, 0x7f, 0x0, 0xfe6b471a), at 0xfe6b4b38
          [10] VMThread::run(0x80f4800, 0xfebfa000, 0xf3ffefd8, 0xfe9cb38e), at 0xfe6b47a1


        Not sure if it means anything, but the first address passed
        to the Monitor::ILock() call above appears in the stack traces
        for the following threads:

        t@10 (l@10) stopped in ___lwp_cond_wait at 0xfeec9455
        0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ]
        current thread: t@10
        =>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455
          [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22
          [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd
          [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21
          [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce
          [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at
        0xfeaaeebc
          [7] JavaThread::check_safepoint_and_suspend_for_native_trans(0x815a800, 0x815a910, 0xf3dd7538, 0xfeaaef3c), at 0xfe715b25
          [8] JavaThread::check_special_condition_for_native_trans(0x815a800), at 0xfeaaef4e
          [9] 0xfc27b4ff(0xfe6dbf47, 0xf3dd74f8, 0xfebfa000, 0xf3dd759c, 0xf4093398, 0xfc27a588), at 0xfc27b4ff


        t@11 (l@11) stopped in ___lwp_cond_wait at 0xfeec9455
        0xfeec9455: ___lwp_cond_wait+0x0015: jae ___lwp_cond_wait+0x23 [ 0xfeec9463, .+0xe ]
        current thread: t@11
        =>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455
          [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
          [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd
          [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657
          [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39
          [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3
          [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7
          [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4
          [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd
          [10] JvmtiEnvBase::is_thread_fully_suspended(0x815a800, 0x1, 0xf3d88b08, 0xfe9365e8), at 0xfe93dc44
          [11] JvmtiEnv::GetStackTrace(0x8064b98, 0x815a800, 0x0, 0x400, 0xfe438500), at 0xfe936607
          [12] jvmti_GetStackTrace(0x8064b98, 0x815ce24, 0x0, 0x400, 0xfe438500, 0xfe435b38), at 0xfe9114ea
          [13] checkStackTrace(0x8064b98, 0x815bd10), at 0xfe41a932
          [14] agentProc(0x8064b98, 0x815bd10, 0x0), at 0xfe41acf9
          [15] agentThreadWrapper(0x8064b98, 0x815bd10, 0x0, 0xfe9532b2), at 0xfe4162a3
          [16] JvmtiAgentThread::call_start_function(0x815bc00, 0xfee646de, 0xfef3d5d0,
        0xfe953288), at 0xfe953374
          [17] JvmtiAgentThread::start_function_wrapper(0x815bc00), at 0xfe95329a
          [18] JavaThread::thread_main_inner(0x815bc00, 0x0, 0x0, 0xfe6c2dbe), at 0xfe6c2fa9


        It looks like the VMThread (t@3) is trying to begin a
        safepoint and is blocked trying to examine the state
        of another thread in a Monitor::ILock(0x81495a8) call.

        Thread t@10 is trying to suspend itself and is calling
        Monitor::IWait(0x81495a8).

        thread t@11 is waiting for a suspend to complete, calls
        Monitor::wait(0x81495a8) and blocks on the pending safepoint.
        Uh oh....

        I suspect that t@11 is still holding the lock it is trying
        to wait() on. Look at the t@10 waiting sequence:

        =>[1] ___lwp_cond_wait(0x815b448, 0x815b430, 0x0, 0x1), at 0xfeec9455
          [2] os::PlatformEvent::park(0x815b400, 0x81495a8, 0x0, 0xfe9ba482), at 0xfe9d0a22
          [3] ParkCommon(0x815b400, 0x0, 0x0, 0xfe9ba98a), at 0xfe9ba4cd
          [4] Monitor::IWait(0x81495a8, 0x815a800, 0x0, 0x0), at 0xfe9baa21
          [5] Monitor::wait(0x81495a8, 0x1, 0x0, 0x0), at 0xfe9bb2ce
          [6] JavaThread::java_suspend_self(0x815a800, 0x0, 0xfebfa000, 0xfe715aa5), at
        0xfeaaeebc

        In t@10 we've gotten all the way down to lwp_cond_wait().

        Look again at the t@11 waiting sequence:

        =>[1] ___lwp_cond_wait(0x815ca48, 0x815ca30, 0x0, 0x1), at 0xfeec9455
          [2] os::PlatformEvent::park(0x815ca00, 0x0, 0x0, 0xfe9ba482), at 0xfe9d0a22
          [3] ParkCommon(0x815ca00, 0x0, 0x0, 0x1), at 0xfe9ba4cd
          [4] Monitor::ILock(0x80658c0, 0x815bc00, 0x1, 0xfe9bad24), at 0xfe9ba657
          [5] Monitor::lock_without_safepoint_check(0x80658c0, 0x815bc00, 0x0, 0xfe9bad58), at 0xfe9bad39
          [6] Monitor::lock_without_safepoint_check(0x80658c0, 0xfebfa000, 0xf3d88984, 0xfe70324a), at 0xfe9bada3
          [7] SafepointSynchronize::block(0x815bc00, 0x815bc00, 0x5, 0x0), at 0xfe7033a7
          [8] Monitor::wait(0x81495a8, 0x0, 0x5, 0x0), at 0xfe9bb2a4
          [9] JavaThread::wait_for_ext_suspend_completion(0x815a800, 0x32, 0x5, 0xf3d88b08), at 0xfeaac3dd

        Our Monitor::wait(0x81495a8) didn't call Monitor:IWait(...)
        and get all the way down to lwp_cond_wait(). However, it
        did call SafepointSynchronize::block()...

        I'll have to check the code to see if these traces make sense...
        I did an overnight doit_loop run with baseline JDK7-B70:

            java version "1.7.0-ea"
            Java(TM) SE Runtime Environment (build 1.7.0-ea-b70)
            Java HotSpot(TM) Client VM (build 16.0-b07, mixed mode)

            Java HotSpot(TM) Client VM (16.0-b07) for solaris-x86 JRE (1.7.0-ea-b70), built on Aug 20 2009 03:38:34 by "" with Workshop 5.9

        Here's the summary:

        Results: 1083 runs (4 FAILed, 1079 PASSed)
        Failure details: hangs: 4
        Run duration: 9 hour(s) 37 minute(s) 16 second(s)

        The truss logs are similar in that only threads t@2 and t@9
        show activity. The thread dumps have the same function names,
        but different address (as expected). I've attached all the
        logs for the record.
        I did a search of nightly testing results. Here are some
        hangs of this test observed in RT_Baseline:

        http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-19/RT_Baseline/vm/solaris-amd64/server/mixed/solaris-amd64_server_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

        So the config is Solaris AMD64 Server VM -Xmixed.

        http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-08-10/RT_Baseline/vm/solaris-sparcv9/server/comp/solaris-sparcv9_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

        So the config is Solaris SPARC-64 Server VM -Xcomp.

        http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-06-16/RT_Baseline/vm/linux-amd64/server/comp/linux-amd64_server_comp_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

        So the config is Linux AMD64 Server VM -Xcomp.

        http://sqeweb.sfbay/nfs/results/vm/gtee/JDK7/NIGHTLY/VM/2009-04-21/RT_Baseline/vm/windows-i586/client/mixed/windows-i586_client_mixed_nsk.jvmti.testlist/ResultDir/sp07t002/sp07t002.log

        So the config is Windows X86 Client VM -Xmixed.

        I'm changing my call record to generic/generic.
        Data from service_hs nightly analysis 2008.12.24 (final)

        Unexpected Timeouts
        -------------------

        nsk.jvmti
            nsk/jvmti/scenarios/sampling/SP07/sp07t002
                2008.11.25 Win32 Client VM (machine colfax005)
                2008.11.05 Win32 Server VM (machine dantooine)
                2008.10.29 Solaris AMD64 Server VM (machine vm-x86-1)
                2008.06.18 Solaris SPARC-64 Server VM (machine jtg-s111)
                2008.05.15 Solaris X86 Client VM (machine vm-v20z-30)
                2008.02.04 Win32 Client VM (machine vm-v20z-19)
        Data from service_hs nightly analysis 2006.11.09 (partial)

        URL: http://gtee.sfbay/gtee/results/MUSTANG/NIGHTLY/VM-MAIN/2006-11-09/Serv_Baseline/
        VM Build: 20061109061001.dcubed.service_hs_b03_merge.1 (new VM)
        Last main/baseline putback: 20061109050230.ik199011.rt_merge
        JDK: Dolphin-B02 (new JDK)
        Testbase: v160r08 (same testbase as previous version)

        Known nsk.jvmti UFOYOYOs:
            nsk/jvmti/scenarios/sampling/SP07/sp07t002
                Last failure on 2006.11.06 with Solaris SPARC-64 Server VM.
                Previous failure on 2006.10.31 with Solaris SPARC Server VM.

        The fix for 6470570 was putback to rt_baseline on 2006.09.26.
        rt_baseline was pushed to main/baseline on 2006.09.27.
        main/baseline was pushed to serv_hs_baseline on 2006.09.27.

        In the 2006.09.27 nightly report, the sp07t002 entry looked like

            nsk/jvmti/scenarios/sampling/SP07/sp07t002
                Last failure on 2005.10.04 with Linux IA32 Client VM.
                Previous failure on 2005.10.01 with Linux IA32 Client VM.

        A new hang was first noticed on 2006.10.31:

        * nsk/jvmti/scenarios/sampling/SP07/sp07t002
                Last failure on 2006.10.31 with Solaris SPARC Server VM.
                Previous failure on 2005.10.04 with Linux IA32 Client VM.

        I don't know whether I just missed any earlier hangs or not.

              dcubed Daniel Daugherty
              dcubed Daniel Daugherty
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved:
                Imported:
                Indexed: