Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8028073

race condition in ObjectMonitor implementation causing deadlocks

XMLWordPrintable

    • b04
    • generic
    • generic

        The bug JDK-8014910 (and its dup: JDK-8015729) was fixed by applying
        the anti-delta to the fix of the JDK-8008962.
        In fact, this was just a work around the issue that remained in the ObjectMonitor implementation.
        The above fix lowered the deadlocks probability but they still have a potential to happen.
        We may have a long history of tests timeouts in the nightly caused by this issue.
        These timeouts are normally hard to investigate, they cause tests instability and slowdowns.

        Below is the email exchange on this issue and variants of fixes.

        Hi Dan,

        On 8/11/2013 12:15 PM, Daniel D. Daugherty wrote:
        > Been reading through these e-mails and the code for a couple of hours.
        > I concur that an early return when (_succ != NULL) is not 'safe' since
        > that successor might be parked. However, I think I would handle it
        > differently and clear the _succ field if that _succ was about to park
        > itself so that _succ selection was not cached across the park().

        I also suggested this then retracted it. There would be a race with clearing _succ that would cause the same hang as today (unless the check is also modified in some way).

        That said, another possibility would be that if a thread sees that it is _succ then it uses the timed-park instead of the untimed one in ReenterI:

                   if (SyncFlags & 1 || _succ == Self) {
                      Self->_ParkEvent->park ((jlong)1000) ;
                   } else {
                      Self->_ParkEvent->park () ;
                   }


        > Removing the '_succ != NULL' fragment below pretty much means that
        > the non-NULL _succ value will not keep you from doing all the work
        > that is below those code blocks in ObjectMonitor::exit(). I think
        > not doing that work is exactly what a non-NULL _succ was supposed
        > to help you do.

        I also considered that the "find the successor" logic could be short-circuited if _succ was already non-null. I don't think it can be critical to performance though because if it was often the case that _succ was non-null then we would hit this hang - and we don't. Hence it seems to me it must be rare to hit this case and so finding the next successor when we already have one will not impact overall performance.

        That said it may be that we can simply do the following:

         if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0) {
            return ;
         }
         if (_succ != NULL) {
            _succ->event->unpark();
            return;
         }
         // else we must set up the successor

        But I worry about making changes to an extremely complex protocol, especially when we seem to have found a condition that indicates that the detailed explanation of the successor protocol (lines 1003 - 1032) is in fact flawed.

        David
        -----

        >
        > Dan
        >
        >
        > On 11/6/13 2:31 AM, serguei.spitsyn@oracle.com wrote:
        >> On 11/5/13 8:49 PM, David Holmes wrote:
        >>> Hi Serguei,
        >>>
        >>> I don't agree with your fix Please read on ...
        >>>
        >>> This is very tricky code ...
        >>>
        >>> We have in essence:
        >>>
        >>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ != NULL)
        >>> return;
        >>>
        >>> So lets break this apart and consider each part one at a time:
        >>>
        >>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0)
        >>> // no one waiting for entry so nothing to do
        >>> return;
        >>>
        >>> No problem there. Otherwise we have at least one waiting thread
        >>> either in the _EntryList or the _cxq.
        >>
        >>> (Note if _succ != NULL the queues can not be empty!) So we
        >>> potentially have to wake a waiting thread up.
        >>
        >> Then there is no point to check the condition (_succ != NULL) as we
        >> already checked
        >> the condition (intptr_t(_EntryList)|intptr_t(_cxq)) == 0).
        >> I see below, you've already come to the conclusion that the _succ
        >> condition is not needed.
        >>
        >>
        >>> Now if _succ == NULL then we definitely have to wake something up as
        >>> we have to make it the new _succ! But if _succ != NULL then this has
        >>> already been done - the successor has been chosen (not 100% sure by
        >>> whom or when) so we "don't need to do that". In which case:
        >>>
        >>> if (_succ != NULL)
        >>> // someone already handled _succ
        >>> return;
        >>>
        >>> so the original code seems correct as it stands. But lets revisit the
        >>> "don't need to do that" part.
        >>>
        >>> The key to the above is understanding how _succ already came to be
        >>> set. So lets examine the scenario that was put on the open email
        >>> thread. Thread t1 did a wait(). Thread t2 did a notify and exit,
        >>> which unparked t1 after making it _succ. But thread t3 sneaks in and
        >>> grabs the monitor via the fast-path. So t1 re-parks itself but is
        >>> still _succ. Now t3 does the exit and it hits the logic we are
        >>> discussing and it will just return because _succ != NULL. But we get
        >>> a hang so this is "obviously" wrong so it should not have returned if
        >>> _succ != NULL. But we previously established it should also not
        >>> return if _succ == NULL (as the queues are not empty and a successor
        >>> must be appointed!).
        >>
        >> My fix was based on the same judgement as above.
        >> I agree, we can get rid of the _succ condition but it still can be
        >> incomplete due to other aspects.
        >>
        >>>
        >>> Based on that the "correct" form of the logic would seem to be simply:
        >>>
        >>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0)
        >>> // no one waiting for entry so nothing to do
        >>> return;
        >>>
        >>
        >> 1000 runs of the test with the fix above did not hang.
        >>
        >>> Aside: Arguably when the code later finds a successor it could check
        >>> that it isn't displacing the existing _succ thread (which would only
        >>> be a fairness consideration).
        >>>
        >>> Alternatively I think the existing code would also work if when t3
        >>> reparks it first clears itself from _succ. Or maybe t3 should have
        >>> become _Responsible and so done a timed-park? It is hard to say what
        >>> is the real underlying problem here because the protocol is so
        >>> complicated. Any of the suggested fixes may be incomplete due to
        >>> other aspects of that protocol.
        >>>
        >>> I'm assuming Dave Dice must be on vacation without email access,
        >>> otherwise I'm sure he would have jumped in on this just because it is
        >>> so much fun
        >>
        >> I hope, Dave will have this fun upon return from vacation.
        >>
        >>
        >> Thanks for the nice analysis, David!
        >> Serguei
        >>
        >>>
        >>> Cheers,
        >>> David
        >>>
        >>> On 6/11/2013 6:27 AM, serguei.spitsyn@oracle.com wrote:
        >>>>
        >>>> On 11/4/13 6:09 PM, serguei.spitsyn@oracle.com wrote:
        >>>>> Hi David and Dave,
        >>>>>
        >>>>> (Added Staffan and Dan to the cc-list)
        >>>>>
        >>>>> It looks like I managed to solate the issue to the following fix:
        >>>>
        >>>> Webrev for convenience:
        >>>> http://javaweb.sfbay.sun.com/java/svc/ss45998/webrevs/2013/hotspot/ObjectMonitor/
        >>>>
        >>>>
        >>>> The first fragment under the condition "if (Knob_ExitPolicy == 0)" is
        >>>> actual fix.
        >>>> The other fragment is fixed for consistency in a case the
        >>>> Knob_ExitPolicy is changed in the future.
        >>>>
        >>>> Thanks,
        >>>> Serguei
        >>>>
        >>>>>
        >>>>> % hg diff
        >>>>> diff -r 04df110c8655 src/share/vm/runtime/objectMonitor.cpp
        >>>>> --- a/src/share/vm/runtime/objectMonitor.cpp Sat Nov 02 20:56:18
        >>>>> 2013 +0100
        >>>>> +++ b/src/share/vm/runtime/objectMonitor.cpp Mon Nov 04 20:47:37
        >>>>> 2013 -0800
        >>>>> @@ -994,7 +994,7 @@ void ATTR ObjectMonitor::exit(bool not_s
        >>>>> // ST on x64, x86 and SPARC.
        >>>>> OrderAccess::release_store_ptr (&_owner, NULL) ; // drop the
        >>>>> lock
        >>>>> OrderAccess::storeload() ; // See if
        >>>>> we need to wake a successor
        >>>>> - if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
        >>>>> NULL) {
        >>>>> + if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 && _succ ==
        >>>>> NULL) {
        >>>>> TEVENT (Inflated exit - simple egress) ;
        >>>>> return ;
        >>>>> }
        >>>>> @@ -1042,7 +1042,7 @@ void ATTR ObjectMonitor::exit(bool not_s
        >>>>> }
        >>>>> TEVENT (Exit - Reacquired) ;
        >>>>> } else {
        >>>>> - if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
        >>>>> NULL) {
        >>>>> + if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 && _succ ==
        >>>>> NULL) {
        >>>>> OrderAccess::release_store_ptr (&_owner, NULL) ; //
        >>>>> drop the lock
        >>>>> OrderAccess::storeload() ;
        >>>>> // Ratify the previously observed values.
        >>>>>
        >>>>>
        >>>>> With this fix above the hang does not happen anymore (I test it with
        >>>>> the Ioi's fix rolled back).
        >>>>> The test passed in 2000 runs on Solaris-amd64.
        >>>>> I'm currently verifying this fix with the nsk.jvmti and nsk.jdi tests.
        >>>>> They look good so far.
        >>>>>
        >>>>> Additionally, it looks like the test MonitorEvents002 is executed
        >>>>> noticeably faster,
        >>>>> but need to measure it more carefully.
        >>>>>
        >>>>> The Ioi's fix of the JDK-8015729 was just a work around the existing
        >>>>> issue.
        >>>>> We may have other deadlocks in Nightlies that are caused by this
        >>>>> issue.
        >>>>>
        >>>>> I will file a bug on this if it is Ok with you.
        >>>>> Please, let me know your opinion and if you have any concerns.
        >>>>>
        >>>>> Thanks,
        >>>>> Serguei
        >>>>>
        >>>>>
        >>>>> On 10/29/13 10:49 AM, serguei.spitsyn@oracle.com wrote:
        >>>>>> On 10/29/13 4:19 AM, David Holmes wrote:
        >>>>>>> Dave,
        >>>>>>>
        >>>>>>> FYI this bug seems related to a query on the openlists about how
        >>>>>>> succ_ was used. I haven't had time to try and analyze it in detail.
        >>>>>>> Serguei has some comments below.
        >>>>>>>
        >>>>>>> Thanks,
        >>>>>>> David
        >>>>>>>
        >>>>>>>
        >>>>>>> -------- Original Message --------
        >>>>>>> Subject: [JBS] (JDK-8015729)
        >>>>>>> nsk/jdi/stress/MonitorEvents/MonitorEvents002 hangs
        >>>>>>> Date: Tue, 29 Oct 2013 10:23:33 +0000 (UTC)
        >>>>>>> From: Serguei Spitsyn (JBS) <do-not-reply@openjdk.java.net>
        >>>>>>> To: david.holmes@oracle.com
        >>>>>>>
        >>>>>>>
        >>>>>>> [
        >>>>>>> https://bugs.openjdk.java.net/browse/JDK-8015729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13421337#comment-13421337
        >>>>>>>
        >>>>>>> ]
        >>>>>>>
        >>>>>>> Serguei Spitsyn edited comment on JDK-8015729 at 10/29/13 10:22 AM:
        >>>>>>> --------------------------------------------------------------------
        >>>>>>>
        >>>>>>> Focusing on the following fragment:
        >>>>>>>
        >>>>>>> void ATTR ObjectMonitor::exit(bool not_suspended, TRAPS) {
        >>>>>>> . . .
        >>>>>>> for (;;) {
        >>>>>>> . . .
        >>>>>>> if (Knob_ExitPolicy == 0) {
        >>>>>>> . . .
        >>>>>>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
        >>>>>>> NULL) {
        >>>>>>> TEVENT (Inflated exit - simple egress) ;
        >>>>>>> return ;
        >>>>>>> }
        >>>>>>>
        >>>>>>> The line with "_succ != NULL" is a suspect.
        >>>>>>> Is a possible unparking of the _succ thread missed above?
        >>>>>>>
        >>>>>>>
        >>>>>>>
        >>>>>>> was (Author: sspitsyn):
        >>>>>>> Focusing on the following fragment:
        >>>>>>>
        >>>>>>> void ATTR ObjectMonitor::exit(bool not_suspended, TRAPS) {
        >>>>>>> . . .
        >>>>>>> for (;;) {
        >>>>>>> . . .
        >>>>>>> if (Knob_ExitPolicy == 0) {
        >>>>>>> . . .
        >>>>>>> if ((intptr_t(_EntryList)|intptr_t(_cxq)) == 0 || _succ !=
        >>>>>>> NULL) {
        >>>>>>> TEVENT (Inflated exit - simple egress) ;
        >>>>>>> return ;
        >>>>>>> }
        >>>>>>>
        >>>>>>> The line with "_succ != NULL" is a suspect.
        >>>>>>> Just wonder if the condition is correct. Should we replace the '||'
        >>>>>>> with '&&' ?
        >>>>>>
        >>>>>> Please, skip this comment.
        >>>>>> Most likely, the line is correct as the test hangs reliably with the
        >>>>>> replace of '||' to '&&'.
        >>>>>>
        >>>>>>
        >>>>>>> Is the unparking of the _succ thread missed above?
        >>>>>>
        >>>>>> This is a concern.
        >>>>>> It seems there is a synchronization whole allowing a race between
        >>>>>> the ObjectMonitor::exit (T#3) and re-parking of the notified
        >>>>>> thread T#1
        >>>>>> blocked in ObjectMonitor::ReenterI().
        >>>>>> We may need a better prove and details of what exactly happens in
        >>>>>> such hangs.
        >>>>>>
        >>>>>> Thanks,
        >>>>>> Serguei

          1. doit-jvmg.log
            13 kB
          2. StressMonitorWait.tgz
            7 kB
          3. successor.notes
            108 kB
          4. threads.list.jvmg
            84 kB

              dcubed Daniel Daugherty
              sspitsyn Serguei Spitsyn
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

                Created:
                Updated:
                Resolved: