Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8212933

Thread-SMR: requesting a VM operation whilst holding a ThreadsListHandle can cause deadlocks

XMLWordPrintable

    • b18
    • generic
    • generic

        Leonid Mesnik extended the Kitchensink stress testing with modules providing test coverage for the JVMTI and found a deadlock.
        Please, find a file with thread dumps in the attachments.

        This is some initial analysis from Robbin Enh:

        Robbin Enh:

        In short:
        # VM Thread
        VM Thread is in a loop, takes Threads_lock, takes a new snapshot of the Threads_list, scans the list and process handshakes on behalf of safe threads.
        Releases snapshot and Threads_lock and checks if all handshakes are completed

        # An exiting thread
        A thread exiting thread removes it self from _THE_ threads list, but must stick around if it is on any snapshots of alive. When it is no on any list it will cancel the handshake.

        Since VM thread during the handshake takes a new snapshot every iteration any exiting can proceed since it will not be a the new snapshot. Thus cancel the handshake and VM thread can exit the loop (if this was the last handshake).

        Constraint:
        If any thread grabs a snapshot of threads list and later tries to take a lock which is 'used' by VM Thread or inside the handshake we can deadlock.

        Considering that looking at e.g. : JvmtiEnv::SuspendThreadList
        Which calls VMThread::execute(&tsj); with a ThreadsListHandle alive, this could deadlock AFAICT. Since the thread will rest on VMOperationRequest_lock with a Threads list snapshot but VM thread cannot finishes handshake until that snapshot is released.

        I suggest first step is to add something like this patch below and fix the obvious ones first.

        Note, I have not verified that is the problem you are seeing, I'm saying that this seem to be real issue. And considering how the stack traces looks, it may be this.

        You want me going through this, just assign a bug if there is one?

        /Robbin

        diff -r 622fd3608374 src/hotspot/share/runtime/thread.hpp
        --- a/src/hotspot/share/runtime/thread.hpp Tue Oct 23 13:27:41 2018 +0200
        +++ b/src/hotspot/share/runtime/thread.hpp Wed Oct 24 09:13:17 2018 +0200
        @@ -167,2 +167,6 @@
           }
        + public:
        + bool have_threads_list();
        + private:
        +
           // This field is enabled via -XX:+EnableThreadSMRStatistics:
        diff -r 622fd3608374 src/hotspot/share/runtime/thread.inline.hpp
        --- a/src/hotspot/share/runtime/thread.inline.hpp Tue Oct 23 13:27:41 2018 +0200
        +++ b/src/hotspot/share/runtime/thread.inline.hpp Wed Oct 24 09:13:17 2018 +0200
        @@ -111,2 +111,6 @@

        +inline bool Thread::have_threads_list() {
        + return OrderAccess::load_acquire(&_threads_hazard_ptr) != NULL;
        +}
        +
         inline void Thread::set_threads_hazard_ptr(ThreadsList* new_list) {
        diff -r 622fd3608374 src/hotspot/share/runtime/vmThread.cpp
        --- a/src/hotspot/share/runtime/vmThread.cpp Tue Oct 23 13:27:41 2018 +0200
        +++ b/src/hotspot/share/runtime/vmThread.cpp Wed Oct 24 09:13:17 2018 +0200
        @@ -608,2 +608,3 @@
           if (!t->is_VM_thread()) {
        + assert(!t->have_threads_list(), "Deadlock if we have exiting threads and if vm thread is running an VM op."); // fatal/guarantee
             SkipGCALot sgcalot(t); // avoid re-entrant attempts to gc-a-lot

        David Holmes:

        Thanks Robbin! So you're no allowed to request a VM operation if you hold a ThreadsListHandle ? I suppose that is no different to not being able to request a VM operation whilst holding the Threads_lock.

        I suspect before ThreadSMR this may have been a case where we weren't ensuring a target thread could not terminate, and now with SMR we're ensuring that but potentially introducing a deadlock. I say potentially because obviously we don't deadlock every time we suspend threads.

        One more description from Robbin:

        1: VM Threads takes a thread list snapshot, Z
        2: Arm all threads on that list
        3: Save a count on how many are armed
        3: Release threads list
        -> 5: Takes Threads_lock
        6: Takes a new threads list snapshot, N+1
        7: Iterate over the threads, testing if the thread it self have done the
        handshake operation, or if the VM thread may processes it for the thread.
        8: Release threads list and Threads_lock
        9: Check if count is zero, else go to step 5.

        Thread X have already transition into VM and started the exit path,
        but is still present on threads list when snapshot Z is taken.

        Thread W takes a threads list snapshot which is the same snapshot as Z, meaning thread X is present in there.
        Thread W calls ::execute and waits on VMOperationRequest_lock for it's ticket to come up.

        Thread X grabs Threads_lock and removes it self from _the_ threads list.
        But since it present on threads W snapshot Z it must wait until that snapshot is released.

        The VM cannot see Thread X, since it takes a new list every iteration, N+1 will not contain X.

        Thread X cannot cancel it's handshake until it is off all threads list snapshots
        because it my be present on the list the VM thread holds.
        Thread X cannot processes it's handshake because it's off the threads list, e.g.
        GC cannot see it.

        While thread W holds a snapshot (z) of the threads list with X on it the handshake cannot be completed, and while VM thread is running the handshake it
        cannot processes W's VM op.

              rehn Robbin Ehn
              sspitsyn Serguei Spitsyn
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated:
                Resolved: