JDK-4987970

IA64 - ATG hang on IA64 linux with NPTL


    • itanium
    • linux_redhat_3.0

      I investigated a bigapp failure on jtg-it4.sfbay in which the ATG
      server hung. The machine runs Red Hat kernel 2.4.21-4.EL (AS3?).
      This looks to me like a bug in NPTL, but it is possible that it is
      runtime related. In any case, the runtime could possibly work around
      the issue.

      The system hung while coming to a safepoint. The stack of the vm thread
      looks like:

      #0 0x200000000021c782 in sched_yield () from /lib/tls/libc.so.6.1
      #1 0x2000000000f56910 in os::pd_suspend_thread () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #2 0x200000000108ec80 in Thread::do_vm_suspend () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #3 0x200000000109e440 in Thread::vm_suspend () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #4 0x2000000001045f30 in ThreadSafepointState::examine_state_of_thread () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #5 0x2000000001044d00 in SafepointSynchronize::begin () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #6 0x200000000112b7f0 in VMThread::loop () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #7 0x200000000112a9f0 in VMThread::run () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #8 0x2000000000f52f00 in _start () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #9 0x200000000005c510 in start_thread () from /lib/tls/libpthread.so.0

      By investigating the frames I was able to find the thread it was
      trying to suspend. That thread has this stack trace:

      #0 0x2000000000067111 in __lll_lock_wait () from /lib/tls/libpthread.so.0
      #1 0x200000000005fa80 in pthread_mutex_lock () from /lib/tls/libpthread.so.0
      #2 0x2000000000f57e00 in os::Linux::safe_mutex_lock () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #3 0x2000000000f1ce20 in Mutex::lock_without_safepoint_check () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #4 0x20000000010457c0 in SafepointSynchronize::block () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #5 0x200000000102f0c0 in OptoRuntime::complete_monitor_locking_C () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #6 0x2000000004682c70 in ?? ()

      This thread is in fact trying to come to a safepoint. The code in question
      is here:

          case _thread_in_vm_trans:
          case _thread_in_Java: // From compiled code

            // We are highly likely to block on the Safepoint_lock. In order to avoid blocking in this case,
            // we pretend we are still in the VM.
            thread->set_thread_state(_thread_in_vm);

            // We will always be holding the Safepoint_lock when we are examining the state
            // of a thread. Hence, the instructions between the Safepoint_lock->lock() and
            // Safepoint_lock->unlock() are happening atomically with regards to the safepoint code
            Safepoint_lock->lock_without_safepoint_check();
             ^^^^^^ BLOCKED AT THIS CALL
            if (is_synchronizing()) {
              // Decrement the number of threads to wait for and signal vm thread
              assert(_waiting_to_block > 0, "sanity check");
              _waiting_to_block--;
              thread->set_has_called_back(true);
              Safepoint_lock->notify_all();
            }

            // We transition the thread to state _thread_blocked here, but
            // we can't do our usual check for external suspension and then
            // self-suspend after the lock_without_safepoint_check() call
            // below because we are often called during transitions while
            // we hold different locks. That would leave us suspended while
            // holding a resource which results in deadlocks.
            thread->set_thread_state(_thread_blocked);
            Safepoint_lock->unlock();


      There are only two ways we could get stuck here forever: either the
      vm thread never relinquishes the lock during its loop of examining
      threads, or the lock mechanism is broken and this thread didn't get
      to acquire the lock when it was freed. So I investigated which was
      happening. By placing breakpoints in code the vm thread was
      executing, I could see that it never released the Safepoint_lock.
      This is because it is trapped in an infinite loop. The vm thread
      stack trace once again looked like:

      #0 0x200000000021c782 in sched_yield () from /lib/tls/libc.so.6.1
      #1 0x2000000000f56910 in os::pd_suspend_thread () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #2 0x200000000108ec80 in Thread::do_vm_suspend () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #3 0x200000000109e440 in Thread::vm_suspend () from /usr/j2se/jre/lib/ia64/server/libjvm.so
      #4 0x2000000001045f30 in ThreadSafepointState::examine_state_of_thread () from /usr/j2se/jre/lib/ia64/server/libjvm.so
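
The hang described above has the shape of a lock held across an unbounded wait: one thread spins on a flag while holding the lock that the other thread needs. The following is a minimal model of that shape, not HotSpot code. The function name, the `std::timed_mutex` standing in for Safepoint_lock, the `suspended` flag, and the `abort_wait` escape hatch are all assumptions made for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Returns true if the second thread acquires the lock within `timeout`.
// `ack_arrives` models whether the suspend acknowledgement is ever seen.
bool blocked_thread_gets_lock(bool ack_arrives, std::chrono::milliseconds timeout) {
  std::timed_mutex safepoint_lock;           // hypothetical stand-in for Safepoint_lock
  std::atomic<bool> suspended{ack_arrives};  // hypothetical stand-in for is_suspended()
  std::atomic<bool> abort_wait{false};       // escape hatch so the model can terminate
  std::atomic<bool> lock_taken{false};

  // Models the vm thread: takes the lock, then waits for the acknowledgement
  // without ever releasing the lock in between.
  std::thread vm_thread([&] {
    std::lock_guard<std::timed_mutex> hold(safepoint_lock);
    lock_taken.store(true);
    while (!suspended.load() && !abort_wait.load()) {
      std::this_thread::yield();
    }
  });

  // Models the thread stuck in pthread_mutex_lock(): it can only proceed
  // once the vm thread lets go of the lock.
  while (!lock_taken.load()) {
    std::this_thread::yield();
  }
  bool acquired = safepoint_lock.try_lock_for(timeout);
  if (acquired) {
    safepoint_lock.unlock();
  }

  abort_wait.store(true);
  vm_thread.join();
  return acquired;
}
```

Calling `blocked_thread_gets_lock(false, std::chrono::milliseconds(100))` models the case in this report: the acknowledgement never arrives, so the lock attempt times out. With `true` the spinning thread leaves its loop at once and the lock is acquired.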

      We never return from pd_suspend_thread. That code looks like:
      int os::pd_suspend_thread(Thread* thread, bool fence) {
          int ret;
          OSThread *osthread = thread->osthread();

          if (fence) {
            ThreadCritical tc;
            ret = do_suspend(osthread, SR_SUSPEND);
          } else {
            ret = do_suspend(osthread, SR_SUSPEND);
          }
          return ret;
      }

      That code doesn't even loop or call yield. It seems the compiler has
      gotten fancy and inlined do_suspend(), which looks like:
      int do_suspend(OSThread* osthread, int action) {
        int ret;

        // set suspend action and send signal
        osthread->sr.set_suspend_action(action);

        ret = pthread_kill(osthread->pthread_id(), SR_signum);
        // check return code and wait for notification

        if (ret == 0) {
          for (int i = 0; !osthread->sr.is_suspended(); i++) {
            os::yield_all(i);
          }
        }

        osthread->sr.set_suspend_action(SR_NONE);
        return ret;
      }

      The for loop is essentially infinite if the signal SR_signum is dropped.
      I suspect that is what has happened here. The other alternative is that
      is_suspended is not volatile, but the generated code in fact does ld4.acq
      on every iteration of the loop, so that appears to be ok. The only other
      possibility is that we get the signal and don't set the flag. This is
      pretty old code and has worked ok for some time, so I doubt that is the
      case. I'm more suspicious of a bug in signal delivery, possibly Itanium
      specific.
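
To illustrate why the wait loop depends entirely on signal delivery, here is a minimal sketch of a bounded variant that polls the flag, resends the signal a few times, and then reports failure instead of spinning forever. It is not the HotSpot implementation and not a proposed fix. Every name in it (`g_suspended`, `suspend_handler`, `suspend_with_retry`, `run_handshake`), the retry counts, and the use of SIGUSR1 are assumptions. The handler only sets a flag, whereas a real suspend handler would also park the thread. Passing signal number 0 makes pthread_kill() do error checking without sending anything, which is used here to model a dropped signal.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>
#include <time.h>

// All names below are illustrative stand-ins, not HotSpot identifiers.
static std::atomic<int> g_suspended{0};  // set by the signal handler
static std::atomic<int> g_exit{0};       // tells the target thread to finish

// Runs on the target thread; a lock-free atomic store is async-signal-safe.
extern "C" void suspend_handler(int) { g_suspended.store(1); }

static void sleep_ms(long ms) {  // ms must be below 1000
  timespec ts;
  ts.tv_sec = 0;
  ts.tv_nsec = ms * 1000000L;
  nanosleep(&ts, nullptr);
}

// Stand-in for a running Java thread.
extern "C" void* target_main(void*) {
  while (!g_exit.load()) sleep_ms(1);
  return nullptr;
}

// Bounded wait: poll the flag, resend the signal up to max_sends times,
// and return false if no acknowledgement is ever observed.
bool suspend_with_retry(pthread_t target, int signo, int max_sends, int polls_per_send) {
  for (int s = 0; s < max_sends; s++) {
    if (pthread_kill(target, signo) != 0) return false;  // invalid thread or signal
    for (int i = 0; i < polls_per_send; i++) {
      if (g_suspended.load()) return true;  // handler ran: acknowledged
      sleep_ms(1);
    }
  }
  return g_suspended.load() != 0;
}

// Runs one handshake against a fresh thread. With signo == 0 nothing is
// delivered, so the handler never runs and the wait has to give up.
bool run_handshake(int signo) {
  struct sigaction sa {};
  sa.sa_handler = suspend_handler;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGUSR1, &sa, nullptr);

  g_suspended.store(0);
  g_exit.store(0);
  pthread_t tid;
  if (pthread_create(&tid, nullptr, target_main, nullptr) != 0) return false;
  bool ok = suspend_with_retry(tid, signo, 3, 50);
  g_exit.store(1);
  pthread_join(tid, nullptr);
  return ok;
}
```

`run_handshake(SIGUSR1)` is expected to see the acknowledgement, while `run_handshake(0)` gives up after the retries are exhausted. The flag is a `std::atomic` so that each poll reloads it from memory, which plays the same role as the ld4.acq noted above.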


      ###@###.### 2004-03-11
      I reproduced the problem several times using the ATG test suite on Itanium RedHat AS 3.0/2.1.
      So far I haven't seen the problem if LD_ASSUME_KERNEL is set to 2.4.1.
      ###@###.### 10/5/04 20:01 GMT
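
When comparing runs with and without LD_ASSUME_KERNEL, it can help to confirm which threading library the process actually loaded. The sketch below assumes a glibc that provides the `_CS_GNU_LIBPTHREAD_VERSION` confstr() name and falls back to "unknown" elsewhere. The helper names `pthread_library_version` and `is_nptl` are hypothetical.

```cpp
#include <string>
#include <unistd.h>

// Returns the threading library string reported by glibc, or "unknown"
// when the confstr() name is not available on this platform.
std::string pthread_library_version() {
#ifdef _CS_GNU_LIBPTHREAD_VERSION
  char buf[128];
  size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof buf);
  if (n > 0 && n <= sizeof buf) {
    return std::string(buf);
  }
#endif
  return "unknown";
}

// True when the reported string starts with "NPTL".
bool is_nptl(const std::string& version) {
  return version.compare(0, 4, "NPTL") == 0;
}
```

Logging `pthread_library_version()` at startup of a small test program run under the same environment would show whether a given LD_ASSUME_KERNEL setting moved the process off NPTL.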

            bobv Bob Vandette (Inactive)
            sgoldman Steve Goldman (Inactive)