Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8252090

JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)

    XMLWordPrintable

Details

    • jfr
    • b14
    • x86_64
    • linux

    Backports

      Description

        ADDITIONAL SYSTEM INFORMATION :
        $ /usr/lib/jvm/jdk-13/bin/java -version
        openjdk version "13.0.1" 2019-10-15
        OpenJDK Runtime Environment (build 13.0.1+9)
        OpenJDK 64-Bit Server VM (build 13.0.1+9, mixed mode, sharing)

        $ uname -a
        Linux <hostname> 5.1.5-050105-generic #201905251333 SMP Wed Jun 19 09:49:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux



        While the OpenJDK we use is somewhat outdated, the write_unbuffered() code is mostly unchanged and still seem to have the same issue.
        https://github.com/openjdk/jdk/blob/51c8ac362e051c5e1978e1d696f537dbc0b29e03/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L112-L116

        A DESCRIPTION OF THE PROBLEM :
        We encountered an issue where JVM becomes completely unresponsive. Log indicates that the all threads except "JFR Recorder Thread" are paused in a safepoint, waiting for the JFR thread to reach the safepoint.

        This is a second time we saw this issue. The first time was in January 2020.

        ```
        [2020-08-16T14:32:56.366+0900][warning][safepoint ]
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Timeout detected:
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Threads which did not reach the safepoint:
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] # "JFR Recorder Thread" #16 daemon prio=5 os_prio=0 cpu=5866422.36ms elapsed=6491407.13s tid=0x00007f285ee519b0 nid=0x465d runnable [0x0000000000000000]
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] java.lang.Thread.State: RUNNABLE
        [2020-08-16T14:32:56.366+0900][warning][safepoint ]
        [2020-08-16T14:32:56.366+0900][warning][safepoint ] #
        SafepointSynchronize::begin: (End of list)
        ```

        The JFR Recorder Thread was using 100% CPU. Here's the backtrace.

        ```
        (gdb) bt
        #0 0x00007f28637782b7 in write () from /lib/x86_64-linux-gnu/libpthread.so.0
        #1 0x00007f2862790f19 in StreamWriterHost<MallocAdapter<1048576ul>, JfrCHeapObj>::write_unbuffered(void const*, unsigned long) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #2 0x00007f28627908c2 in JfrStorage::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #3 0x00007f2862783a93 in JfrRecorderService::pre_safepoint_write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #4 0x00007f2862784696 in JfrRecorderService::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #5 0x00007f2862785c8b in JfrRecorderService::rotate(int) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #6 0x00007f28627865b2 in recorderthread_entry(JavaThread*, Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #7 0x00007f2862c76382 in JavaThread::thread_main_inner() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #8 0x00007f2862c7afed in Thread::call_run() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #9 0x00007f2862acd867 in thread_native_entry(Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
        #10 0x00007f286376e6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
        #11 0x00007f286329388f in clone () from /lib/x86_64-linux-gnu/libc.so.6
        ```

        We confirmed that the thread is running [87eee0, 87ef28] (marked by \*) endlessly in the assembly list (with comments) below. The corresponding source code is also listed below.

        ```asm
        000000000087eea0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm>:
          87eea0: 55 push %rbp
          87eea1: 48 89 e5 mov %rsp,%rbp
          87eea4: 41 57 push %r15
          87eea6: 41 56 push %r14
          87eea8: 41 55 push %r13
          87eeaa: 49 89 fd mov %rdi,%r13
          87eead: 41 54 push %r12
          87eeaf: 49 89 d4 mov %rdx,%r12
          87eeb2: 53 push %rbx
          87eeb3: 48 83 ec 18 sub $0x18,%rsp
          87eeb7: 44 8b 7f 48 mov 0x48(%rdi),%r15d
          87eebb: 48 89 75 c8 mov %rsi,-0x38(%rbp)
          87eebf: 41 83 ff ff cmp $0xffffffff,%r15d
          87eec3: 74 0c je 87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
          87eec5: 4c 8b 37 mov (%rdi),%r14
          87eec8: 48 8b 47 08 mov 0x8(%rdi),%rax
          87eecc: 4c 29 f0 sub %r14,%rax
          87eecf: 75 6f jne 87ef40 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xa0>
          87eed1: 4d 85 e4 test %r12,%r12
          87eed4: 74 54 je 87ef2a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x8a>
          87eed6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
          87eedd: 00 00 00
        * 87eee0: 44 89 e3 mov %r12d,%ebx ; p/x $r12 => 0xffffffff00000000 ; p/x $r12d => 0x0 ; n = len
        * 87eee3: 41 81 fc ff ff ff 7f cmp $0x7fffffff,%r12d
        * 87eeea: b8 ff ff ff 7f mov $0x7fffffff,%eax
        * 87eeef: 45 8b 75 48 mov 0x48(%r13),%r14d ; p $r14d => 3 ; this->_fd (/tmp/2020_06_02_11_22_49_17357/2020_08_16_14_31_56.part)
        * 87eef3: 48 0f 43 d8 cmovae %rax,%rbx ; n = INT_MAX if INT_MAX < (int) len ; p $rbx => 0
        * 87eef7: eb 11 jmp 87ef0a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x6a>
        * 87eef9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
        * 87ef00: e8 7b fb 9f ff callq 27ea80 <__errno_location@plt>
        * 87ef05: 83 38 04 cmpl $0x4,(%rax)
        * 87ef08: 75 17 jne 87ef21 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x81>
        * 87ef0a: 48 8b 75 c8 mov -0x38(%rbp),%rsi ; x/g $rbp-0x38 => 0x7f2637ffeb18: 0x00007f2838042290 ; buf of write(2)
        * 87ef0e: 48 89 da mov %rbx,%rdx ; p $rdx => 0 ; count of write(2)
        * 87ef11: 44 89 f7 mov %r14d,%edi
                                                ; write(3, 0x00007f2838042290, 0) #=> 0 has been executed repeatedly.
        * 87ef14: e8 77 fc 9f ff callq 27eb90 <write@plt>
        * 87ef19: 49 89 c7 mov %rax,%r15 ; p $rax => 0 ; ret of write(2)
        * 87ef1c: 83 f8 ff cmp $0xffffffff,%eax
        * 87ef1f: 74 df je 87ef00 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x60>
        * 87ef21: 4d 01 7d 40 add %r15,0x40(%r13)
        * 87ef25: 49 29 dc sub %rbx,%r12 ; p/x $r12 => 0xffffffff00000000 ; len -= n
                                                ; We keep jumping back to 87eee0 as the upper 32 bits of %r12 (len) never reaches 0.
        * 87ef28: 75 b6 jne 87eee0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x40>
          87ef2a: 48 83 c4 18 add $0x18,%rsp
          87ef2e: 5b pop %rbx
          87ef2f: 41 5c pop %r12
          87ef31: 41 5d pop %r13
          87ef33: 41 5e pop %r14
          87ef35: 41 5f pop %r15
          87ef37: 5d pop %rbp
          87ef38: c3 retq
          87ef39: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
          87ef40: 89 c0 mov %eax,%eax
          87ef42: 48 89 45 c0 mov %rax,-0x40(%rbp)
          87ef46: eb 12 jmp 87ef5a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xba>
          87ef48: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
          87ef4f: 00
          87ef50: e8 2b fb 9f ff callq 27ea80 <__errno_location@plt>
          87ef55: 83 38 04 cmpl $0x4,(%rax)
          87ef58: 75 17 jne 87ef71 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xd1>
          87ef5a: 48 8b 55 c0 mov -0x40(%rbp),%rdx
          87ef5e: 4c 89 f6 mov %r14,%rsi
          87ef61: 44 89 ff mov %r15d,%edi
          87ef64: e8 27 fc 9f ff callq 27eb90 <write@plt>
          87ef69: 48 89 c3 mov %rax,%rbx
          87ef6c: 83 f8 ff cmp $0xffffffff,%eax
          87ef6f: 74 df je 87ef50 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xb0>
          87ef71: 49 8b 45 00 mov 0x0(%r13),%rax
          87ef75: 49 01 5d 40 add %rbx,0x40(%r13)
          87ef79: 49 89 45 08 mov %rax,0x8(%r13)
          87ef7d: e9 4f ff ff ff jmpq 87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
        ```

        ```c++
        template <typename Adapter, typename AP>
        void StreamWriterHost<Adapter, AP>::write_unbuffered(const void* buf, size_t len) {
          this->flush();
          assert(0 == this->used_offset(), "can only seek from beginning");
          while (len > 0) {
            const int n = MIN2<int>((int)len, INT_MAX);
            _stream_pos += os::write(_fd, buf, n);
            len -= n;
          }
        }
        ```

        https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L108-L117

        Here are what we found:

        * write(3, 0x00007f2838042290, 0) => 0 is invoked endlessly.
        * The original `len` passed to write_unbuffered() was "negative" (`0xffffffffffffea74`). This value was recovered from `unflashed_size` at https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46. Fwiw, `t->pos()` was `0x7f2838040d04` and `current_top` was `0x7f2838042290`.

        While we were unsuccessful at root-causing why `len` went negative, we identified a bug in write_unbuffered() that caused an infinite loop:
        the loop in the function never ends if `len` has larger unsigned value than 0xffffffff, because `n = MIN2<int>((int)len, INT_MAX)` only sees the lower 32 bits of `len` and `n` becomes 0 when `len` still has non-zero upper bits.

        write_unbuffered() also seems to have other issues that could potentially corrupt JFR recording files:

        * os::write() can fail with -1.
        * `len` should be decremented by the *actual* number of bytes written by os::write() which may be less than `n`, to avoid data loss.
        * The current implementation could potentially write the same data repeatedly. `buf` must be advanced each time by the number of bytes written.

        So, the function probably should look something like (I didn't test this code):

        ```c++
        template <typename Adapter, typename AP>
        void StreamWriterHost<Adapter, AP>::write_unbuffered(void* buf, size_t len) {
          this->flush();
          assert(0 == this->used_offset(), "can only seek from beginning");
          while (len > 0) {
            // As os::write() only accepts unsigned int (rather than size_t), we need to split the input into
            // INT_MAX sized chunks. The size of each chunk `n` is `len` clamped to INT_MAX. We can't truncate
            // `len` before comparing it to INT_MAX.
            const unsigned int n = (unsigned int) MIN2<size_t>(len, INT_MAX);
            size_t nr_written = os::write(_fd, buf, n);
            if (nr_written == -1) {
              // TODO: error handling
              return;
            }
            _stream_pos += nr_written;
            len -= nr_written;
            buf += nr_written;
          }
        }
        ```

        Even if write_unbuffered() is fixed, "negative" `len` (i.e. length larger than actual `buf`) would still corrupt JFR recordings until os::write() eventually fails with -1 (errno = EFAULT) or some other errors.
        It might be good to add an assertion `t->pos() >= current_top` before calling write_unbuffered() to avoid this situation (especially, writing out a random memory location to a file is not good in terms of security).

        https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46


        FREQUENCY : rarely


        Attachments

          Issue Links

            Activity

              People

                mgronlun Markus Grönlund
                webbuggrp Webbug Group
                Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                  Created:
                  Updated:
                  Resolved: