Details
-
Bug
-
Resolution: Fixed
-
P4
-
13
-
b14
-
x86_64
-
linux
Backports
Issue | Fix Version | Assignee | Priority | Status | Resolution | Resolved In Build |
---|---|---|---|---|---|---|
JDK-8265650 | 15.0.4 | Ekaterina Vergizova | P4 | Resolved | Fixed | b01 |
JDK-8258793 | 13.0.6 | Ekaterina Vergizova | P4 | Resolved | Fixed | b04 |
JDK-8255195 | 11.0.10-oracle | Harold Seigel | P4 | Resolved | Fixed | b03 |
JDK-8255605 | 11.0.10 | Markus Grönlund | P4 | Resolved | Fixed | b01 |
Description
$ /usr/lib/jvm/jdk-13/bin/java -version
openjdk version "13.0.1" 2019-10-15
OpenJDK Runtime Environment (build 13.0.1+9)
OpenJDK 64-Bit Server VM (build 13.0.1+9, mixed mode, sharing)
$ uname -a
Linux <hostname> 5.1.5-050105-generic #201905251333 SMP Wed Jun 19 09:49:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
While the OpenJDK we use is somewhat outdated, the write_unbuffered() code is mostly unchanged and still seem to have the same issue.
https://github.com/openjdk/jdk/blob/51c8ac362e051c5e1978e1d696f537dbc0b29e03/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L112-L116
A DESCRIPTION OF THE PROBLEM :
We encountered an issue where JVM becomes completely unresponsive. Log indicates that the all threads except "JFR Recorder Thread" are paused in a safepoint, waiting for the JFR thread to reach the safepoint.
This is a second time we saw this issue. The first time was in January 2020.
```
[2020-08-16T14:32:56.366+0900][warning][safepoint ]
[2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Timeout detected:
[2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Timed out while spinning to reach a safepoint.
[2020-08-16T14:32:56.366+0900][warning][safepoint ] # SafepointSynchronize::begin: Threads which did not reach the safepoint:
[2020-08-16T14:32:56.366+0900][warning][safepoint ] # "JFR Recorder Thread" #16 daemon prio=5 os_prio=0 cpu=5866422.36ms elapsed=6491407.13s tid=0x00007f285ee519b0 nid=0x465d runnable [0x0000000000000000]
[2020-08-16T14:32:56.366+0900][warning][safepoint ] java.lang.Thread.State: RUNNABLE
[2020-08-16T14:32:56.366+0900][warning][safepoint ]
[2020-08-16T14:32:56.366+0900][warning][safepoint ] #
SafepointSynchronize::begin: (End of list)
```
The JFR Recorder Thread was using 100% CPU. Here's the backtrace.
```
(gdb) bt
#0 0x00007f28637782b7 in write () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 0x00007f2862790f19 in StreamWriterHost<MallocAdapter<1048576ul>, JfrCHeapObj>::write_unbuffered(void const*, unsigned long) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#2 0x00007f28627908c2 in JfrStorage::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#3 0x00007f2862783a93 in JfrRecorderService::pre_safepoint_write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#4 0x00007f2862784696 in JfrRecorderService::write() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#5 0x00007f2862785c8b in JfrRecorderService::rotate(int) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#6 0x00007f28627865b2 in recorderthread_entry(JavaThread*, Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#7 0x00007f2862c76382 in JavaThread::thread_main_inner() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#8 0x00007f2862c7afed in Thread::call_run() () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#9 0x00007f2862acd867 in thread_native_entry(Thread*) () from /usr/lib/jvm/jdk-13.0.1/lib/server/libjvm.so
#10 0x00007f286376e6db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#11 0x00007f286329388f in clone () from /lib/x86_64-linux-gnu/libc.so.6
```
We confirmed that the thread is running [87eee0, 87ef28] (marked by \*) endlessly in the assembly list (with comments) below. The corresponding source code is also listed below.
```asm
000000000087eea0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm>:
87eea0: 55 push %rbp
87eea1: 48 89 e5 mov %rsp,%rbp
87eea4: 41 57 push %r15
87eea6: 41 56 push %r14
87eea8: 41 55 push %r13
87eeaa: 49 89 fd mov %rdi,%r13
87eead: 41 54 push %r12
87eeaf: 49 89 d4 mov %rdx,%r12
87eeb2: 53 push %rbx
87eeb3: 48 83 ec 18 sub $0x18,%rsp
87eeb7: 44 8b 7f 48 mov 0x48(%rdi),%r15d
87eebb: 48 89 75 c8 mov %rsi,-0x38(%rbp)
87eebf: 41 83 ff ff cmp $0xffffffff,%r15d
87eec3: 74 0c je 87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
87eec5: 4c 8b 37 mov (%rdi),%r14
87eec8: 48 8b 47 08 mov 0x8(%rdi),%rax
87eecc: 4c 29 f0 sub %r14,%rax
87eecf: 75 6f jne 87ef40 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xa0>
87eed1: 4d 85 e4 test %r12,%r12
87eed4: 74 54 je 87ef2a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x8a>
87eed6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
87eedd: 00 00 00
* 87eee0: 44 89 e3 mov %r12d,%ebx ; p/x $r12 => 0xffffffff00000000 ; p/x $r12d => 0x0 ; n = len
* 87eee3: 41 81 fc ff ff ff 7f cmp $0x7fffffff,%r12d
* 87eeea: b8 ff ff ff 7f mov $0x7fffffff,%eax
* 87eeef: 45 8b 75 48 mov 0x48(%r13),%r14d ; p $r14d => 3 ; this->_fd (/tmp/2020_06_02_11_22_49_17357/2020_08_16_14_31_56.part)
* 87eef3: 48 0f 43 d8 cmovae %rax,%rbx ; n = INT_MAX if INT_MAX < (int) len ; p $rbx => 0
* 87eef7: eb 11 jmp 87ef0a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x6a>
* 87eef9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
* 87ef00: e8 7b fb 9f ff callq 27ea80 <__errno_location@plt>
* 87ef05: 83 38 04 cmpl $0x4,(%rax)
* 87ef08: 75 17 jne 87ef21 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x81>
* 87ef0a: 48 8b 75 c8 mov -0x38(%rbp),%rsi ; x/g $rbp-0x38 => 0x7f2637ffeb18: 0x00007f2838042290 ; buf of write(2)
* 87ef0e: 48 89 da mov %rbx,%rdx ; p $rdx => 0 ; count of write(2)
* 87ef11: 44 89 f7 mov %r14d,%edi
; write(3, 0x00007f2838042290, 0) #=> 0 has been executed repeatedly.
* 87ef14: e8 77 fc 9f ff callq 27eb90 <write@plt>
* 87ef19: 49 89 c7 mov %rax,%r15 ; p $rax => 0 ; ret of write(2)
* 87ef1c: 83 f8 ff cmp $0xffffffff,%eax
* 87ef1f: 74 df je 87ef00 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x60>
* 87ef21: 4d 01 7d 40 add %r15,0x40(%r13)
* 87ef25: 49 29 dc sub %rbx,%r12 ; p/x $r12 => 0xffffffff00000000 ; len -= n
; We keep jumping back to 87eee0 as the upper 32 bits of %r12 (len) never reaches 0.
* 87ef28: 75 b6 jne 87eee0 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x40>
87ef2a: 48 83 c4 18 add $0x18,%rsp
87ef2e: 5b pop %rbx
87ef2f: 41 5c pop %r12
87ef31: 41 5d pop %r13
87ef33: 41 5e pop %r14
87ef35: 41 5f pop %r15
87ef37: 5d pop %rbp
87ef38: c3 retq
87ef39: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
87ef40: 89 c0 mov %eax,%eax
87ef42: 48 89 45 c0 mov %rax,-0x40(%rbp)
87ef46: eb 12 jmp 87ef5a <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xba>
87ef48: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
87ef4f: 00
87ef50: e8 2b fb 9f ff callq 27ea80 <__errno_location@plt>
87ef55: 83 38 04 cmpl $0x4,(%rax)
87ef58: 75 17 jne 87ef71 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xd1>
87ef5a: 48 8b 55 c0 mov -0x40(%rbp),%rdx
87ef5e: 4c 89 f6 mov %r14,%rsi
87ef61: 44 89 ff mov %r15d,%edi
87ef64: e8 27 fc 9f ff callq 27eb90 <write@plt>
87ef69: 48 89 c3 mov %rax,%rbx
87ef6c: 83 f8 ff cmp $0xffffffff,%eax
87ef6f: 74 df je 87ef50 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0xb0>
87ef71: 49 8b 45 00 mov 0x0(%r13),%rax
87ef75: 49 01 5d 40 add %rbx,0x40(%r13)
87ef79: 49 89 45 08 mov %rax,0x8(%r13)
87ef7d: e9 4f ff ff ff jmpq 87eed1 <_ZN16StreamWriterHostI13MallocAdapterILm1048576EE11JfrCHeapObjE16write_unbufferedEPKvm+0x31>
```
```c++
template <typename Adapter, typename AP>
void StreamWriterHost<Adapter, AP>::write_unbuffered(const void* buf, size_t len) {
this->flush();
assert(0 == this->used_offset(), "can only seek from beginning");
while (len > 0) {
const int n = MIN2<int>((int)len, INT_MAX);
_stream_pos += os::write(_fd, buf, n);
len -= n;
}
}
```
https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/writers/jfrStreamWriterHost.inline.hpp#L108-L117
Here are what we found:
* write(3, 0x00007f2838042290, 0) => 0 is invoked endlessly.
* The original `len` passed to write_unbuffered() was "negative" (`0xffffffffffffea74`). This value was recovered from `unflashed_size` at https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46. Fwiw, `t->pos()` was `0x7f2838040d04` and `current_top` was `0x7f2838042290`.
While we were unsuccessful at root-causing why `len` went negative, we identified a bug in write_unbuffered() that caused an infinite loop:
the loop in the function never ends if `len` has larger unsigned value than 0xffffffff, because `n = MIN2<int>((int)len, INT_MAX)` only sees the lower 32 bits of `len` and `n` becomes 0 when `len` still has non-zero upper bits.
write_unbuffered() also seems to have other issues that could potentially corrupt JFR recording files:
* os::write() can fail with -1.
* `len` should be decremented by the *actual* number of bytes written by os::write() which may be less than `n`, to avoid data loss.
* The current implementation could potentially write the same data repeatedly. `buf` must be advanced each time by the number of bytes written.
So, the function probably should look something like (I didn't test this code):
```c++
template <typename Adapter, typename AP>
void StreamWriterHost<Adapter, AP>::write_unbuffered(void* buf, size_t len) {
this->flush();
assert(0 == this->used_offset(), "can only seek from beginning");
while (len > 0) {
// As os::write() only accepts unsigned int (rather than size_t), we need to split the input into
// INT_MAX sized chunks. The size of each chunk `n` is `len` clamped to INT_MAX. We can't truncate
// `len` before comparing it to INT_MAX.
const unsigned int n = (unsigned int) MIN2<size_t>(len, INT_MAX);
size_t nr_written = os::write(_fd, buf, n);
if (nr_written == -1) {
// TODO: error handling
return;
}
_stream_pos += nr_written;
len -= nr_written;
buf += nr_written;
}
}
```
Even if write_unbuffered() is fixed, "negative" `len` (i.e. length larger than actual `buf`) would still corrupt JFR recordings until os::write() eventually fails with -1 (errno = EFAULT) or some other errors.
It might be good to add an assertion `t->pos() >= current_top` before calling write_unbuffered() to avoid this situation (especially, writing out a random memory location to a file is not good in terms of security).
https://github.com/openjdk/jdk/blob/jdk-13%2B9/src/hotspot/share/jfr/recorder/storage/jfrStorageUtils.inline.hpp#L46
FREQUENCY : rarely
Attachments
Issue Links
- backported by
-
JDK-8255195 JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)
- Resolved
-
JDK-8255605 JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)
- Resolved
-
JDK-8258793 JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)
- Resolved
-
JDK-8265650 JFR: StreamWriterHost::write_unbuffered() stucks in an infinite loop OpenJDK (build 13.0.1+9)
- Resolved
- links to
-
Commit openjdk/jdk13u-dev/ead24ad7
-
Commit openjdk/jdk15u-dev/6d6a0fef
-
Review openjdk/jdk13u-dev/66
-
Review openjdk/jdk15u-dev/30