Description
Sampling stacks from safepoints is a very safe way of sampling. All components of the JVM have been designed to be able to walk stacks from safepoints, and by walking only frames that are walkable, we are immune to trouble caused by guessed methods that quack like methods from a genuine stack trace and walk like methods from a genuine stack trace, but potentially explode later on due to use-after-free.
The classic problem with safepoint-based sampling is safepoint bias. Essentially, the trouble is that samples come from the points where we poll for safepoints, which might be many bytecodes away from where we were really spending statistically significant time.
I propose a hybrid safepoint and signal based solution. The idea is that we still shoot a signal at the samplee thread, but in the signal handler we only record the SP and PC of the thread. This request is then enqueued to be sampled on that same thread, at a subsequent safepoint poll site. When we get to that poll site, we check whether the recorded PC is from an nmethod. If it is, we can recreate the exact stack trace that we would normally have reported from the signal handler, from the safepoint poll site instead. So when we hit compiled methods, we get the benefits of signal based accuracy, combined with the fundamental safety of having the entire stack trace walked from a safe, walkable point in the JVM.
When the sampled PC isn't from an nmethod, I propose we perform the stack trace completely from the safepoint. As for any safepoint bias from the interpreter, it's rather straightforward to simply poll for safepoints in the dispatch loop of the interpreter, which eliminates safepoint bias as a problem for interpreted code. The original proposed patch for thread-local handshakes did exactly that and it worked absolutely fine.
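To make the control flow of the proposal concrete, here is a minimal toy model in C++. It is not HotSpot code and is not taken from the prototype: every type and function name (ToyThread, SampleRequest, on_signal, at_poll_site, resolve) is hypothetical, the "nmethod" check is just an address range, and matching the recorded SP against a frame on the stack is my assumption about how the sampled frame could be located again at the poll site.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: none of these names exist in HotSpot.
struct SampleRequest {
  std::uintptr_t sp;  // stack pointer captured in the signal handler
  std::uintptr_t pc;  // program counter captured in the signal handler
};

struct Frame {
  std::uintptr_t sp;
  std::uintptr_t pc;
  std::string method;
};

constexpr int kMaxPending = 8;  // fixed-size buffer, so recording never allocates

struct ToyThread {
  SampleRequest pending[kMaxPending];
  int pending_count = 0;
  std::vector<Frame> stack;   // innermost frame first, as seen at the poll site
  std::uintptr_t code_lo = 0; // pretend compiled-code range [code_lo, code_hi)
  std::uintptr_t code_hi = 0;
};

// "Signal handler": records SP and PC only, walks nothing.
inline void on_signal(ToyThread& t, std::uintptr_t sp, std::uintptr_t pc) {
  if (t.pending_count < kMaxPending) {
    t.pending[t.pending_count++] = SampleRequest{sp, pc};
  }
}

// Stand-in for "is this PC inside an nmethod?".
inline bool in_nmethod(const ToyThread& t, std::uintptr_t pc) {
  return pc >= t.code_lo && pc < t.code_hi;
}

// Resolve one request from the safe poll site.
inline std::vector<Frame> resolve(const ToyThread& t, const SampleRequest& r) {
  if (in_nmethod(t, r.pc)) {
    // Assumption: the sampled frame is still on the stack and can be
    // identified by its SP. Report it with the recorded PC, plus its callers.
    for (std::size_t i = 0; i < t.stack.size(); ++i) {
      if (t.stack[i].sp == r.sp) {
        std::vector<Frame> trace(
            t.stack.begin() + static_cast<std::ptrdiff_t>(i), t.stack.end());
        trace.front().pc = r.pc;
        return trace;
      }
    }
  }
  // PC not in compiled code (or frame not found): plain safepoint walk.
  return t.stack;
}

// "Safepoint poll site": drain all pending requests on the sampled thread.
inline std::vector<std::vector<Frame>> at_poll_site(ToyThread& t) {
  assert(t.pending_count <= kMaxPending);
  std::vector<std::vector<Frame>> traces;
  for (int i = 0; i < t.pending_count; ++i) {
    traces.push_back(resolve(t, t.pending[i]));
  }
  t.pending_count = 0;
  return traces;
}
```

The point of the split is that on_signal touches nothing but two integers, while all frame walking happens in at_poll_site, where the stack is known to be walkable. A real implementation would additionally have to deal with inlining, deoptimization and signal-safe publication of the request, none of which the toy attempts.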
I have a prototype for the suggested changes available here: https://github.com/fisk/jdk/tree/jfr_safe_trace_v1
Issue Links
- blocks
  - JDK-8316239 JFR: fatal error: refcount has gone to zero (Open)
  - JDK-8302350 JfrThreadSampler failed with "assert((is_native() && bci == 0) || (!is_native() && 0 <= bci && bci < code_size())) failed: illegal bci: 0 for non-native method" (Open)