*Problem Description*
It's possible for multiple processes to race writing chunks to the same snapshot file path. This problem was repeatedly encountered by an engineer running various tests with JFR enabled.
Chunks in such snapshots may be out of chronological order due to the race.
In our case, this is what was happening:
- ProcessA is the process we expect to create the snapshot and ProcessB is unexpectedly racing with JFR in ProcessA.
- While ProcessA is recording, ProcessB starts and stops a very short JFR recording and writes the chunks to the given snapshot file path.
- At the same time, ProcessA is wrapping up recording.
- After ProcessA is finished recording, it writes all the chunks to the same filename. This data is appended to the file after the ProcessB's chunks.
- This snapshot is bad, but JMC nor the JFR consumer code will not indicate there is a problem
*Steps to Reproduce*
I was able to reproduce this scenario with a few simple steps.To reproduce, open 2 terminals. Run java -XX:+FlightRecorder -XX:StartFlightRecording=dumponexit=true,filename=rec.jfr SimpleBusyWait in terminal_A (https://gist.github.com/roberttoyonaga/44d24612b88d6844213cf1e5252c23b1). Wait a few seconds, then run it again in the second terminal_B. Use ctrl+c to stop terminal_B (this writes ProcessB's chunks to the file path). Then use ctrl+C on terminal_A to finish the rec.jfr shapshot.
If you inspect the chunk headers, you'll notice that the first chunk has startNanos after the second chunk.
You can also use this app (https://gist.github.com/roberttoyonaga/64e4f82da8e14ea7d75684e4cf6e2267) to reproduce the problem all in a single process.
This case is interesting because you get repeated chunks: chunk_2, chunk_1, chunk_2, chunk_3. Chunk_2's true position should be between chunk_1 and chunk_3, but we also see it appear at the very front, which is wrong.
*Potential Fix*
It could be argued that this problem is due to user error. But I think it would be helpful to at least have a warning logged, since this problem is relatively easy to reproduce and was spotted in the wild.
One possible solution is to use something like a sentinel file to indicate that a dump destination is in use. We could just generate something like original_filename_lock.jfr in the dump directory. This file could exist as long as the recording is alive. Proactively preventing new recordings from using the same filename. If another JFR process wants to use a filename that is already in use, it can proactively warn the user before starting the recording/dump so that the user can pick a different path.
It's possible for multiple processes to race writing chunks to the same snapshot file path. This problem was repeatedly encountered by an engineer running various tests with JFR enabled.
Chunks in such snapshots may be out of chronological order due to the race.
In our case, this is what was happening:
- ProcessA is the process we expect to create the snapshot and ProcessB is unexpectedly racing with JFR in ProcessA.
- While ProcessA is recording, ProcessB starts and stops a very short JFR recording and writes the chunks to the given snapshot file path.
- At the same time, ProcessA is wrapping up recording.
- After ProcessA is finished recording, it writes all the chunks to the same filename. This data is appended to the file after the ProcessB's chunks.
- This snapshot is bad, but JMC nor the JFR consumer code will not indicate there is a problem
*Steps to Reproduce*
I was able to reproduce this scenario with a few simple steps.To reproduce, open 2 terminals. Run java -XX:+FlightRecorder -XX:StartFlightRecording=dumponexit=true,filename=rec.jfr SimpleBusyWait in terminal_A (https://gist.github.com/roberttoyonaga/44d24612b88d6844213cf1e5252c23b1). Wait a few seconds, then run it again in the second terminal_B. Use ctrl+c to stop terminal_B (this writes ProcessB's chunks to the file path). Then use ctrl+C on terminal_A to finish the rec.jfr shapshot.
If you inspect the chunk headers, you'll notice that the first chunk has startNanos after the second chunk.
You can also use this app (https://gist.github.com/roberttoyonaga/64e4f82da8e14ea7d75684e4cf6e2267) to reproduce the problem all in a single process.
This case is interesting because you get repeated chunks: chunk_2, chunk_1, chunk_2, chunk_3. Chunk_2's true position should be between chunk_1 and chunk_3, but we also see it appear at the very front, which is wrong.
*Potential Fix*
It could be argued that this problem is due to user error. But I think it would be helpful to at least have a warning logged, since this problem is relatively easy to reproduce and was spotted in the wild.
One possible solution is to use something like a sentinel file to indicate that a dump destination is in use. We could just generate something like original_filename_lock.jfr in the dump directory. This file could exist as long as the recording is alive. Proactively preventing new recordings from using the same filename. If another JFR process wants to use a filename that is already in use, it can proactively warn the user before starting the recording/dump so that the user can pick a different path.