I looked into why almost all mark cycles see non-zero "mark completions". In other words, we almost always have some amount of mark work left to handle in the mark end pause. It turns out that worker threads don't flush their mark stacks in ZMarkConcurrentRootsTask::work(), which means they can hide work (in their thread-local mark stacks) until those stacks are finally flushed out in ZMark::try_end(). The reason work can be hidden is that the set of worker threads executing ZMarkConcurrentRootsTask is not necessarily the same set of worker threads executing ZMarkTask. As a result, the mark end pause often becomes longer than it otherwise would have been.
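To illustrate the effect, here is a toy model of the problem. It is a sketch only, not the actual ZGC code: Worker, Marker, scan_roots, concurrent_mark, mark_end_leftover and simulate_cycle are all hypothetical names, and the real mark stacks are considerably more involved. One worker scans roots, a different worker does the concurrent marking, and whatever is still sitting in a thread-local stack ends up being handled in the mark end pause.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only; none of these types are the real ZGC classes.
struct Worker {
    std::vector<int> local_stack;  // stand-in for a thread-local mark stack
};

struct Marker {
    std::vector<int> global_stack;  // work visible to every worker

    // Publish a worker's local entries so that other workers can see them.
    void flush(Worker& w) {
        global_stack.insert(global_stack.end(),
                            w.local_stack.begin(), w.local_stack.end());
        w.local_stack.clear();
    }
};

// Roots phase: the worker pushes `found` entries onto its local stack and,
// if requested, flushes them before leaving the task.
void scan_roots(Marker& m, Worker& w, int found, bool flush_at_end) {
    assert(found >= 0);
    for (int i = 0; i < found; ++i) {
        w.local_stack.push_back(i);
    }
    if (flush_at_end) {
        m.flush(w);
    }
}

// Concurrent mark phase: the active worker can only drain its own local
// stack and the global stack, not the local stacks of other workers.
std::size_t concurrent_mark(Marker& m, Worker& active) {
    std::size_t processed = active.local_stack.size() + m.global_stack.size();
    active.local_stack.clear();
    m.global_stack.clear();
    return processed;
}

// Mark end: flush every worker and report how much work was left over.
std::size_t mark_end_leftover(Marker& m, std::vector<Worker>& workers) {
    for (Worker& w : workers) {
        m.flush(w);
    }
    std::size_t leftover = m.global_stack.size();
    m.global_stack.clear();
    return leftover;
}

// One simulated cycle: worker 1 scans roots, worker 0 does the marking.
std::size_t simulate_cycle(bool flush_after_roots) {
    Marker m;
    std::vector<Worker> workers(2);
    scan_roots(m, workers[1], 5, flush_after_roots);
    concurrent_mark(m, workers[0]);
    return mark_end_leftover(m, workers);
}
```

In this model, simulate_cycle(false) returns 5, because the entries stay hidden in worker 1's local stack until mark end, while simulate_cycle(true) returns 0, because the flush at the end of the roots phase makes them available to the concurrent marking worker.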
After fixing this, I ran some tests with DaCapo, which show the following improvement:
Before: Mark End Pause (avg/max): 0.391 / 1.142 ms
After: Mark End Pause (avg/max): 0.130 / 0.294 ms