Shenandoah: Parallel worker use in parallel_heap_region_iterate

Details

    • gc
    • b09

    Description

      Shenandoah init mark is supposed to be very fast, on the order of a few hundred microseconds. We do most of the work right in the VM thread that executes the safepoint. Yet we have a block here that involves workers:
      https://github.com/openjdk/jdk/blob/d41d2a7a82cb6eff17396717e2e14139ad8179ba/src/hotspot/share/gc/shenandoah/shenandoahConcurrentGC.cpp#L555-L559

      The walk goes parallel at a threshold of 1024 regions (see ShenandoahParallelRegionStride), which is below the usual Shenandoah target of 2048 regions. This means we are likely always taking that path.

      This might cause trouble if the number of parallel GC workers is high: we wake up lots of GC threads without having most of them do any useful work:

      [info ][gc,start ] GC(163) Pause Init Mark (unload classes)
      [info ][gc,task ] GC(163) Using 16 of 16 workers for init marking
      [info ][gc ] GC(163) Pause Init Mark (unload classes) 0.116ms
      [info ][safepoint ] Safepoint "ShenandoahInitMark", Time since last: 10717617218 ns, Reaching safepoint: 157434 ns, Cleanup: 27282 ns, At safepoint: 202251 ns, Total: 386967 ns

      We need to see: a) whether this is actually a problem; b) whether the default ShenandoahParallelRegionStride is too low; c) whether we should limit the number of active workers around that block to `num_regions() / stride`; d) whether we should just ditch this code and always do a single-threaded walk.
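
      Option c) can be sketched as below. This is a standalone illustration, not HotSpot code: `capped_workers` and its parameters are hypothetical names, and it rounds the stride count up (rather than using plain integer division) so that a partial stride still gets one worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of HotSpot): cap the number of workers used
// for a strided region walk so that every woken worker can claim at least
// one stride of regions.
inline std::size_t capped_workers(std::size_t num_regions,
                                  std::size_t stride,
                                  std::size_t max_workers) {
  assert(stride > 0 && max_workers > 0);
  // Number of strides, rounded up so a partial stride still counts as work.
  std::size_t strides = (num_regions + stride - 1) / stride;
  // Use at least one worker and never more than the configured maximum.
  return std::min(max_workers, std::max<std::size_t>(strides, 1));
}
```

      With 2048 regions, a stride of 1024, and 16 available workers, this would wake only 2 workers instead of all 16.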


      This is not limited to init mark: parallel_heap_region_iterate is used by 4 other GC phases to apply lightweight operations to heap regions. If possible and needed, we should optimize parallel_heap_region_iterate itself, which would benefit all 5 places that use it to walk heap regions and apply operations to them.


      Assume the overhead of orchestrating worker threads for the parallel iteration is `n`, and the cost of processing 1024 heap regions is `m` (assuming the total cost is linear in a single thread). We could measure `n` and `m`, then calculate the threshold: below the threshold simply use a single thread, otherwise use the parallel walk. The threshold should be roughly `(n/m + 1) * 1024`.
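
      A minimal sketch of that decision, assuming `n` and `m` have already been measured in the same time unit. The function names are hypothetical and the formula is taken as stated above, not re-derived; any concrete values fed in are placeholders until real measurements exist.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of HotSpot).
// n: measured overhead of orchestrating workers for one parallel walk.
// m: measured single-threaded cost of processing 1024 regions.
// Both must be in the same time unit. Returns the region-count threshold
// from the rough formula (n/m + 1) * 1024.
inline std::size_t parallel_threshold(double n, double m) {
  assert(n >= 0.0 && m > 0.0);
  return static_cast<std::size_t>((n / m + 1.0) * 1024.0);
}

// Below the threshold use a single thread, otherwise use the parallel walk.
inline bool use_parallel_walk(std::size_t num_regions, double n, double m) {
  return num_regions >= parallel_threshold(n, m);
}
```

      For example, if orchestration turned out to cost three times as much as walking 1024 regions (n/m = 3), the threshold would be 4096 regions, and a 2048-region heap would stay on the single-threaded path.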

            People

              xpeng Xiaolong Peng
              shade Aleksey Shipilev