Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8017163

G1: Refactor remembered sets



    • Enhancement
    • Resolution: Fixed
    • P2
    • 18
    • hs24, hs25
    • hotspot
    • gc
    • b03


      The current G1 remembered set implementation has been designed for use cases and Java heaps and applications from 20 years ago.

      Over time many problems with performance and in particular memory usage have been observed:

      * adding elements to the lowest tier data structure takes a per-remembered set global lock. Measurements have shown that the applications can wait thousands of seconds acquiring these locks. While the affected threads are in most cases refinement threads so does not directly affect the application, it can still affect the ability of G1 to meet some goals needed for keeping pause times (i.e. amount of cards from the refinement buffers to be merged into the card table and then scanned during gc).

      * there is a substantial memory overhead for managing the data structures: examples are
          * using separate (hash) tables for the three different types of card containers
          * there is significant unnecessary preallocation of memory for some of the card set containers
          * Containers store redundant information

      * inflexibility when reusing memory: in the current implementation the different containers use different approaches to manage memory. Most use the C heap directly, some the C heap with some internal global memory pool. This in practice makes it very difficult to implement anything other than giving back memory in the collection pause. The corresponding "Free Collection Set" pause can take a significant amount of time because of that.
      Also memory reuse is limited and preallocating arenas is limited (or would have to be reimplemented multiple times), stressing the C heap allocator.

      * inability to support additional use cases: over time interesting ideas (e.g. JDK-8058803) came up for improving performance of remembered set management. Mostly due to redundant information everywhere and completely different handling of various aspects in the containers it is in practice impossible to implement these.

      * (partial) inability to give back memory to the OS. While some of the containers use the C heap allocator, and so in some way give back memory, these implementations and handling is different for every container.

      * the existing granularity of containers are unbalanced: currently there exist three tiers: "sparse", "fine" and "full". Sparse is an array of cards ranging in the hundreds maybe, "fine" is a bitmap covering a whole region and full is a bit indicating that that region should be scanned completely during GC.

      The problems are that there is nothing between "no card at all" and "sparse" and in particular the difference between the capability to hold entries of "sparse" and "fine". I.e. memory usage difference when exceeding a "sparse" array (holding 128 entries at 32M regions, taking ~256 bytes) to fine that is able to hold 65k entries using 8kB is significant.
      For these reason there is even a dedicated option to stop allocating more "fine" containers and just give up and use "full" instead to avoid excessive memory usage. With extremely bad consequences in pause times.

      Over time some of these issues have been fixed or in many cases band-aided, and some of these fixes and ideas were the result of working on this change (e.g. JDK-8262185, JDK-8233919, JDK-8213108).

      This change is effectively a rewrite of the Java heap card based part of a region's remembered set.

      This initial fully working change can be roughly described with the following properties:

      * use a single ConcurrentHashTable for the card containers of a given region. The container in use replaced (coarsened) on the fly within the CHT node, completely lock-free. This implements JDK-6949259.

      * memory for a given region's remembered set for all containers (and the CHT nodes) is backed by per container type and per remembered set arena style bump-pointer allocation buffers. In this change, in the pause, memory is given back to free lists only. The implementation gives back memory to the OS concurrently to the application. Memory is still managed using the C heap memory manager though, but abstracted away and could be replaced by manual page memory management.

      * there are now four different container types and one meta-container type. These four actual containers are:
        * inline pointer: the change store a few (3-5) cards in the CHT node directly and uses no extra memory.
        * array of cards: similar to the "sparse" container, an array of cards with a configurable amount of entries. However bulk allocation of memory is now managed at a lower level so there is much less waste.
        * bitmap: similar to "fine", a bitmap spanning a (sub-)range of memory
        * full: same as full, indicating for a (sub-)range of memory that all cards are to be looked at during scan. Similar to inline pointers, this uses no extra memory.
        * howl: the Howl container subdivides a given memory range into subranges where any of the other containers describing that sub-range of the heap may be stored in. This is somewhat similar to the idea suggested in JDK-8048504.

      * care has been taken to minimize container memory usage, e.g. by not adding redundant information there and in general carefully specify them. They have been designed with future enhancements in mind.

      In some benchmarks (where there is significant remembered set memory usage) we are seeing memory reduction to 25% of JDK 16 levels with this change. Garbage collection times are at most as long or shorter than before; most changes affecting that have been extracted earlier. Individiual affected phases are generally shorter.


        Issue Links



              tschatzl Thomas Schatzl
              brutisso Bengt Rutisson (Inactive)
              0 Vote for this issue
              9 Start watching this issue